A set of tasks for evaluating scalable problem solving
How can we decide which mechanisms for factored cognition to implement, and how can we evaluate them once implemented? Given that our goal is to find mechanisms that are plausibly scalable, we now outline a set of tasks that would provide evidence to that effect if a system solves them using only short-term human work.
Aims for tasks
These tasks should have the following properties:
1. Each task exercises one or more essential capabilities of the system. These include:
    a. Reasoning with external knowledge that is too large for individual instances of H to handle
    b. Reasoning incrementally and explicitly in cases that humans would approach holistically and implicitly by default
    c. Working with concepts that no individual H can learn given the available time
    d. Interacting with the external world, both when it is treated as stateful (e.g., generating dialog) and when it is not (e.g., running computations)
    e. Reasoning not just about object-level ideas, but also about which cognitive strategies to employ
2. If the system succeeds at all tasks, it likely has sufficient capability to solve much more complex tasks given a sufficiently large budget for calls to H. I don't think we can hope to be confident in universality, even if the system solves all the tasks we come up with: new failure modes, such as security failures of H, may show up with very large budgets.
3. The tasks are not unnecessarily complex. That is, if there is a task that tests the same properties while requiring fewer calls to H, we would rather pick that task.
4. The tasks are sufficiently rich that we can observe interesting changes in the quality of solutions as we increase the budget the system has to work with.
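To make the notion of a "budget for calls to H" concrete, here is a minimal sketch in Python. All names (`solve`, `toy_ask_h`, `Budget`) are hypothetical illustrations, not part of any real system: each call to H sees only a short, self-contained question and returns either a direct answer or a list of subquestions, and the budget caps the total number of H-calls.

```python
from dataclasses import dataclass

@dataclass
class Budget:
    """A hard cap on the total number of calls to H."""
    remaining: int

    def spend(self) -> bool:
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True

def solve(question: str, ask_h, budget: Budget) -> str:
    """Recursively decompose `question` until H can answer a piece directly."""
    if not budget.spend():
        return "(budget exhausted)"
    answer, subquestions = ask_h(question)
    if answer is not None:
        return answer
    parts = [solve(sq, ask_h, budget) for sq in subquestions]
    # One more H-call to combine the sub-answers into a final answer.
    if not budget.spend():
        return "(budget exhausted)"
    combined, _ = ask_h("combine:" + ",".join(parts))
    return combined

def toy_ask_h(question):
    """Stand-in for a human worker: answers leaves, otherwise decomposes."""
    if question.startswith("leaf:"):
        return question.removeprefix("leaf:"), None
    if question.startswith("combine:"):
        nums = [int(s) for s in question.removeprefix("combine:").split(",")]
        return str(sum(nums)), None
    return None, ["leaf:1", "leaf:2", "leaf:3"]
```

With a generous budget, `solve("root", toy_ask_h, Budget(10))` completes; with `Budget(2)` it runs out mid-decomposition. Property 4 above is what makes this interesting: on rich tasks, raising the budget should visibly change solution quality rather than just success/failure.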
Here are some tasks I'm currently considering. Each task exercises all of the capabilities 1a-1e to some extent; to illustrate some of the motivation, I highlight for each one the single capability it especially exercises:
- Answering questions about books: "Why didn't Harry kiss Sally?"
  Especially exercises: reasoning with large external knowledge bases
- Fact checking: "Is it true that [proposition]?"
  Especially exercises: reasoning incrementally and explicitly
- Early math textbook exercises: "Show that there are no wffs of length 6."
  Especially exercises: working with concepts that no individual H can learn
- Cost-benefit analysis: "Which airbnb should I stay at?"
  Especially exercises: interaction with a stateful world (the asker)
- Prioritizing todos: "What should I work on next?"
  Especially exercises: reasoning about cognitive strategies
Of these, math textbook exercises are probably by far the hardest.
The last two tasks require an approach to interaction, so that our system can ask a limited number of follow-up questions to elicit the personal information needed to produce a good solution.
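The follow-up constraint can be sketched as a loop with a cap on clarifying questions. Everything here (`answer_with_followups`, `propose`, `ask_user`) is a hypothetical illustration: `propose` stands in for the factored system deciding, given the facts gathered so far, whether to ask the asker one more question or to commit to an answer.

```python
def answer_with_followups(task, ask_user, propose, max_followups=3):
    """Answer `task`, asking the asker at most `max_followups` questions.

    `propose(task, facts)` returns either ("ask", question) or ("answer", text).
    """
    facts = {}
    for _ in range(max_followups):
        kind, payload = propose(task, facts)
        if kind == "answer":
            return payload
        # Spend one follow-up question on the asker and record the reply.
        facts[payload] = ask_user(payload)
    # Follow-up budget exhausted: commit with whatever facts we have.
    kind, payload = propose(task, facts)
    return payload if kind == "answer" else "(best guess with available info)"

def toy_propose(task, facts):
    """Toy policy for the airbnb example: ask for a budget, then answer."""
    question = "What is your budget per night?"
    if question not in facts:
        return "ask", question
    return "answer", f"Pick the listing under {facts[question]}"
```

For example, with an asker who replies "$80", `answer_with_followups("Which airbnb should I stay at?", lambda q: "$80", toy_propose)` asks one question and then commits to an answer that uses the elicited budget.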