This is an update on our progress towards our goals over the last ten months. If you can only read 650 characters of this update, like the judges in our experiments, here’s what you need to know:
We believe that the most important questions facing humanity are complex and open-ended. These questions range from “What types of policies will effectively curb climate change?” to “How should we deal with the potentially transformative impacts of AI?” and “What career should I pursue?” Despite their importance, such open-ended questions are answered poorly today.
As AI plays a bigger role in society, the world will likely get more complex. It will become even more important to give good answers to questions like these. Unfortunately, AI is not on track to help substantially with answering these open-ended questions. So far, we only know how to use AI to help us with tasks that have clear metrics or fast empirical feedback loops.
Our mission is to make AI just as useful for open-ended questions. Figuring out how to direct the most powerful technologies of our time to the most important questions society wrestles with is a highly leveraged way to have a large, positive impact. Rather than directly tackling climate change, or poverty, or animal suffering, we’re improving the process by which decisions on all of these issues get made.
To apply AI to questions like these, we design, test, and implement mechanisms for delegating open-ended cognitive work to experts who are only trying to optimize clear feedback signals. Our work today involves running experiments with human participants, building web apps to gather data from and structure the experiments, and connecting what we learn from human experiments to ML training. Over time, we'll incrementally automate the work of our human participants and build a platform that deploys ML to answer open-ended questions.
We ended our last update with the following goals for the first half of 2019:
We’ve done 1 and 3, parts of 2, and most of 4. We’re still hiring for an engineering team lead!
As of our last update, we were running factored generation experiments. In these experiments, participants break down a complex task into easier tasks, delegate the easier tasks, and use the solutions to these tasks to complete the larger task.
For example, a participant in a factored generation experiment might get the question “What are all of the nouns in the sentence below?” and they would have to return a list of nouns.
Since March of this year, we've switched to running factored evaluation experiments, another instance of factored cognition. Rather than breaking down the original task to complete it, we break down the evaluation of solutions to the task.
The factored evaluation version of the question above looks like: “Is ‘dog’ or ‘cat’ a better answer to the question ‘What are all of the nouns in the sentence below?’” and the participant chooses one of the answers.
There are a few reasons why we concluded that factored evaluation is a better research direction for now:
Deployed systems will likely use both factored generation and evaluation, or different mechanisms altogether that combine the best of both worlds, so we expect to test a combination of both in the future.
Our factored evaluation experiments study how a weak human judge can supervise experts who know more and have more time to think. We’re starting with reading comprehension questions where we create an artificial gap between a judge (who doesn’t get to read a text) and experts (who do).
Skipping over a few details, our experiments have the following structure:
Whenever the judge asks the experts a question, this generates a new experiment: Now a different judge must decide which of two expert answers to that question is honest and which is malicious, using the same recursive process. For this to terminate, eventually a judge must choose an answer without asking any subquestions.
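The recursive structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the mechanism as implemented in our experiments: `evaluate`, `ask_subquestions`, `expert_answers`, `decide`, and the `max_depth` budget are hypothetical names, and the toy judge in `demo` is deliberately simplistic.

```python
from typing import Callable, List, Tuple

# All names below are hypothetical and exist only for illustration.
Answers = Tuple[str, str]          # two expert answers, honest/malicious order unknown
Evidence = List[Tuple[str, str]]   # (subquestion, winning sub-answer) pairs


def evaluate(
    question: str,
    answers: Answers,
    ask_subquestions: Callable[[str, Answers], List[str]],
    expert_answers: Callable[[str], Answers],
    decide: Callable[[str, Answers, Evidence], int],
    depth: int = 0,
    max_depth: int = 2,
) -> int:
    """Return the index (0 or 1) of the answer the judge selects."""
    evidence: Evidence = []
    # At max_depth the judge must choose without asking any subquestions,
    # which guarantees that the recursion terminates.
    if depth < max_depth:
        for sub in ask_subquestions(question, answers):
            sub_answers = expert_answers(sub)
            # Each subquestion is a new experiment, judged the same way.
            winner = evaluate(sub, sub_answers, ask_subquestions,
                              expert_answers, decide, depth + 1, max_depth)
            evidence.append((sub, sub_answers[winner]))
    return decide(question, answers, evidence)


def demo() -> int:
    """Toy run: one subquestion at the root, none below it."""
    def ask_subquestions(q: str, answers: Answers) -> List[str]:
        return ["Which animal does the text mention?"] if q == "root" else []

    def expert_answers(sub: str) -> Answers:
        return ("dog", "cat")

    def decide(q: str, answers: Answers, evidence: Evidence) -> int:
        if not evidence:
            return 0  # arbitrary toy rule for a judge with no evidence
        _, sub_winner = evidence[0]
        # Pick the root answer that agrees with the winning sub-answer.
        return 0 if sub_winner in answers[0] else 1

    return evaluate("root", ("The pet is a cat", "The pet is a dog"),
                    ask_subquestions, expert_answers, decide)
```

The depth budget is one simple way to force termination. In a real experiment, the judge rather than a fixed limit would decide when to stop asking subquestions.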
Ultimately, we want to show that this approach to evaluating experts’ answers allows the judge to identify the correct answer even when they have limited access to information. To do this, we have to show that the judge selects the honest answer consistently, even when the malicious expert is very strong or has a lot of advantages.
We haven’t yet shown that factored evaluation works reliably in this reading comprehension context. Over the past few months of testing factored evaluation, we’ve learned that:
Our factored evaluation experiments were originally structured like a game. Our initial hypothesis was that if we could establish the rules of a game and provide participants with feedback, then strategies for what questions to ask in what context would emerge organically as more people tried to win. While we’ve seen some interesting norms and tactics emerge, we’ve concluded that this is not the most effective way to develop strategies because:
Many of the decomposition strategies that participants tried were overfit to the specific rules and the Pitchfork reading comprehension context (understandably so). But Ought cares about strategies that show promise of generalizing to multiple contexts, especially to evaluating answers to open-ended questions.
It's difficult to come up with good strategies, for both participants and Ought employees. We’ve updated towards thinking that a few people need to spend concentrated time working on this, not intermittent time here and there. Additionally, generating promising strategies benefits from a background in probability and logic.
Some of the strategies we've seen so far (designed for internal consumption):
Idea: Split each of the expert answers into distinct claims, have the experts provide evidence for each claim and arguments about the relative goodness of evidence.
Malicious Credences (Liam Donovan)
Idea: Ask the experts to state probabilities on which answer is better, and which aspects of each answer are likely to be misleading; ask them to provide evidence that will most help the judge update their beliefs.
What procedure should I use? (Paul Christiano, William Saunders)
Idea: Always ask the experts "What procedure should I use to decide which of the two answers is better?" and then implement that procedure.
Idea: Ask the experts to produce a complete hierarchical argument in a standardized format for each of the answers upfront, then check the two arguments.
While none of these strategies seem sufficient on their own, we think that they all contain seeds for promising future strategies.
No matter how skilled or well-trained a participant is, they will still make mistakes. They might sometimes misinterpret the question or the text and accidentally return or select an incorrect answer to the question they’re working on. If we want to eventually deploy our question-answering systems in high-stakes situations, we can’t afford these mistakes. We’ve realized the need for building systems and processes that aggregate answers across participants to ensure that the malicious expert never wins simply because someone made an avoidable mistake. As a first step, we're looking into ensembling: we give the same subtask to multiple people and then use a voting process to determine which of their answers to go with.
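As a minimal sketch of the ensembling idea, the function below applies a simple majority vote to answers collected from several participants for the same subtask. The voting rule, the function name `majority_vote`, and the `min_share` threshold are assumptions made for illustration; the actual voting process may differ.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(answers: Sequence[Hashable],
                  min_share: float = 0.5) -> Optional[Hashable]:
    """Return the answer given by more than `min_share` of participants.

    Returns None when no answer clears the threshold, signalling that the
    subtask should be escalated or reassigned (a hypothetical policy).
    """
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) > min_share else None
```

For example, `majority_vote(["A", "A", "B"])` returns `"A"`, while a tie such as `majority_vote(["A", "B"])` returns `None` because neither answer has a strict majority.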
By default, trees don't just get excessively large in factored generation, but in factored evaluation as well. Here is an extreme example of a tree that went on for over two months without completing, i.e., without an answer being selected at the root level. In this tree, 42 subquestions were asked, with 195 assignments worked on by 22 different participants.
Trees need to finish for us to validate whether a strategy reliably leads to the right answer being selected. To resolve this issue, we rolled out a version of the game we call Most Interesting Branch. In Most Interesting Branch, the honest and malicious expert agree to explore just one path down the question-answer tree, and to yield to the other’s answer for the rest of the tree.
Over the next few months, we’ll focus on developing strategies in-house and testing factored evaluation more modularly. Instead of trying to get participants to come up with promising strategies by playing and improving at a game, Ought employees will devise strategies that we think should consistently select the right answers and generalize beyond the reading comprehension context. We’ll then test strategy execution in a more incremental fashion, starting with single-layer trees and producing robust guarantees at each step along the way.
We value getting feedback on different approaches to experimentation. We’ve assembled an experiment review board of 10 academics, including professors at Stanford, UCSD, Berkeley, Harvard, ANU, and Wharton. We trust their judgement on experimentation and think that such a board will help us run experiments more rigorously as well as broaden the reach of our research. If you have thoughts on how we can run better experiments, reach out to us at [email protected].
Today, machine learning systems are not advanced enough to do open-ended reasoning, so we're primarily running experiments with human participants. Longer-term, we’ll automate the work of participants in the experiments described above, such that the decompositions, expert answers, and answer evaluations are all produced by machine learning systems.
To ensure that our research with human participants doesn’t deviate too far from what is needed in the future to work with ML, and to better estimate when all of this work can be automated, we ran the following projects:
First, we took the Complex Web Questions dataset, which contains questions like this:
We built an end-to-end system using GPT-2 that breaks the questions into subquestions, queries Google to answer each of the subquestions, and aggregates the answers back together to answer the original question. Currently, our system answers about 30% of the questions in Complex Web Questions correctly.
We also started compiling our own dataset of numerical estimation questions, such as:
We learned that this dataset needs to be highly structured for GPT-2 to learn how to break down the initial question into subquestions based on human demonstrations. Currently, our data format looks like this:
| Question | How many cells are in an adult Paedophryne amauensis frog? |
| Formalization | number | cells in an adult Paedophryne amauensis frog |
| A1 | volume | adult Paedophryne amauensis frog |
| A2 | volume | cell in adult Paedophryne amauensis frog |
| Aggregation | A1 / A2 |
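To make the format above concrete, here is one way it could be represented and evaluated in code. This is a sketch under assumptions: the class names, field names, and the idea of treating the aggregation row as an arithmetic expression over subquestion labels are ours for illustration, and the numbers used in the usage note are placeholders, not real measurements.

```python
import ast
import operator
from dataclasses import dataclass
from typing import Dict


@dataclass
class Quantity:
    kind: str     # e.g. "number" or "volume"
    subject: str  # what the quantity is measured on


@dataclass
class Decomposition:
    question: str
    formalization: Quantity
    subquestions: Dict[str, Quantity]  # label (e.g. "A1") -> quantity
    aggregation: str                   # expression over the labels


example = Decomposition(
    question="How many cells are in an adult Paedophryne amauensis frog?",
    formalization=Quantity("number", "cells in an adult Paedophryne amauensis frog"),
    subquestions={
        "A1": Quantity("volume", "adult Paedophryne amauensis frog"),
        "A2": Quantity("volume", "cell in adult Paedophryne amauensis frog"),
    },
    aggregation="A1 / A2",
)

# Only basic arithmetic is allowed, so arbitrary code is never executed.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def aggregate(expression: str, values: Dict[str, float]) -> float:
    """Evaluate an aggregation expression given answers to the subquestions."""
    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Name):
            return values[node.id]
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return float(node.value)
        raise ValueError(f"unsupported expression element: {ast.dump(node)}")

    return walk(ast.parse(expression, mode="eval").body)
```

With placeholder values, `aggregate(example.aggregation, {"A1": 8.0, "A2": 2.0})` divides the first volume by the second and returns `4.0`.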
On this dataset's validation set, our current ML predictions match human decomposition steps 15% of the time.
For both numerical estimation and Complex Web Questions, we view these results as initial (weak) evidence that fine-tuning general-purpose language models on decomposition tasks might be promising. To better understand how true this is, in future work we'd like to study cases where getting decompositions right requires world knowledge that the model has learned in its unsupervised pretraining phase.
Over time, we'd like to estimate quantitatively how much data we need to automate the work our participants do. As a first step in that direction, we explored the effects studied in Hestness et al. (2017), “Deep learning scaling is predictable, empirically”. Hestness et al. showed that, across a number of domains including image classification, language modeling, and speech recognition, there is a power-law relationship between dataset size and validation loss. That is, to halve validation loss you need some constant factor k times more data, where k depends on the task.
We replicated their results using transformer models on small decomposition tasks (Complex Web Questions, numerical estimation, math word problems). Calculating k for numerical estimation tasks of different kinds based on small-scale initial data collection helped us converge on the structured data format above. If you're training language models and are deciding what kind of data to collect, you might want to run a similar exercise to estimate ahead of time how much data you'd need to achieve a particular validation loss.
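For readers who want to try a similar exercise, the sketch below fits a pure power law, loss ≈ a · size^(−b), by least squares in log-log space and derives the factor k from the fitted exponent. It is a simplified illustration, not the code we used: the function names are hypothetical, and the model ignores any irreducible-loss term, so it is only a rough guide when the loss curve is far from flattening out.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(sizes: Sequence[float],
                  losses: Sequence[float]) -> Tuple[float, float]:
    """Fit loss ~= a * size**(-b) and return (a, b).

    Taking logs gives log(loss) = log(a) - b * log(size), a straight line,
    so ordinary least squares on the logged values recovers a and b.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def data_multiplier_to_halve_loss(b: float) -> float:
    """k such that k times more data halves the loss: k**(-b) = 1/2."""
    return 2.0 ** (1.0 / b)


def size_for_target_loss(a: float, b: float, target_loss: float) -> float:
    """Dataset size at which the fitted curve reaches target_loss."""
    return (a / target_loss) ** (1.0 / b)
```

Given validation losses measured at a few small dataset sizes, `fit_power_law` yields the exponent `b`; `data_multiplier_to_halve_loss(b)` then gives k, and `size_for_target_loss` extrapolates to the amount of data a target loss would require under this fit.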
Ought’s engineering team owns Mosaic, the web app we use to gather data from our experiment participants. Mosaic was initially built around the factored generation experiments we mentioned earlier, which makes it suboptimal for our current experiments. We’re excited about running many different types of question-answering experiments in the future, so we’ve started working on Mosaic2.
Mosaic2 is a more flexible web app that simplifies setting up varied experiment mechanisms. In Mosaic2, teams can specify the types of interactions they want to have with experiment participants. Without building separate apps, they can easily run factored evaluation, factored generation, or debate experiments.
Mosaic2 is still under development, but we’re excited to launch it soon! If the idea of building an app that structures and aggregates the thinking of a large crowd of participants excites you, check out this opportunity to lead the team building it.
The following contractors and collaborators also contributed to our work:
Special thanks to:
Since December 2018, we’ve received generous donations from the following people and institutions:
If you’d like to help with our work, you can:
For more updates like this one, sign up for our newsletter.