Ought is an applied machine learning lab. We’re building Elicit, the AI research assistant. Our mission is to automate and scale open-ended reasoning. To get there, we train language models by supervising reasoning processes, not outcomes. This is better for reasoning capabilities in the short run and better for alignment in the long run.
In this post, we review the progress we’ve made over the last year and lay out our plan.
Progress in 2021:
Roadmap for 2022+:
Our mission is to automate and scale open-ended reasoning. If we can improve the world’s ability to reason, we’ll unlock positive impact across many domains including AI governance & alignment, psychological well-being, economic development, and climate change.
As AI advances, the raw cognitive capabilities of the world will increase. The goal of our work is to channel this growth toward good reasoning. We want AI to be more helpful for qualitative research, long-term forecasting, planning, and decision-making than for persuasion, keeping people engaged, and military robotics.
Good reasoning is as much about process as it is about outcomes. In fact, outcomes are unavailable if we’re reasoning about the long term. So rather than training machine learning models end-to-end on outcome data, we’re generally building Elicit compositionally, taking inspiration from human processes.
In the short term, supervising process is necessary for AI to help with tasks where it’s difficult to evaluate the work from results alone. In the long term, process-based systems can avoid alignment risks introduced by end-to-end training.
Success for us looks like this:
Because we’re betting on process-based architectures, these two success criteria are fundamentally intertwined.
We’ve decided to start by supporting researchers for the following reasons:
We’re studying researchers and how they discover, evaluate, and generate knowledge. Within research, we chose an initial workflow (literature review, mostly for empirical research) and will expand to other workflows and question types. Eventually, we’ll surface the building blocks of many cognitive tasks so that users can automate their own reasoning processes.
Today, Elicit uses language models to automate parts of literature review, helping people answer questions with academic literature. Researchers use Elicit to find papers, ask questions about them, and summarize their findings.
We started with the literature review workflow for a few reasons:
The literature review workflow in Elicit composes about 10 subtasks, including:
Outside of the literature review workflow, versions of some of these subtasks also exist independently on Elicit and researchers find them useful.
Elicit is still early. We’ve spent about seven months building the literature workflow. Its impact on helping the world reason better, and on demonstrating a process-based ML architecture, is understandably small. Nevertheless, we’re excited about the reception so far and the potential to significantly scale its impact over the coming years.
Over 1,500 people use Elicit each month. Over 150 people use Elicit on more than 5 days each month (about once a week). 60% of users in a given month are returning users: people who used Elicit in a previous month and found it worth using again. In our February feedback survey, 45% of respondents said they would be “very disappointed” if Elicit went away. (Tech companies try to get this to 40%.) Elicit has been growing by word of mouth, and we expect to continue growing organically while we focus on making Elicit useful.
Today’s users primarily use Elicit to find papers and define research questions at the start of their research projects. 40% of respondents to our February feedback survey shared that they most want Elicit to help them with these tasks, and that Elicit is more useful for these tasks (7.8 and 7.1 out of 10) than for the others we asked about.
Elicit users also want help understanding paper contents and conducting systematic reviews, but Elicit was less helpful there at the time. (Understanding paper content is now a Q2 priority.)
Some of our most engaged researchers report using Elicit to find initial leads for papers, answer questions, and get perfect scores on exams (via Elicit Slack). One researcher used a combination of Elicit literature review, rephrase, and summarization tasks to compile a literature review for publication. Our Twitter page shows more examples of researcher feedback and how people are using Elicit.
At least 8% of users are explicitly affiliated with rationality or effective altruism, based on how they heard about Elicit or where they work. We also worked closely with CSET, whose researchers cited Elicit in three publications (Harnessed Lightning, Wisdom of the Crowd as Arbiter of Expert Disagreement, Classifying AI Systems).
In sum, people are using Elicit regularly and recommending it to others. We take this as a sign that Elicit is creating value. We’re excited for the day when we can make stronger claims about the impact Elicit is having on people’s reasoning. We plan to experiment with different evaluations of Elicit’s impact. Some ideas we’ve had in this direction:
Because Elicit is a process-based architecture, we need to get good at running complex task pipelines and at making sure the individual tasks within the pipelines are reliable. We’ve made progress on both fronts over the past year.
We’ve built a task graph execution framework for efficiently running compositions of language model tasks. The framework is used to run literature review tasks and is likely one of the most compositional uses of language models in the world. Elicit engineers only need to specify how tasks depend on other tasks (e.g. claim extraction depends on ranking), and the scheduling and execution across compute nodes happen automatically.
The execution engine runs the graph of tasks in parallel as efficiently as allowed by the dependency structure of the workflow graph. While running, the executor streams back partial results to the Elicit frontend. Because language models are relatively slow (more than one second per query for the largest models), parallelism and sending partial results both matter for a good user experience.
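The scheduling idea described above can be illustrated with a minimal sketch. This is not Elicit's actual framework: the names `run_graph`, `on_partial`, and the stub tasks are hypothetical, the "language model calls" are placeholders that return strings, and everything runs in one process rather than across compute nodes. Each task starts as soon as its own dependencies finish, and each result is reported the moment it is available.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# Hypothetical task signature: a task receives its dependencies' results by name.
TaskFn = Callable[[Dict[str, str]], Awaitable[str]]


async def run_graph(
    tasks: Dict[str, TaskFn],
    deps: Dict[str, List[str]],
    on_partial: Callable[[str, str], None],
) -> Dict[str, str]:
    """Run each task once its dependencies are done; report results as they land."""
    futures: Dict[str, asyncio.Task] = {}

    async def run(name: str) -> str:
        # Wait only on this task's own dependencies, so independent branches overlap.
        inputs = {d: await futures[d] for d in deps.get(name, [])}
        result = await tasks[name](inputs)
        on_partial(name, result)  # stream the partial result to the caller
        return result

    # All tasks are registered before any of them starts executing.
    for name in tasks:
        futures[name] = asyncio.create_task(run(name))
    return {name: await fut for name, fut in futures.items()}


# Stub tasks standing in for language model calls (assumptions, not real APIs).
async def search(inputs: Dict[str, str]) -> str:
    await asyncio.sleep(0.01)  # simulate a slow model or retrieval call
    return "papers"


async def rank(inputs: Dict[str, str]) -> str:
    return f"ranked({inputs['search']})"


async def extract_claims(inputs: Dict[str, str]) -> str:
    return f"claims({inputs['rank']})"


async def summarize(inputs: Dict[str, str]) -> str:
    return f"summary({inputs['rank']})"


def demo():
    order: List[str] = []
    tasks = {"extract_claims": extract_claims, "summarize": summarize,
             "rank": rank, "search": search}
    # Only the dependency structure is declared; execution order follows from it.
    deps = {"rank": ["search"], "extract_claims": ["rank"], "summarize": ["rank"]}
    results = asyncio.run(
        run_graph(tasks, deps, lambda name, result: order.append(name))
    )
    return results, order
```

In this sketch, `extract_claims` and `summarize` both wait on `rank` and can then proceed concurrently, while `on_partial` plays the role of streaming results to a frontend. The sketch assumes the graph is acyclic and that every dependency names a registered task; a cycle would cause the tasks involved to wait on each other forever.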
To get good overall answers, we also need individual primitive tasks to be robust. In a project in Q4 2021, we focused on generating one-sentence answers based on abstracts as a case study. When a researcher asks a question, Elicit finds relevant papers, reads the abstracts, then generates a one-sentence summary of the abstract that answers the researcher’s question. These summaries are often more relevant to the researcher’s specific question than any one of the sentences in the abstract.
With few-shot learning, we found that the claims were often irrelevant, hard to understand, and sometimes hallucinated, i.e. not supported by the abstract. This is a case of “capable but unaligned.” GPT-3 has the entire abstract, which contains all of the information it needs to generate a summary answer. We’re confident that GPT-3 is capable of generating such answers—it could even just pick the most relevant sentence and return it word for word. Nonetheless, it sometimes made things up.
As one of the first users of GPT-3 finetuning, we switched from few-shot learning to a finetuned claim generation model. This made the claims more relevant and easier to understand, but initially made hallucination worse. Through a sequence of finetuning runs on progressively higher-quality datasets, we reduced hallucination without making claims less relevant. We still haven’t fully solved this problem. We expect that our upcoming work on verifier models, decomposition, and human feedback will help.
This roadmap highlights the most important themes for Elicit over the next few years. A more fleshed-out roadmap is in this doc.
To date, we’ve focused on making Elicit useful for getting a broad overview of a research space, surveying many papers. Next, we will help researchers as they go deep into individual research papers and use those subtasks to support more complex reasoning.
Over the next few months, we’ll work on projects like:
As we help users with more complex reasoning, we’ll need to get better at automatic decomposition, aggregating the results of subtasks, and understanding what users are really looking for. This will make Elicit more useful for more complex research (differential capabilities) and shed light on the feasibility of process-based architectures (alignment).
Here are two examples of how Elicit might automatically decompose complex tasks:
Right now, Elicit works best for questions about empirical research. Those tend to be questions of the style “What are the effects of X on Y?”, including questions about randomized controlled trials in biomedicine, social science, or economics.
Starting in late 2022, we want to move beyond literature review for empirical questions and let users automate custom workflows, initially within research. Elicit will become a workspace where users can invoke and combine tasks like search, classification, clustering, and brainstorming over datasets of their choice, with different models and interfaces.
For example, researchers might want to search over their own corpus from a reference manager, extract all of the outstanding research directions from the papers they’ve curated, rephrase them as questions, then search those questions over academic databases to see if any of them have been worked on.
They might connect their personal notetaking apps, classify all of the notes about papers, then train a model to watch the literature and notify them if new papers addressing any of their cruxes are published.
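To make the idea of combining tasks concrete, here is a rough sketch of the reference-manager example as a chain of steps. Everything in it is hypothetical: `compose`, the step functions, and the "Future work:" marker are illustrative stand-ins, and the string operations take the place of the language model and search calls a real workflow would use.

```python
from typing import Callable, List, Set

# A step maps a list of text items to a new list of text items.
Step = Callable[[List[str]], List[str]]


def compose(*steps: Step) -> Step:
    """Chain steps so that each one consumes the previous step's output."""
    def workflow(items: List[str]) -> List[str]:
        for step in steps:
            items = step(items)
        return items
    return workflow


def extract_directions(papers: List[str]) -> List[str]:
    # Placeholder for a model that extracts open research directions.
    marker = "Future work: "
    return [p.split(marker, 1)[1] for p in papers if marker in p]


def rephrase_as_questions(directions: List[str]) -> List[str]:
    # Placeholder for a rephrasing model.
    return [f"What is known about {d.rstrip('.')}?" for d in directions]


def keep_unanswered(known_topics: Set[str]) -> Step:
    # Placeholder for searching an academic database for existing work.
    def step(questions: List[str]) -> List[str]:
        return [q for q in questions
                if not any(topic in q for topic in known_topics)]
    return step


corpus = [
    "We study X. Future work: dose effects in older adults.",
    "A survey with no stated open problems.",
    "Future work: long-term adherence.",
]
find_open_questions = compose(
    extract_directions,
    rephrase_as_questions,
    keep_unanswered({"long-term adherence"}),
)
open_questions = find_open_questions(corpus)
```

Here `open_questions` contains only the question about dose effects in older adults, because the toy "database" already covers long-term adherence. The design point is that each step has the same shape, so users could swap in a different model or data source for any one step without touching the others.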
To ensure users have the tools they need to design their personal research assistants, we’ll work on projects like:
We’ll keep refining the core subtasks underlying many research workflows. This entails both task-specific work, such as building out search infrastructure for academic articles, and general-purpose human feedback mechanisms.
One of our biggest projects right now is building a semantic search engine for 200 million abstracts and 66 million full-text papers using language model embeddings.
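As a rough illustration of embedding-based search, the sketch below ranks abstracts by cosine similarity between a query vector and each abstract's vector. It is a toy under stated assumptions: the 3-dimensional vectors in `TOY_INDEX` are made up and stand in for real language model embeddings, and the helper names `cosine` and `top_k` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every document against the query and keep the k most similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up vectors standing in for embeddings of three abstracts.
TOY_INDEX = {
    "abstract_a": [0.9, 0.1, 0.0],
    "abstract_b": [0.1, 0.9, 0.0],
    "abstract_c": [0.7, 0.3, 0.1],
}

hits = top_k([1.0, 0.0, 0.0], TOY_INDEX, k=2)
```

With these toy vectors, `hits` lists `abstract_a` first and `abstract_c` second. Scoring every document like this is only workable for small collections; at the scale of hundreds of millions of abstracts, a search engine would typically rely on an approximate nearest neighbor index instead of a full scan.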
On the human-feedback side, we’ll apply and contribute to methods for alignment. For example:
When we run into problems automating a task, we always want to understand whether this is because of limited data or limited model capacity. We are confident that model capacity will improve over time, and are primarily concerned with providing the data and training objective that will make good use of the available capacity at any point in time.
In the ideal world, the only constraint for new workflows is the compute time for running language models. To compete with end-to-end training, running new workflows using decomposition needs to have near-zero friction. This requires that we can run complex task pipelines, add new tasks with little effort, and efficiently gather human demonstrations and feedback.
We’ll build the infrastructure to execute very large graphs of tasks and deal with the challenges that come up in this setting, such as:
Adding new primitive tasks is labor-intensive. We need to think about what data is needed, create gold standards, collect finetuning data from contractors, evaluate model results using contractors, and use our judgment to improve instructions for contractors.
In the ideal world, we would just say "categorize whether this study is a randomized controlled trial" and an elegant machine involving copies of GPT-k, contractors, etc., would start up, generate a plan for accomplishing this task, critique and improve the plan, and execute it without any intervention on our part.
To get to this world:
Given a new task that models can't do out of the box, we need efficient mechanisms for gathering human demonstrations, using both a scalable contractor workforce and Elicit users. This is less distinctive to Elicit since everyone who trains models on human demonstrations and feedback has to cope with it. We are aiming to outsource as much of it as we can, but it is an important ingredient nonetheless.
Cases where users can provide good feedback but contractors naively can't are particularly interesting because they let us test how we can get feedback and demonstrations for tasks where it's hard to get good human oversight. They are a test case for the future where we want to accomplish tasks for which neither contractors nor users can provide feedback directly.
Zooming out, our milestones for the next few years are:
We’re starting by studying a group of researchers, who are thoughtful about how they discover and evaluate information and who have high standards of rigor. We’ll design Elicit to replicate their processes, using language models to apply them at a greater scale than humanly possible.
Eventually, we’ll make these research best practices available even to non-experts, to empower them when interacting with experts or making life decisions. We’ll support a diverse set of research workflows, then other workflows beyond research.
We’ll develop Elicit compositionally so that the system remains aligned and legible even as the reasoning it supports grows increasingly complex.
Today, researchers already find Elicit valuable. Yet there is much left to do. We’ve described the work we see ahead of us to get to a world with better reasoning. Join us!