This is an update on our progress towards our goals over the last six months. Briefly:
- We implemented two prototypes for our Factored Cognition project.
- We presented our work at CHAI, FHI, and an FHI/DeepMind seminar.
- We published a tech report on Predicting Slow Judgments.
- We hired Ben Rachbach as Interim Head of Operations.
- We got 501(c)(3) status from the IRS.
- We received a grant from the Open Philanthropy Project.
Research and implementation
In our Factored Cognition project, we explore whether we can answer difficult questions by decomposing the cognitive work involved into small pieces that can be tackled by individual agents who don't know the big picture.
In March, we published a review of technical background for Factored Cognition. The writeup presents a taxonomy of approaches to decomposing cognitive work. At the end of the writeup, we picked one of the approaches as a tentative implementation target, and we sketched a plan to experiment with the implementation in order to answer our key open questions about Factored Cognition.
Since March, we have implemented and open-sourced two prototypes, Mosaic and Patchwork.
Mosaic is a web app built by Ozzie Gooen and Andrew Schreiber that supports creating and editing recursive question-answer trees with pointers. It doesn't support automation, doesn't currently schedule work between multiple users, and still has a number of usability issues. You can take a look at our running instance or check out the source code.
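As a rough illustration of what "recursive question-answer trees with pointers" could look like as a data structure, here is a minimal sketch. It is not Mosaic's actual data model: the names `Node`, `tree_size`, and `render`, and the `$name` pointer syntax, are assumptions made for this example.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional


@dataclass
class Node:
    """One question in the tree, with an optional answer and sub-questions."""
    question: str
    answer: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def tree_size(node: Node) -> int:
    """Count the questions in the tree rooted at this node."""
    return 1 + sum(tree_size(child) for child in node.children)


def render(text: str, pointers: Dict[str, str],
           expanded: FrozenSet[str] = frozenset()) -> str:
    """Show text with expanded pointers inlined and all others left collapsed."""
    def substitute(match):
        name = match.group(1)
        if name in expanded and name in pointers:
            return "[" + pointers[name] + "]"
        return match.group(0)  # collapsed: keep the $name placeholder

    return re.sub(r"\$(\w+)", substitute, text)


# Hypothetical example: a large piece of data hidden behind a pointer.
pointers = {"essay": "full text of a long essay"}
root = Node(
    "Is the argument in $essay sound?",
    children=[
        Node("What is the main claim of $essay?"),
        Node("What evidence does $essay give for its main claim?"),
    ],
)

collapsed_view = render(root.question, pointers)
expanded_view = render(root.question, pointers, frozenset({"essay"}))
```

In this sketch, a user sees only the collapsed question by default and chooses which pointers to expand, so large pieces of context can be passed around without every participant reading them.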
Patchwork was built by Ben Weinstein-Raun and is a command-line app aimed at developing robust foundations for the next version of Mosaic, with principled support for multiple users and extensive automation. The repository includes a short screencast. Like Mosaic, it is an MIT-licensed open-source project.
By implementing Patchwork, we developed a better understanding of the design challenges that face systems for Factored Cognition. The main open challenges we see right now are to:
- Implement reflection that can capture all actions, including pointer expansions.
- Better understand budgets and how they should interact with cache-based automation.
- Understand how laziness and exception handling should interact.
- Investigate speculative execution based on predicted responses.
- Figure out how to simulate edits to questions in a user-friendly fashion if questions and pointer values are immutable in the underlying system, or decide not to make them immutable.
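To make the question about budgets and cache-based automation more concrete, here is a minimal sketch under stated assumptions. It is not how Patchwork works: the class `CachedAnswerer`, the `BudgetExceeded` exception, and the rule that a cache hit costs nothing while a fresh human answer costs one unit are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, Tuple


class BudgetExceeded(Exception):
    """Raised when a new question arrives but no budget is left to answer it."""


class CachedAnswerer:
    """Answer questions by replaying cached responses when possible."""

    def __init__(self, ask_human: Callable[[str], str]) -> None:
        self.ask_human = ask_human
        self.cache: Dict[str, str] = {}
        self.human_calls = 0

    def answer(self, question: str, budget: int) -> Tuple[str, int]:
        """Return (answer, remaining budget).

        Assumption: cache hits are free and each human answer costs 1 unit.
        """
        if question in self.cache:
            return self.cache[question], budget
        if budget < 1:
            raise BudgetExceeded(question)
        response = self.ask_human(question)
        self.human_calls += 1
        self.cache[question] = response
        return response, budget - 1


# Hypothetical stand-in for a human: just upper-case the question.
answerer = CachedAnswerer(lambda q: q.upper())
first, remaining = answerer.answer("what is 2 + 2?", budget=2)
second, remaining_after_hit = answerer.answer("what is 2 + 2?", budget=remaining)
```

Keying the cache on the exact question text only works cleanly if questions are immutable, which is one reason the immutability question above matters. Whether cached answers should really be free is exactly the kind of design decision that remains open.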
Our work on Patchwork also helped us reflect on Mosaic. We now think that perhaps we should have held off on developing a GUI app like Mosaic until we had worked out the relevant concepts using a command-line app like Patchwork. As it stands, we'll need to rewrite large parts of Mosaic once we're happy with our understanding of the foundations. However:
- Mosaic has been helpful for early informal testing, and we have used it extensively to visualize trees for presentations.
- It would have been difficult to find strong collaborators for extensive work on something like Patchwork.
Still, we've updated towards focusing more on conceptual development before implementation (perhaps through writing "throwaway" code).
We haven't done serious experimentation or testing of Factored Cognition yet, only informal single-user experiments using Mosaic. That said:
- We haven't yet encountered questions where it seems impossible to make progress through decomposition, so we’re slightly more optimistic that Factored Cognition can work for a very wide range of tasks.
- We now have a more visceral sense that, for many problems, decomposition will require a very large amount of work, so we often won't be able to instantiate complete question-answer trees explicitly and will instead need to "distill" the question-answer behavior of entire sub-trees using ML. For example, we think that this will be the case for the most natural approaches to belief updating, language understanding, and math proofs.
- We think there's a good chance that decomposition is a lot more difficult with multiple users, since each user lacks context on the parts of the tree that they haven’t seen.
We still plan to delay extensive experimentation until we have built a web app that supports cache-based automation. However, we think that we should have run basic multi-user experiments by now. We could have a human manager distribute tasks between users to run a multi-user experiment with the current version of Mosaic. Such an experiment should provide cheap evidence on how much harder decomposition with true isolation is, compared to a single person doing the entire recursive breakdown. We are planning to run such experiments with about five participants in the next two months.
We presented the Factored Cognition project at CHAI, FHI, and an FHI/DeepMind seminar. For people interested in AI alignment, our annotated slides are probably the best introduction to our work right now.
Predicting Slow Judgments
In our Predicting Slow Judgments project, we explore whether we can use machine learning algorithms to make well-calibrated predictions of human judgments in AI-complete domains.
Together with our collaborators at FHI (Owain Evans, Chris Cundy, Tom McGrath, Ryan Carey, Zac Kenton), we published a tech report (pdf) about the project. This report is based on our online experiment ThinkAgain, where we presented participants with statements about politics and Fermi estimation and asked them to judge the probability that each statement is true. We had them make multiple judgments per statement, giving them more time to think about their judgment with each iteration. Thanks to everyone who participated!
We used the data from participants to explore the ML problem of predicting expensive (slow) judgments in cases where most of our training data consists of cheap (fast) proxies. We used standard collaborative filtering algorithms, neural collaborative filtering, and Bayesian hierarchical regression.
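For readers unfamiliar with this family of methods, the following is a minimal sketch of a bias-only baseline of the kind often used as a starting point in collaborative filtering. It is not one of the models from the tech report; the function names and the toy judgments are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Judgment = Tuple[str, str, float]  # (user, statement, probability judgment)
Model = Tuple[float, Dict[str, float], Dict[str, float]]


def fit_bias_model(judgments: List[Judgment]) -> Model:
    """Fit a global mean plus one offset per user and one per statement."""
    global_mean = sum(value for _, _, value in judgments) / len(judgments)

    user_residuals = defaultdict(list)
    for user, _, value in judgments:
        user_residuals[user].append(value - global_mean)
    user_offset = {u: sum(r) / len(r) for u, r in user_residuals.items()}

    item_residuals = defaultdict(list)
    for user, item, value in judgments:
        item_residuals[item].append(value - global_mean - user_offset[user])
    item_offset = {i: sum(r) / len(r) for i, r in item_residuals.items()}

    return global_mean, user_offset, item_offset


def predict(model: Model, user: str, item: str) -> float:
    """Predict a judgment, clipped to the valid probability range [0, 1]."""
    global_mean, user_offset, item_offset = model
    raw = global_mean + user_offset.get(user, 0.0) + item_offset.get(item, 0.0)
    return min(1.0, max(0.0, raw))


# Toy data, invented for illustration only.
training = [("u1", "s1", 0.8), ("u1", "s2", 0.6), ("u2", "s1", 0.4)]
model = fit_bias_model(training)
prediction = predict(model, "u2", "s2")  # a pair not seen in training
```

A baseline like this can only exploit consistent per-user and per-statement tendencies. More expressive collaborative filtering models additionally try to learn similarities between users and between statements.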
Our data and models are available in this git repository. The dataset we collected has issues that make it difficult to explore the use of ML for making well-calibrated predictions of slow judgments based mostly on fast judgments. These issues include:
- While slow judgments were significantly more accurate than fast judgments, the difference was smaller than intended.
- Variability among subjects is difficult to distinguish from noise, so ML algorithms could not exploit similarities among users as in collaborative filtering.
- While users clearly found some questions more difficult than others, this variation is very hard for current ML algorithms to exploit.
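One common way to score probability judgments against known outcomes is the Brier score (mean squared error between the stated probability and the 0/1 outcome, lower is better). The sketch below shows how fast and slow judgments could be compared this way. The metric choice is an assumption for illustration and may differ from the one used in the report, and the numbers are invented toy data.

```python
from typing import Sequence


def brier_score(probabilities: Sequence[float], outcomes: Sequence[bool]) -> float:
    """Mean squared difference between stated probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes):
        raise ValueError("need exactly one outcome per probability")
    if not probabilities:
        raise ValueError("need at least one judgment")
    total = sum((p - float(o)) ** 2 for p, o in zip(probabilities, outcomes))
    return total / len(probabilities)


# Invented toy judgments for three statements (not data from the experiment).
truth = [True, False, False]
fast_judgments = [0.6, 0.5, 0.4]
slow_judgments = [0.8, 0.3, 0.2]

fast_score = brier_score(fast_judgments, truth)
slow_score = brier_score(slow_judgments, truth)
```

If the gap between the two scores is small, as the first issue above describes, there is little signal for a model to learn about how slow judgments differ from fast ones.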
Ryan and Tom attempted to manually predict users' slow judgments based on the judgments available as part of the training set in order to learn how close to human-level our algorithms are. They got results comparable to simple predictive models and qualitatively found the task to be very challenging. This provides some evidence that the dataset doesn't contain interesting patterns that humans could easily notice but ML algorithms can't.
In the end, we invested a large amount of labor (for example, about 400 hours of Andreas's time) for relatively little scientific gain. Ryan surveyed the other team members in a post-mortem, and the top issues (paraphrased) were:
- Objectives weren't sufficiently clear, especially at onboarding.
- Progress and value of the project weren't evaluated sufficiently often; there were no clear decision points to reconsider whether to keep going with the project or abandon it.
In our next academic collaboration, we'll discuss the project's goals and relation to the bigger picture during onboarding in more depth. We are also planning to do explicit monthly re-evaluations to determine whether the project should continue on its current course, change course, or be cancelled altogether.
We still feel that the ML problem behind this project—making well-calibrated predictions in AI-complete domains with infrequent direct supervision—is important, but plan to hold off on further work until we can do it in the context of data generated as part of our Factored Cognition project. A key long-term goal for Predicting Slow Judgments has always been to contribute to automating cognitive actions in that setting. If we can directly use that data and setup, we feel more confident that the things we learn will contribute to our organizational goals. If we use other settings—such as Fermi and Politifact probability judgments—as proxies, there is always a risk that the insights and solutions won't generalize to the setting we most care about.
Hiring
Besides research, expanding our team has been our main focus. Over the past several months, we’ve been hiring for full-stack web developers, research engineers, researchers, and a chief operating officer.
We have not yet hired full-time employees for any of these roles. However, we have worked with contractors and a part-time employee to cover some of the responsibilities of these roles, and we have made progress towards hiring full-time employees.
As discussed above, we worked with contractors Ozzie Gooen, Andrew Schreiber, and Ben Weinstein-Raun to develop our research apps. We also hired Ben Rachbach as Interim Head of Operations. He previously worked as a software engineer at Wonder Workshop and as a research analyst at GiveWell. Having Ben in this interim role will allow us to take our time and be selective in filling our COO role.
An advantage of working with contractors has been that we’ve been able to hire them for work in their area of comparative advantage, then let them move on to other high-impact work once we have no more tasks suitable for them. For example, Ozzie architected and created the first version of our app Mosaic, and he’s now working on other projects until we need more help architecting web apps.
A disadvantage of working with contractors has been that we haven’t been able to build a stable, predictable team dedicated to Ought’s success in the long term.
In the past two months, we have focused mainly on hiring a COO and a senior research engineer who could architect the apps that we use for our research. We are still accepting applications for these roles, and we encourage interested people to apply!
For our research engineering role, we originally placed some emphasis on prior ML or web development experience. However, we have realized that a general computer science and software engineering background is the most important qualification for success in the role. We are looking for candidates with experience creating good abstractions, who have functional programming experience, who may have built interpreters or compilers, and who generally have substantial CS background (as might be acquired by taking courses on design and theory of algorithms and programming languages).
In total, we have considered 117 potential hires so far (including leads we decided not to reach out to and candidates who are still moving through our hiring pipeline). We have created short trial tasks for both operations and engineering. Ben West has been helping us with the COO recruiting process.
We believe that the main reason that we have not hired full-time employees for the roles above yet is that we had not been able to spend much time on hiring until recently. We’ve set a high bar for hiring people full-time, so it has taken a good deal of effort on our part to find and evaluate candidates who might meet our bar. With Ben Rachbach and Ben West devoting substantial time to hiring, we’ve recently been able to reach a number of interested and potentially qualified applicants and move them through the hiring process.
Organization and funding
The IRS approved our application for 501(c)(3) status, which means that donations are now tax-deductible. We're not actively seeking donations at the moment and are primarily talent-constrained.
We moved out of a coworking space and into our own office in North Beach, San Francisco. We now have enough space to host visiting researchers working on topics relevant to our mission.
Since hiring Ben Rachbach, we have been making our way through a backlog of operations tasks such as buying insurance, setting up our office environment, and documenting and formalizing our processes.
Going forward, our priorities are:
- Run basic multi-user experiments with about five participants using Mosaic, simulating automated task scheduling using a human manager.
- Fill the remaining conceptual holes in Patchwork and develop the understanding required to build a robust multi-user app for question decomposition that is a good fit for extensive automation.
- Build a web app for Factored Cognition that builds on the concepts developed in Patchwork. Then make progress on our project plan.
- Fill our open engineering, research, and COO positions. If you're interested in working with us, get in touch!