Markets for microtasks with uncertain rewards

Suppose you want to improve the quality of a Wikipedia page and you have decided to spend $100 on this project within the next month. You don’t want to hire a single editor — instead, you would like to crowdsource this task to the Wikipedia community and reward members to the extent that their work improves your article. You make an announcement, and at the end of the month, you see that there have been about 500 edits to your page, some changing single characters, others making major revisions, some undoing previous edits. How much do you pay to whom?


Here are some analogous situations:

  • You have a repository on GitHub and you want to reward commits depending on how much they help your project achieve its goals.
  • You want help brainstorming, say to solve a medical case, and people can see and build on previous suggestions.
  • You want to outsource a logo design and contributors can see previous designs.
  • You want to reward high-quality submissions and upvotes on a social news site such as Reddit or Hacker News.

In general, whenever you want to incentivize contributions to a project where users can build on the previous state of work, you face a credit assignment problem.

Why this is difficult

What the situations above have in common is that individual contributions (or “tasks”) are of highly variable and uncertain value to the payer. Some edits to a Wikipedia article, changes to code, or incremental variations on an idea are much more helpful than others, but the exact value of each task is difficult to determine. This distinguishes the tasks from essentially all tasks that are currently available on crowdsourcing platforms such as Mechanical Turk.

At the same time, as the payer, you would like to give out rewards in proportion to how helpful a contribution is, so that market participants are incentivized to be as helpful as possible. However, the overhead of evaluation can easily be comparable to the value of the task itself, and is sometimes much larger. Evaluating every task to determine how much to pay is therefore generally economically infeasible.

A baseline strategy: randomized evaluation

In this situation, is it possible at all to set up a mechanism with the correct incentives? Consider the following strategy:

  1. For each task, evaluate it in depth with probability ε and decide what reward r to pay.
  2. Pay r/ε if a task gets evaluated and nothing otherwise.

By reducing ε, the expected evaluation cost can be made arbitrarily small. In expectation, every participant still gets paid the correct amount.
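To make the two steps above concrete, here is a minimal simulation sketch. The function names and the numbers (a task worth $2, a 10% evaluation rate) are made up for illustration, and the in-depth evaluation is assumed to return the true reward exactly.

```python
import random

def randomized_payment(reward, epsilon, rng):
    """Pay reward / epsilon with probability epsilon, and nothing otherwise."""
    if rng.random() < epsilon:
        # This task was selected for in-depth evaluation.
        return reward / epsilon
    return 0.0

def average_payment(reward, epsilon, n_tasks, seed=0):
    """Average payout over n_tasks identical tasks (a Monte Carlo estimate)."""
    rng = random.Random(seed)
    total = sum(randomized_payment(reward, epsilon, rng) for _ in range(n_tasks))
    return total / n_tasks

# Hypothetical numbers: each task is truly worth $2, and we evaluate 10% of them.
estimate = average_payment(reward=2.0, epsilon=0.1, n_tasks=200_000)
print(f"average payment per task: {estimate:.3f}")  # close to 2.0, not exactly 2.0
```

Only about one task in ten is ever evaluated here, yet the average payout per task stays close to the true value of $2.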

Of course, the variance of rewards under this strategy can be ridiculously high, and real participants are not risk-neutral. But the fact that there is a strategy that reduces the cost of supervision and that has the right incentives matters: it shows that we can view our goal as reducing the variance in our payments while introducing as little bias as possible.

(I think I first encountered this view talking to Paul Christiano.)
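How high is the variance? For a single task with true reward r, the payment is r/ε with probability ε and 0 otherwise, so its variance is r²(1 − ε)/ε. The sketch below just evaluates that formula; the function name is mine, and it again assumes that an evaluation recovers r exactly.

```python
import math

def payment_std(reward, epsilon):
    """Standard deviation of a payment that is reward/epsilon w.p. epsilon, else 0.

    E[X] = reward and E[X^2] = reward^2 / epsilon, so
    Var[X] = reward^2 * (1 - epsilon) / epsilon.
    """
    return reward * math.sqrt((1.0 - epsilon) / epsilon)

# Spread of a single payment, in multiples of the true reward.
for epsilon in (0.5, 0.1, 0.01):
    print(epsilon, round(payment_std(1.0, epsilon), 2))
```

At ε = 0.01 the standard deviation of a single payment is about ten times the reward itself, which is why the rest of this post is about reducing variance.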

Predicting rewards using supervised machine learning

What if we don’t want to only reward contributions that we evaluate in depth? A natural variation on the strategy above is to evaluate some fraction of tasks, pay exactly the elicited rewards for the tasks that we do evaluate, and predict the rewards for the tasks that we don’t evaluate.

We can start by treating this as a supervised machine learning problem. The tasks where we do evaluate in depth serve as a training set. We’re trying to learn a function f: X → P that takes as input a set of task features x and returns a distribution on rewards p. When we need to decide on a reward, a first solution is to provide exact rewards where we do have evaluations, and to simply take the expected value of p otherwise.

There’s a wide range of features we can use, most prominently the identity of the task author and associated information (such as their history of rewards) and judgments by other participants (likes, upvotes/downvotes, 1-5 star ratings, proposed rewards). We can also provide the content of the contribution, but making good use of metadata such as author identity is easier and will probably do most of the work in the near future.
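As a rough sketch of what this could look like, the toy model below uses author identity as the only feature and represents the predicted distribution p as the list of rewards that author received in past evaluations. All names (fit_reward_model, decide_reward, the "evaluated_reward" key) are hypothetical, and a real system would presumably use a proper learning algorithm with many more features.

```python
from collections import defaultdict
from statistics import mean

def fit_reward_model(evaluated):
    """evaluated: list of (author, reward) pairs from in-depth evaluations.

    Returns a function mapping an author to a list of reward samples,
    which stands in for the predicted distribution p.
    """
    by_author = defaultdict(list)
    for author, reward in evaluated:
        by_author[author].append(reward)
    all_rewards = [reward for _, reward in evaluated]

    def predict(author):
        # Fall back to the pooled samples for authors with no evaluations.
        return by_author.get(author, all_rewards)

    return predict

def decide_reward(task, predict):
    """Exact reward where an evaluation exists, else the mean of p."""
    if task.get("evaluated_reward") is not None:
        return task["evaluated_reward"]
    return mean(predict(task["author"]))

# Hypothetical training set of evaluated tasks.
predict = fit_reward_model([("alice", 4.0), ("alice", 6.0), ("bob", 1.0)])
print(decide_reward({"author": "alice"}, predict))                        # mean of alice's samples
print(decide_reward({"author": "bob", "evaluated_reward": 0.5}, predict))  # the exact evaluation
```

An unseen author falls back to the pooled rewards, which is one simple way to handle newcomers; whether that is a good default is a separate design question.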

Complications and extensions

The true value of a task may be difficult to determine

Previously, we’ve talked abstractly about evaluating a subset of tasks in depth. In a first concrete attempt, we might simply ask the payer: “How much is this task worth to you?” However, this may be difficult to answer directly, even if we allow for plenty of time to reflect.

One strategy we can apply here is to not just ask a single question about the reward for a task, but to ask many questions in order to triangulate the desired reward. For example:

  • How helpful were the tasks as a whole?
  • Which tasks were most helpful?
  • Which tasks were least helpful?
  • Is task x more helpful than task y?
  • How much worse would you have been off if task x hadn’t happened?
  • What is the next bigger group of tasks that x could be grouped with?
  • How helpful was this entire group?

If we define a formal semantics for each of these questions and a probabilistic model that relates their answers to each other and to the question “How much should we pay for task x?”, we can use the answers to infer the reward automatically, without ever asking about it explicitly. (We might still want to ask directly as well, e.g. “The inferred reward is r. Does that seem too high, too low, or just right?”)
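One very simple instance of such a model, which is my illustration rather than a worked-out proposal: assume each answer can be translated into a noisy estimate of the reward for task x with a known variance, and combine the estimates by inverse-variance weighting (the posterior mean under independent Gaussian noise and a flat prior). The translation step and all the numbers below are hypothetical.

```python
def combine_estimates(estimates):
    """Combine noisy readings of the same reward.

    estimates: list of (value, variance) pairs.
    Returns (mean, variance) of the combined estimate, assuming independent
    Gaussian noise and a flat prior (inverse-variance weighting).
    """
    precision = sum(1.0 / variance for _, variance in estimates)
    mean = sum(value / variance for value, variance in estimates) / precision
    return mean, 1.0 / precision

# Hypothetical answers, already translated into estimates for task x:
# "The whole group was worth 40, and x was about a quarter of it" -> 10, quite noisy.
# "I would have been about 8 worse off without x"                 -> 8, less noisy.
reward, variance = combine_estimates([(10.0, 16.0), (8.0, 4.0)])
print(reward, variance)
```

The combined estimate sits closer to the less noisy answer, and its variance is lower than that of either answer alone, which is the sense in which several indirect questions can triangulate a reward.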

Task features may not be sufficient to make high-quality reward predictions

There are two things we can do about this. First, if we predict distributions on rewards, we can use the entropy of the predicted distribution to decide when to go with the expected value (when the distribution is fairly peaked around a single value) and when to gather more information.
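A minimal sketch of the entropy rule, assuming the predicted distribution is discrete and given as parallel lists of rewards and probabilities. The threshold of 1 bit and the function names are arbitrary choices for illustration.

```python
import math

def entropy_bits(probs):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def pay_or_evaluate(rewards, probs, max_entropy_bits=1.0):
    """Pay the expected reward if the prediction is peaked, else ask for more information."""
    if entropy_bits(probs) <= max_entropy_bits:
        expected = sum(r * p for r, p in zip(rewards, probs))
        return ("pay", expected)
    return ("evaluate", None)

# Hypothetical predictions over the rewards $0, $5 and $10.
peaked = pay_or_evaluate([0.0, 5.0, 10.0], [0.05, 0.9, 0.05])
spread = pay_or_evaluate([0.0, 5.0, 10.0], [1 / 3, 1 / 3, 1 / 3])
print(peaked[0], spread[0])
```

Note that entropy ignores how far apart the possible rewards are; a variance threshold would be a reasonable alternative when the size of the possible error matters more than the number of plausible outcomes.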

Second, we can incentivize the creation of predictive signals: we can reward contributors who provide information (such as informative likes, or accurate reward guesses) that helps reduce our uncertainty about what rewards to pay.
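As one hypothetical way to reward accurate reward guesses: whenever a task does get evaluated in depth, pay each guesser in proportion to how much their guess reduced the squared error relative to the prediction we would have made without it. The scoring rule, the floor at zero, and the scale parameter are all assumptions of this sketch.

```python
def signal_bonus(guess, baseline, evaluated_reward, scale=1.0):
    """Bonus for a reward guess, paid only on tasks that were evaluated in depth.

    The bonus grows with the reduction in squared error compared to the
    baseline prediction, and is zero if the guess was no better than the baseline.
    """
    baseline_error = (baseline - evaluated_reward) ** 2
    guess_error = (guess - evaluated_reward) ** 2
    return scale * max(0.0, baseline_error - guess_error)

# Hypothetical case: we would have predicted 5, the evaluation came out at 9.
print(signal_bonus(guess=8.0, baseline=5.0, evaluated_reward=9.0, scale=0.1))  # helpful guess
print(signal_bonus(guess=2.0, baseline=5.0, evaluated_reward=9.0, scale=0.1))  # unhelpful guess
```

Flooring the bonus at zero keeps guessers from owing money, at the cost of weakening the penalty for careless guesses.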

Markets are anti-inductive

If a feature is predictive of quality and is therefore used to incentivize good contributions, people will try to produce contributions that have this feature, independent of contribution quality, and it will tend to stop being predictive. If long comments tend to get the highest ratings on Hacker News and we use this fact to automatically reward the best comments, we’ll soon see long low-quality comments. In other words: people will try to cheat.

This is Goodhart’s law:

“Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.”

We are indeed trying to apply machine learning in a setting where the underlying distribution is shifting, i.e. in a non-stationary setting. This will make the learning problem harder, and our predictions will be more uncertain, but I expect that there are techniques we can apply to get acceptable performance. Ongoing work in ML, perhaps along the lines of Kuleshov 2017, will hopefully improve the situation over time.

Rewarding highly predictive signals (such as upvotes by a particular user) can also help prevent some signals from deteriorating.

Some tasks can only be evaluated in context

For example, consider deletions in a document that someone made in preparation for larger edits, compared to deletions that are simply vandalism.

There’s a lot we can do:

  • We can show context (such as previous and subsequent tasks) when doing reward evaluations, and also use contextual features as additional inputs into our prediction algorithm.
  • We can require contributors to batch tasks into semantically meaningful pieces, and give low rewards otherwise.
  • We can require contributors to provide annotations and justifications for tasks that are otherwise hard to judge.
  • We can evaluate larger groups of tasks to determine the value of smaller pieces (see “The true value of a task may be difficult to determine” above).

To arrange fair trades, consider both parties

If we want to arrange trades that are fair to both parties, or if we want to control how gains from trade are distributed more generally, it’s not enough to determine how the payer should value each task. We also need to take into account that tasks vary in how much effort they require. For mutually beneficial trade, payments need to lie somewhere between what the efforts cost workers and what the tasks are worth to the payer. Above, I have only considered the payer’s side. For information work, I expect that workers will generally have a better grasp on effort than payers do on value, so I think it makes sense to start with the payer’s valuation, which is the harder side to pin down, but ultimately we want to get a handle on both.
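The constraint on payments can be written down directly. In this sketch, worker_share is a hypothetical knob for how the gains from trade are split, and both the value and the cost are assumed to be known, which is exactly what is hard in practice.

```python
def fair_payment(value_to_payer, cost_to_worker, worker_share=0.5):
    """Payment between the worker's cost and the payer's value.

    worker_share is the fraction of the surplus (value minus cost) that goes
    to the worker. Returns None when no mutually beneficial trade exists.
    """
    if value_to_payer < cost_to_worker:
        return None
    surplus = value_to_payer - cost_to_worker
    return cost_to_worker + worker_share * surplus

# Hypothetical task: worth $10 to the payer, $4 of effort for the worker.
print(fair_payment(10.0, 4.0))       # even split of the $6 surplus
print(fair_payment(10.0, 4.0, 1.0))  # worker captures the whole surplus
print(fair_payment(3.0, 4.0))        # no trade is mutually beneficial
```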

This is a slightly generalized version of the ideas in the section on reward distribution using probabilistic models in the original report on Dialog Markets. Virtual Moderation is an application of similar ideas to the setting of moderating online communities.