When is the 'ground truth' not quite the whole truth?
AI often learns the quirks of specific humans rather than human-like ability
“Good because I’ve been waiting since September!”
What emotion do you think this statement conveys? Excitement? Relief? Gratitude? Approval?
The quote is from a Reddit post, and also forms part of the GoEmotions machine learning training dataset. In 2020, a team from Google, Stanford and Amazon built this dataset by asking humans to tag 58,000 Reddit comments with 28 different labels (27 emotion categories plus ‘neutral’).
The humans often disagreed. When five humans were asked to tag the statement at the top with emotion(s) from the list, each person gave a different opinion:
Excitement and joy
Approval and relief
Admiration
Gratitude
Approval and desire
In the final ‘consensus’ dataset, emotion labels were only included if they’d been selected by at least two annotators. Hence the above response was only tagged as being about ‘approval’.
This leads to an apparent paradox: the ‘ground truth’ emotion in this case is one that a majority of the original human annotators did not think matched the statement.
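To make the aggregation rule concrete, here is a minimal sketch in Python that applies the "at least two annotators" threshold to the five responses listed above. The function name `consensus_labels` and the `min_votes` parameter are my own illustrative choices, not names from the GoEmotions code.

```python
from collections import Counter

# The five annotations quoted above for the example comment.
annotations = [
    {"excitement", "joy"},
    {"approval", "relief"},
    {"admiration"},
    {"gratitude"},
    {"approval", "desire"},
]

def consensus_labels(annotations, min_votes=2):
    """Keep only the labels chosen by at least `min_votes` annotators."""
    votes = Counter(label for labels in annotations for label in labels)
    return {label for label, n in votes.items() if n >= min_votes}

consensus = consensus_labels(annotations)

# How many annotators actually picked the surviving label?
support = sum("approval" in labels for labels in annotations)

print(consensus)                         # {'approval'}
print(f"{support}/{len(annotations)}")   # 2/5
```

Only ‘approval’ clears the threshold, yet three of the five annotators never selected it.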
As a result, the judgements of the human annotators were not a very good reflection of the ‘consensus’ derived from their judgements. When I imported the raw data, then held out the annotators one-by-one from the dataset, I found that each annotator was generally a bad match for the consensus. The chart below shows the distribution of F1 scores for the human annotators vs the ‘ground truth’ consensus. (F1 ranges from 0 = worst to 1 = best, and provides a summary of prediction performance that penalises both false positives and false negatives.[1])
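As a rough illustration of this kind of comparison, the sketch below scores each of the five annotators from the example comment against the consensus label set. It is a simplification under stated assumptions: it uses a single comment rather than the full dataset, the annotator names are placeholders, and each annotator is compared against the consensus built from all five responses, which may differ from how the hold-out was done in the analysis described above.

```python
from collections import Counter

# The five annotations quoted earlier; the annotator names are placeholders.
annotations = {
    "annotator_1": {"excitement", "joy"},
    "annotator_2": {"approval", "relief"},
    "annotator_3": {"admiration"},
    "annotator_4": {"gratitude"},
    "annotator_5": {"approval", "desire"},
}

def consensus_labels(label_sets, min_votes=2):
    """Keep only the labels chosen by at least `min_votes` annotators."""
    votes = Counter(label for labels in label_sets for label in labels)
    return {label for label, n in votes.items() if n >= min_votes}

def f1(predicted, truth):
    """F1 between two label sets: harmonic mean of precision and recall."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # penalises false positives
    recall = true_positives / len(truth)         # penalises false negatives
    return 2 * precision * recall / (precision + recall)

consensus = consensus_labels(annotations.values())
scores = {name: f1(labels, consensus) for name, labels in annotations.items()}

for name, score in scores.items():
    print(f"{name}: F1 = {score:.2f}")
```

On this one comment, three annotators score 0 and the two who chose ‘approval’ score 0.67, because their second label counts as a false positive.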
In the accompanying paper, the research team trained a BERT language model on the ‘true’ emotions in the consensus dataset. This model achieved an F1 score of 0.46, which meant it performed better than 87% of the underlying human annotators when it came to predicting the consensus emotions:
This might seem like an early win for AI language models. But just as individual humans were bad at matching the ‘consensus’, a model trained on a summary dataset is not learning a general theory of human emotion. At best, it is learning the aggregation rule that the researchers applied to annotator judgements. And, much like the annotators themselves, it cannot directly recover the original author’s emotional intent:
Indeed, when the research team applied the trained model to wider emotion-labelled datasets, it initially performed poorly (i.e. low F1 values on the left side of each plot for ‘transfer learning’ below). It only improved once it had been fine-tuned on many additional annotations from that specific new dataset:
The non-wisdom of crowds
In 1907, Francis Galton famously observed that when hundreds of farmers guessed the weight of an ox in a livestock competition, the average of the human guesses was 1197lb and the true weight was 1198lb.[2] This would become known as the ‘wisdom of the crowd’: if we combine multiple independent and diverse judgements, it can produce an estimate that is more accurate than most individual opinions.
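The averaging effect is easy to reproduce in a toy simulation. The sketch below does not use Galton's data: the crowd size (500), the spread of guesses (about 75lb) and the assumption that errors are independent and centred on the true weight are all hypothetical choices made for illustration.

```python
import random
import statistics

TRUE_WEIGHT = 1198  # lb, the true weight quoted above
rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical crowd: 500 independent guesses, unbiased, with ~75lb of noise.
guesses = [rng.gauss(TRUE_WEIGHT, 75) for _ in range(500)]

crowd_estimate = statistics.mean(guesses)
crowd_error = abs(crowd_estimate - TRUE_WEIGHT)

# Share of individuals whose own guess was further off than the crowd average.
share_worse = sum(abs(g - TRUE_WEIGHT) > crowd_error for g in guesses) / len(guesses)

print(f"crowd error: {crowd_error:.1f}lb")
print(f"individuals further off than the crowd: {share_worse:.0%}")
```

The result depends on the errors being independent and there being a verifiable target to measure them against.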
But the ox had an objective weight that could be verified. In contrast, judgement about emotion is inevitably subjective. If you think a statement is about ‘gratitude’ and two other people think it’s about ‘approval’, does that mean you are wrong?
After all, we wouldn’t necessarily accept this approach to truth in situations where there was an objective and verifiable outcome. As Marilyn vos Savant once put it, ‘math answers aren’t determined by votes’.
Emotion classification isn’t the only situation where the concept of a ‘ground truth’ can shift. Sometimes, as with emotions, the problem is one of subjectivity. On other occasions, the ‘true’ target is objective in theory but imperfectly defined in practice.
For example, earlier this month Claude Opus 4.5 performed strongly on the CORE-Bench benchmark. But as Sayash Kapoor noted, there was a catch: Claude’s score had to be upgraded after it revealed some errors in the existing (human-created) auto-grading process for the benchmark. As he put it: ‘These tasks are often unsolved because of bugs in grading rather than agents genuinely being unable to solve tasks’.
So what should we do if our definition of ‘ground truth’ involves some inevitable subjectivity?
There are three main options we can consider:
1. Scale a specific expert judgement.
One option is to be explicit about whose judgement we are trying to reproduce. In this situation, a model might be trained to mirror the decisions of a particular expert on a given dataset; in other words, it would try to learn what a given human does, rather than what humans do in general. For example, an investment firm might employ an economist to interpret news articles – training on their interpretations could allow their judgement to be mimicked at scale.
2. Define the consensus as the target.
An alternative is to focus on a consensus measurement, regardless of what individual humans concluded. For example, if an outcome depends on an aggregation rule, then the model’s task might be to learn the result of that rule. Take LIBOR (the London Inter-Bank Offered Rate). This was not a true market interest rate but an average of estimates submitted by banks – which traders nonetheless spent a lot of effort trying to predict.
3. Model the distribution of human disagreement.
Rather than collapse disagreement into a single consensus label, we could instead predict the full distribution of human responses. This idea has a long history, from models such as Dawid–Skene – which account for annotator reliability while still assuming a single underlying label – to more recent approaches that allow overlapping labels and explicitly treat ambiguity as signal rather than error. The idea here is to capture the spread of noisy human judgements, rather than throwing away this information.
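To see how the three options differ in practice, the sketch below derives three different training targets from the same five annotations of the example comment. It is only an illustration: the annotator names are placeholders, treating `annotator_2` as "the expert" is an arbitrary choice, and the per-label selection rate is a much simpler stand-in for the distribution models mentioned above (it is not Dawid–Skene).

```python
from collections import Counter

# The five annotations of the example comment; annotator names are placeholders.
annotations = {
    "annotator_1": {"excitement", "joy"},
    "annotator_2": {"approval", "relief"},
    "annotator_3": {"admiration"},
    "annotator_4": {"gratitude"},
    "annotator_5": {"approval", "desire"},
}
votes = Counter(label for labels in annotations.values() for label in labels)

# Option 1: mirror one chosen expert (here, arbitrarily, annotator_2).
expert_target = annotations["annotator_2"]

# Option 2: learn the output of the aggregation rule (at least two votes).
consensus_target = {label for label, n in votes.items() if n >= 2}

# Option 3: keep the disagreement, as the share of annotators picking each label.
# Annotators can pick several labels, so these shares need not sum to 1.
distribution_target = {label: n / len(annotations) for label, n in votes.items()}

print("expert:      ", sorted(expert_target))
print("consensus:   ", sorted(consensus_target))
print("distribution:", dict(sorted(distribution_target.items())))
```

The same comment yields three different answers depending on the target chosen: two labels for the expert, one for the consensus, and seven weighted labels for the distribution.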
Ultimately, it is less about defining a single ‘ground truth’, and more about what we count as a successful model.
[1] It’s worth noting that F1 score has a number of limitations as an evaluation metric, as elaborated on in this recent paper. But I’ve used it to allow direct comparison with the model in the paper.
[2] Galton initially reported the human median guess of 1207lb, but later acknowledged that the mean was even closer.






Fascinating. I favor option 3, but maybe that’s because I didn’t pick any of the initial emotion descriptions. I would have said “impatient but placated” or “happy but a bit resentful about the delayed gratification.”
My reaction is 2/3 Annoyance - I've waited since September; and 1/3 relief - the event has happened (thing has been delivered, or whatever). So I have an emotion not covered in the list, and an unequal apportion. So perhaps they should (1) run past the respondents to gather possible reactions, (2) remove synonyms (3) weight similar responses, to remove dilution between similar views (4) allow apportion between mixed views.