Small hallucinations, big problems
Why a 1% hallucination risk is terrible if you’re using LLMs to look for unusual events
Imagine that you’ve collected a pile of news reports from yesterday, and you want to check whether anything particularly unusual has happened. Perhaps you’re looking for a specific type of emerging crisis, or a scientific discovery. Let’s say you expect to find the thing you’re looking for a handful of times per decade on average. So on a given day, there’s about a 1 in 1000 chance of it being there.
Reviewing reports takes a lot of time, so to make things easier, you’ve decided to use some large language model (LLM) agents to check for unusual events. In a popular ‘hallucination leaderboard’, the best models have just under a 1% probability of hallucinating when generating a short summary of a short document. So let’s assume, optimistically, that there’s a 1% chance they’ll get it wrong (i.e. miss a genuine event, or say there is one when there isn’t) when trying to detect events in daily news reports.
You set the LLM agents to work in the morning. Before long, they’ve come back with the conclusion that the unusual event you’re looking for has happened.
What’s the probability that this is true? It might be tempting to say 99%, given the 1% hallucination risk we’ve assumed. But remember, our baseline assumption is that there’s only a 1 in 1000 chance of an unlikely event on a given day. So we need to consider two possibilities: the event is genuinely in the reports and the LLM has flagged it, or the event isn’t and the LLM has hallucinated.
How likely are these two possibilities? For the first one, we’re assuming there’s a 0.1% (i.e. 1 in 1000) chance the event happened, and a 99% probability the LLM would spot it if it did. Which gives us¹:
P(LLM flags event and the event happened) = 0.1% × 99% = 0.099%
For the second possibility, there’s a 99.9% (i.e. 999 in 1000) chance the event did not happen, and a 1% probability the LLM would have hallucinated in this scenario. Which gives us:
P(LLM flags event and the event did not happen) = 99.9% × 1% = 0.999%
In other words, there’s a 0.099% probability the LLM flags an event and the event happened – but a 0.999% probability (which is ten times larger) that the LLM flags an event and the event did not happen.
Hence if the LLM flags an event, ten times out of eleven, it will be a false alarm. Even though the raw hallucination rate is 1%, because events are rare, there is a more than 90% probability that the event has not happened, given the LLM flags it.
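The whole calculation can be checked in a few lines of Python, using only the numbers assumed above:

```python
# Posterior probability that the event really happened, given the LLM flags it,
# under the assumptions in the text: a 1-in-1000 daily base rate, a 99% chance
# of spotting a genuine event, and a 1% hallucination rate.
p_event = 0.001               # 1 in 1000 chance the rare event happened today
p_flag_given_event = 0.99     # LLM spots a genuine event 99% of the time
p_flag_given_no_event = 0.01  # 1% chance of a hallucinated flag when nothing happened

# Bayes' theorem: P(event | flag) = P(flag and event) / P(flag)
p_true_positive = p_flag_given_event * p_event             # 0.099%
p_false_positive = p_flag_given_no_event * (1 - p_event)   # 0.999%

posterior = p_true_positive / (p_true_positive + p_false_positive)
print(f"P(event happened | LLM flags it) = {posterior:.1%}")  # about 9%
```

So a flag from the LLM is genuine only about one time in eleven, exactly as the ratio of the two joint probabilities suggests.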
Statistics fans will recognise the above as an application of Bayes’ Theorem, which allows us to work out the probability something is true given a test result, based on our prior assumptions about the chances of it happening, and the reliability of the test (i.e. LLM analysis).
As the calculation above shows, if a 1% hallucination-prone LLM flags an event based on reports, the probability that the event really happened depends strongly on the rarity of the underlying event.
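To see this dependence on rarity directly, here is a rough sketch that repeats the same Bayes calculation across a range of illustrative base rates (the specific base rates chosen here are my own examples, not from the original chart):

```python
# How the probability that a flagged event is genuine varies with its base rate,
# holding fixed the 99% detection rate and 1% hallucination rate assumed above.
def posterior(p_event, p_detect=0.99, p_hallucinate=0.01):
    true_pos = p_detect * p_event
    false_pos = p_hallucinate * (1 - p_event)
    return true_pos / (true_pos + false_pos)

for base_rate in [0.5, 0.1, 0.01, 0.001, 0.0001]:
    print(f"base rate {base_rate:>7}: P(event real | LLM flags it) = {posterior(base_rate):.1%}")
```

For a 1-in-2 event a flag is almost certainly genuine; at the 1-in-100 base rate true and false flags are equally likely; and the rarer the event, the more a flag becomes a likely false alarm.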
Humans to the rescue
You might think this isn’t much of a problem. We can just get the LLM to give us the source in the reports, and go away and check it. But what if we’re interested in tracking a very large number of rare events every day? And, based on the above, each LLM-flagged event is overwhelmingly likely to be a hallucinated false positive? The need for a constant human-in-the-loop will quickly reduce any productivity benefits of the LLM.
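To put a rough number on that reviewing burden, suppose we track many independent rare event types at once (the figure of 1,000 tracked event types is a made-up assumption for illustration, not from the text):

```python
# Expected daily alarms when monitoring many independent rare event types,
# each with a 1-in-1000 daily base rate, 99% detection, and a 1% hallucination rate.
n_event_types = 1000  # hypothetical number of rare event types being tracked
p_event = 0.001
p_detect = 0.99
p_hallucinate = 0.01

expected_false_alarms = n_event_types * (1 - p_event) * p_hallucinate
expected_true_alarms = n_event_types * p_event * p_detect

print(f"Expected false alarms per day: {expected_false_alarms:.1f}")
print(f"Expected genuine alarms per day: {expected_true_alarms:.2f}")
```

Under these assumptions a human reviewer faces around ten false alarms a day for every genuine one, and that checking work scales linearly with the number of events being monitored.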
Similar issues are cropping up in other studies. Take a randomised trial published last month, which found AI tools actually slowed down open-source developers. This was particularly remarkable given the developers had expected the AI to make them faster – and even estimated it had made them faster when asked after the experiment.
As LLM-based analysis becomes more common, the above shows it’s not enough to look at baseline hallucination rates. (Especially as hallucination evaluation models can also hallucinate – I’ve got some interesting findings here that I’ll expand on in a future post.) If we’re interested in rare events, from health threats and business tail risks to unusual computer bugs and intelligence monitoring, even a small hallucination risk can soon produce an overwhelming number of false positives at scale. And it will no doubt fall to a human to check them.
If you’re interested in Bayes’ Theorem, AI and different forms of error, you might like my latest book Proof: The Uncertain Science of Certainty. I recently spoke to Science Friday if you’d like to hear more, and the book was reviewed in Science last week, which called it ‘an expansive exploration of certainty in science, from geometry and jurisprudence to randomized trials and neural networks’.
1. This calculation works by using the probability rule: P(A and B) = P(A given B) × P(B)




Why, isn't that Tversky and Kahneman's Taxicab Problem once again? (cf Amos Tversky and Daniel Kahneman, "Evidential Impact of Base Rates", No. TR-4, Stanford University Department of Psychology, 1981)
Once you understand the problem, you find it almost everywhere you look. My conjecture is that this is all there is to Psychology's current replication crisis, but I could be wrong. What we can be sure of is that scores of scientists confuse the type I error rate with the probability of their theory being wrong, which has elsewhere been dubbed "the prosecutor's fallacy" because it's also endemic in legal proceedings.
This Bayesian problem also popped up elsewhere in not-too-distant history, to wit: AIDS tests. They faced the usual dilemma of not wanting to let a truly infected person remain undetected (i.e. maximize sensitivity), but also not letting themselves be buried in false positives (i.e. retain a manageable specificity).
IIRC their solution in the 1980s was a sequence of tests called ELISA and Western Blot, respectively. ELISA took care of the sensitivity, and when that came back negative, you could be quite sure you were not infected. If ELISA found something, they confirmed or refuted it with Western Blot which was used to weed out the false positives.
Which makes me wonder: Are LLMs really an ELISA for emerging crises? I mean, just because LLMs are "hallucinating" doesn't mean they catch everything there is to catch. And if they are something like ELISA, then what could be the Western Blot for LLMs?
Adam, I am always learning a new thing from your posts. Thank you! Please write further on the fact that "hallucination evaluation models can also hallucinate". It is quite scary. Hopefully, you'll teach us some solution Adam.