Large language models like Claude 2 are getting increasingly good at summarising documents like research papers. But there’s a catch: LLMs, which are typically trained with human feedback, tend to be people-pleasers.
According to a recent study by Anthropic, the company behind Claude, even the best models will often tell users what they want to hear: “we found sycophantic behavior across five AI assistants in realistic and varied open-ended text-generation settings… such behavior is likely driven in part by humans and preference models favoring sycophantic responses over truthful ones.”
The same problem crops up in real life, particularly when power dynamics are at play within a group. When discussing evidence in a meeting, one of the worst questions a senior staff member can ask is: “does anyone disagree with me?”
This question forces anyone with a differing view to publicly pit their ideas against someone more senior, creating the risk of asymmetric conflict. There may be a junior researcher who does disagree - and with good reason - but the discussion isn’t set up to elicit those insights.
During COVID, a catchphrase emerged in the weekly SPI-M-O advisory meetings. After presenting some preliminary analysis (and in real time, most analysis was preliminary), the person who’d done the work would often finish by saying: “Tell me why I’m wrong.”
Not if they were wrong, but why they were wrong. The habit had developed organically, and it led to a useful shift in dynamics. Rather than being a route to conflict, finding a problem was now the expectation. There was always going to be a caveat to the analysis, a weakness, something else to explore. If you were attending the meeting, it was your job to spot it and say what it was.
It’s an approach I’ve since tried to use more consciously in research discussions: “I’m almost certainly missing something. What am I missing?”
It’s also an approach I’ve been trying out more and more with LLMs. I now try to avoid simply asking what a document shows, or whether it includes a certain piece of information (“yes, of course it does”, “are you sure?”, “sorry, no”). Instead, I usually summarise my interpretation, then ask the LLM what I’m missing, or why I’m wrong: “I think this paper shows X. Tell me what I’m missing and why it matters.”
Sometimes it flags something helpful; sometimes it comes up with ‘gaps’ that don’t really matter. But either way, it gives me a little more confidence that I might not be totally wrong.
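If you’re scripting this against an API rather than typing it into a chat window, the same pattern is easy to wrap in a few lines. Here’s a minimal sketch using the Anthropic Python SDK; the model name, file path, and the “X” claim are placeholders, so treat it as an illustration rather than a recipe.

```python
# A minimal sketch of the "tell me why I'm wrong" prompt pattern.
# Assumes the Anthropic Python SDK; model name and file path are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

paper_text = open("paper.txt").read()  # the document under discussion

# State your own interpretation first, then ask for the gaps, rather than
# asking an open question the model can simply agree with.
prompt = (
    "Here is a paper:\n\n"
    f"{paper_text}\n\n"
    "I think this paper shows X. "
    "I'm almost certainly missing something. "
    "Tell me what I'm missing and why it matters."
)

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder model name
    max_tokens=800,
    messages=[{"role": "user", "content": prompt}],
)
print(response.content[0].text)
```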
For more on how LLMs learn from human preferences, see this short post: