Exams won't tell us whether AI has reached 'peak intelligence'
Acing hard questions isn't the same as doing hard thinking
What is a gift?
Would you rather be a vampire or a zombie?
Improve the rules of any one sport.
Get your pen and paper out. You have three hours.
The Examination Fellowship at Oxford’s All Souls College is famous for the brevity and open-endedness of the questions in its two ‘general exams’. To succeed requires creativity, thoughtfulness, broad knowledge – and an ability to distil all of these into a fast, coherent response. As a result, it’s been dubbed the ‘hardest exam in the world’.
Or is it?
Experts and exams
Last week, Scale AI and the Center for AI Safety announced a call for submissions for what they called ‘Humanity’s Last Exam’. They want people to submit questions – along with correct answers – that even the best artificial intelligence would currently struggle with. They say it is ‘a project aimed at measuring how close we are to achieving expert-level AI systems’.
In recent years, AI has become better and better at passing common exams. AI has even made progress on particularly challenging questions. In July this year, DeepMind’s AlphaProof got the equivalent of a Silver Medal at the International Mathematical Olympiad, the premier competition for pre-university mathematics students. One of the questions AlphaProof got right was solved correctly by only five of the human competitors.
But will ‘Humanity's Last Exam’ really tell us anything about ‘peak intelligence’ or ‘expert-level performance’? It’s useful to be able to benchmark AI, and one of the reasons AI has been so successful in games like Go and competitions like protein structure prediction is that there is a definitive outcome: you either win the game or predict the correct structure. However, most of life – and the intelligence required to navigate it – does not fall neatly into this category.
Building and demonstrating expertise is not just about passing exams. In a post last year, Daisy Christodoulou offered a thoughtful discussion of the purpose of exams in the era of AI. She made the point that we shouldn’t prioritise AI-proof exams above all else. The reason? Some fundamental skills – which AI will often be able to outperform humans on – need to be developed on the route to deeper expertise. She argued that ‘the point of an assessment is not the product but the process’:
Fundamentally, it does not matter that a computer can answer this question better than the student. What matters is what the student has learnt from answering it, and what it tells us about their understanding.
The basic assessment principle here is the difference between the sample and the domain. The sample is the test itself. The domain is the student’s wider understanding. The sample only matters if it tells us something valuable about the domain — otherwise it is worthless.
The risk with treating exams as the ultimate challenge for AI is that we end up defining expertise and intelligence based on what can be easily measured and checked, rather than what is actually required to cultivate it. Even though the best AI models can ace some science exams, they still struggle with the process of doing actual scientific research in many fields. (Believe me, I’ve tried several tools on several thorny research problems, often with limited success.)
In general, the only current exceptions are topics more open to the computational ‘brute force’ of deep learning approaches. To get a sense of where AI will probably contribute most to scientific progress over the coming years, it’s worth considering which scientific questions an investigator would focus on if someone recruited 100 new research assistants to work with them. It wouldn’t be questions that require a lot of tailored expert input at every step, like investigating 100 different outbreaks in 100 different locations. It would be questions that could be split into parallel tasks, with standardised methods that scale and results that can be easily checked. In other words, it would be exam-like questions.
Process and product
I’ve written a lot of exam questions over the years, and supervised a lot of scientific projects. And while exams are useful for testing whether students have a good grasp of concepts and skills, providing a foundation for the next stage of their training (i.e. ‘the process’), they are not the ultimate test of expertise. And they never will be.
In the wider world of science, the real test – the real ‘product’ – is whether someone can go out and tackle real problems. The sort of problems that don’t already have a neat ‘correct’ answer we can enter into an online submission form. The sort of problems that people haven’t worked out how to solve yet, which might have solutions that end up overturning a lot of existing theories and textbooks.
In other words, the sort of problems that require a true expert.
Cover image: Chris Liverani via Unsplash
I like your different perspectives here about exams being about process or product. I have always considered the summative assessments used at the end of an academic course to be a blunt but fair method to establish some kind of individual student ‘level’ of achievement. But when you start to consider this as the product and the next stage (whatever it may be) as the more valuable process, the final exam becomes formative, at which point the student needs more feedback to really understand its value. And I don’t think this ever happens, does it?
Adam, thank you for pointing to a ‘hope’ for us mere mortals.