Azeem here. As you may know already, we’re introducing 📈 Chartpacks, a new format to explore a particular exponential thesis over three to four weeks.
The first part of each Chartpack will be available to all recipients of the newsletter. The subsequent parts are open to paying members of Exponential View.
You are reading Part 2 of our inaugural Chartpack, researched and written by EV team member Nathan Warren. Nathan explores the complications of evaluating large language models. In case you missed it, start with the first part: The Mismeasure of AI: How it Began.
Part 2 | The mismeasure of AI: The failings
In this week’s Chartpack, I explore some of the challenges of evaluating LLMs and the potential societal consequences if we fail to address them appropriately. Last week, we covered the current state of AI and its evaluation landscape, which you can read here.
Slow down, you move too fast
Large language models are rapidly advancing in proficiency across a wide range of tasks, and they are reaching near-peak performance (often around 90%) on established benchmarks faster and faster. At that point, the rate of improvement flattens, a phenomenon known as saturation. Once a benchmark saturates, it is no longer useful for measuring progress.
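One way to make "saturation" concrete: a benchmark has saturated once top scores sit near the ceiling and the gain per release cycle has flattened. A minimal sketch, using illustrative numbers (not real GLUE data) and hypothetical thresholds:

```python
# Hypothetical top-model scores per release cycle (illustrative, not real data)
scores = [60.0, 72.0, 83.0, 89.0, 90.5, 91.0, 91.2]

def is_saturated(scores, ceiling=100.0, near=0.9, min_gain=1.0):
    """Treat a benchmark as saturated when the latest top score is near the
    ceiling (e.g. ~90%) and the most recent gain is below min_gain points."""
    if len(scores) < 2:
        return False
    latest, prev = scores[-1], scores[-2]
    return latest >= near * ceiling and (latest - prev) < min_gain

print(is_saturated(scores))        # → True: 91.2 is near the ceiling, gain is 0.2
print(is_saturated([60.0, 72.0]))  # → False: still far from the ceiling
```

The thresholds here are arbitrary; the point is only that "saturation" is about both absolute level and flattening rate of improvement.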
For example, the GLUE benchmark - designed to evaluate language model performance across a range of language understanding tasks - reached saturation within two years. In response, the designers of GLUE developed SuperGLUE, a collection of more challenging tasks.
Guess what happened. SuperGLUE reached saturation in only 18 months. (What’s next, UltraGLUE???)
The short lives of these benchmarks make it difficult to know exactly where we stand in the field of AI - and harder still to know where we are heading when the goalposts keep shifting.
Anything you can do (AI can do better)
The goalposts aren’t just shifting; they’re getting wider. Not only are LLMs getting better at existing tasks, but it is becoming apparent that “human performance on a task isn’t the upper bound” of that task. LLMs are also developing new capabilities: they are the first AI models able to generalise across different tasks, displaying emergent abilities such as reasoning and creativity.1