Hi,
Azeem here. We’re introducing 📈 Chartpacks, a new format for investigating the questions we care about through quant and qual assessment. Each Chartpack will explore a particular exponential thesis over three to four weeks.
The first part of each Chartpack will be available to all recipients of the newsletter. The subsequent parts will be sent to the paying members of Exponential View.
We’re aiming to produce 13-15 of these a year.
In the first Chartpack, EV team member Nathan Warren explores how the way we evaluate AI systems has changed, and the challenges that large language models like ChatGPT pose to that evaluation.
You can find part 2 and part 3 here.
Part 1 | The mismeasure of AI: How it began
What if the way we evaluate artificial intelligence was flawed?1 The rapid rise of ChatGPT and other large language models (LLMs) has left us struggling to understand where we stand in the AI landscape. Old standards, like the problematic Turing Test2, are no longer relevant, with GPT-4's output already (mostly) indistinguishable from human-written text. However, this doesn't mean that it has reached human-level intelligence, only that it can mimic our outputs. Even OpenAI’s Sam Altman deemed it "a bad test" for these models.
This leaves us in a predicament. How do we understand the capabilities and impacts of these models?
AI benchmarks - measurements used to evaluate the performance of various AI models in a standardised manner - play a crucial role in this understanding.
Unfortunately, existing benchmarks and evaluation techniques for AI contain numerous flaws that have been exacerbated with the rise of LLMs. In this series, we’ll explore the current state of AI evaluation and how researchers are fixing it to ensure the safe and more measured development of these models.
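To make the idea of a benchmark concrete, here is a minimal sketch of what "standardised measurement" looks like in code. The tiny question set, the exact-match grading rule, and the toy model below are all hypothetical and not drawn from any real benchmark; the point is simply that every system answers the same fixed tasks and is graded the same way.

```python
# A minimal, hypothetical sketch of a benchmark harness: fixed tasks,
# a fixed scoring rule (exact-match accuracy), applied identically to
# every model under test.

def exact_match_accuracy(model, benchmark):
    """benchmark: list of (prompt, reference_answer) pairs."""
    correct = 0
    for prompt, reference in benchmark:
        prediction = model(prompt)  # the system under evaluation
        correct += prediction.strip().lower() == reference.strip().lower()
    return correct / len(benchmark)

# Illustrative usage with a toy "model" that always answers "Paris".
benchmark = [("Capital of France?", "Paris"), ("2 + 2 = ?", "4")]
always_paris = lambda prompt: "Paris"
print(exact_match_accuracy(always_paris, benchmark))  # 0.5
```

The flaws discussed in this series live in exactly these choices: which tasks go into the fixed set, and whether a single scoring rule can capture what we actually care about.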
Before we move forward, I’d like to thank Exponential View members who made themselves available to read early drafts and gave their input into this first Chartpack. In particular, thanks to Ramsay Brown and Rafael Kaufmann!

Pawns of progress
By the 1980s, game playing, especially chess, had become a centrepiece of AI research. Chess has long been viewed as a test of intelligence. With well-defined rules and a finite but computationally complex structure3, chess presented a challenging yet surmountable problem. The game’s quantitative rating system, Elo, served as a benchmark for AI researchers to measure their models’ progress over time. As models improved, they climbed the Elo rankings, surpassing amateurs, professionals, and eventually defeating world champion Garry Kasparov in 1997 - a landmark in AI history.
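For readers who want the mechanics, here is a minimal sketch of the standard Elo update rule; the ratings and K-factor below are illustrative only and not tied to any particular engine or match.

```python
# The standard Elo rule: expected score follows a logistic curve in the
# rating gap, and a rating moves by K times (actual result - expected result).

def expected_score(rating_a, rating_b):
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a, rating_b, score_a, k=32):
    """score_a is 1 for a win, 0.5 for a draw, 0 for a loss."""
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))

# Illustrative numbers: a 2600-rated engine beating a 2800-rated opponent
# gains far more rating than it would for beating a weaker player.
print(round(update(2600, 2800, 1)))  # ~2624
```

Because the expected score is anchored to the opponent's rating, the scale rewards upsets heavily, which is what made it a convenient yardstick for tracking machine progress against ever-stronger human opposition.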
A byte-sized shift
Until the last couple of years, researchers tended to design AI systems to excel at specific tasks such as playing chess, recognising speech, or translating languages. These models, called narrow AI, were limited to the tasks they were designed to perform.
However, the ultimate goal of the discipline since its inception in the 1950s has been to create an AI system that can generalise across tasks, mimic human intelligence, and create new concepts. This is referred to as artificial general intelligence (AGI).
The exact path to AGI was never clear. However, we may have found a way to make progress using data-driven approaches - using large amounts of data to train and improve AI models. In the 2000s, Microsoft researchers studied the factors influencing AI system performance, particularly in natural language disambiguation tasks4. Their findings revealed that the choice of model mattered less for performance than the availability and quality of training data. This insight spurred a shift in AI research towards data-driven approaches.
The focus on data-driven approaches led to the development of large-scale language models trained on vast amounts of data (e.g., GPT-3 was trained on nearly a trillion words).
To capture the increasingly complex relationships within these datasets, models required more parameters5, significantly increasing their size.
Benchmark-busting beasts
The pursuit of larger models has yielded impressive results, with some even suggesting the recently released GPT-4 is an early version of AGI.
LLMs have become so complex that they are difficult to evaluate using traditional AI benchmarks designed for narrow tasks. For instance, LLMs can generate new code, critique arguments, and even understand images. These capabilities are not evaluated in older benchmarks.
This has led to a surge in new natural language processing benchmarks since 2014, as researchers seek more comprehensive measures.

LLMs are considered general-purpose technologies with potentially wide-reaching societal and economic ramifications. As a result, it is essential to have the appropriate evaluative benchmarks to guide and maintain control over their impact.
In next week’s Chartpack (for members only), we will explore the challenges of evaluating LLMs and the potential societal consequences if we fail to address them appropriately.
Nathan’s research for this Chartpack reminded me of Stephen Jay Gould’s Mismeasure of Man, a book I read nearly 40 years ago. Gould critiques how measurements of human intelligence were misused to justify biological determinism and social inequality. - Azeem
The Turing test evaluates a machine’s ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human.
There are an estimated 10^43 board positions.
Natural language disambiguation is the process of determining the correct contextual meaning of a word. For example, “bank” can mean either a financial institution or the side of a river, depending on the context.
Parameters are the internal values a model learns from its training data; they determine how the model responds to a prompt, so changing them changes the response.
I like the idea of Chartpacks. Looking forward to more.
Moving past the Turing Test is refreshing. The notion that we are moving toward AGI is in the air, but the notion is not crisp. Goethe's Faust comes to mind (The Study, scene 3). Faust moves from an examination of ‘the word’ to ‘the act’. Is this line of thinking useful for considering benchmarks?
It seems like a new level of reinforcement learning has been central to recent progress. Should we measure interactive human input in addition to the size of text examined and parameter count?
Examination of natural scenes as training data may loom large. Another thing to measure.
The BLOOM project from BigScience suggests the amount and quality of energy required is important.
I just did a quick GPT check. Evidently about 15% of the world's population speaks English as a first or second language. To the extent that we are interested in general intelligence, measuring performance across languages is important. In this regard, the reference to Stephen Jay Gould’s Mismeasure of Man is especially valuable.