Hi,
Azeem here. This is part 3/3 of our inaugural Chartpack, a new format to explore a particular exponential thesis over three to four weeks.
In the past two weeks, my colleague Nathan Warren explored the complications of evaluating large language models. In case you missed Parts 1 and 2, start here: The mismeasure of AI: How it began and The mismeasure of AI: The failings.
Over to Nathan for the final Chartpack, which explores the enormous question of measuring and evaluating LLMs!

Part 3 | The mismeasure of AI: The steps ahead
Over the past two weeks, we have delved into the complexities of AI evaluation in the context of large language models (LLMs) and the challenges they present. This week, I explore current and potential solutions to these complications, aiming to:
Enhance our understanding of the development towards artificial general intelligence (AGI), and
Evaluate the societal effects and risks posed by these models.
As LLM capabilities advance, a deeper understanding of their abilities and real-world implications is crucial. Microsoft researchers released a report suggesting that GPT-4 may be an early version of AGI (you can see commentary on this here). We need more than mere belief from a paper that hasn't yet been peer-reviewed; we need much more granularity about the capabilities and limits of these systems.

The benchmark to rule all benchmarks
To address this need, comprehensive benchmarks must be developed that cover the growing capabilities of LLMs and remain relevant over time. One solution is the Beyond the Imitation Game Benchmark (BIG-Bench), which evaluates performance on over 200 tasks, including math, bias, and common-sense reasoning. BIG-Bench performance rises slowly as models increase in size. Extrapolating this trend suggests that saturating the benchmark would require far larger models, which is infeasible with near-future compute budgets.1
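To make that extrapolation concrete, here is a minimal sketch in Python, using hypothetical model sizes and aggregate scores rather than real BIG-Bench results: fit a saturating curve to score against (log) parameter count, then ask how large a model would need to be before the benchmark approaches saturation.

```python
# A minimal sketch (not BIG-Bench's own methodology) of the extrapolation
# described above. All numbers are illustrative placeholders.
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (parameter count, aggregate score) pairs.
params = np.array([1e8, 1e9, 1e10, 1e11, 5e11])
scores = np.array([0.12, 0.18, 0.27, 0.38, 0.44])   # normalised 0-1 scores

def scaling_curve(log_n, a, b, c):
    # Logistic curve in log(parameter count): slow rise, eventual saturation.
    return c / (1.0 + np.exp(-(a * log_n + b)))

log_params = np.log10(params)
(a, b, c), _ = curve_fit(scaling_curve, log_params, scores,
                         p0=[1.0, -10.0, 1.0], maxfev=10000)

# Solve for the size at which the fitted curve reaches 90% of its ceiling.
target = 0.9 * c
log_n_target = (np.log(target / (c - target)) - b) / a
print(f"Fitted ceiling: {c:.2f}")
print(f"Projected size for 90% of ceiling: ~10^{log_n_target:.1f} parameters")
```

The point is not the particular numbers, which are made up, but that any such forecast depends on the fitted curve continuing to hold at scales we have never observed.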
However, forecasting performance may be unreliable because we have only a limited understanding of the capabilities that emerge as models increase in size. LLMs display unpredictable emergent properties as they scale: task performance can improve gradually, jump abruptly, or even get worse. Additionally, because we don't know what corpus of data these systems are trained on, we can't be sure whether a capability is genuinely emergent or the system is simply parroting back things it has been trained on.2
To tackle this issue, we need a benchmark set that can adapt to new model capabilities as they emerge over time. Stanford's Holistic Evaluation of Language Models (HELM) serves this purpose. By developing a taxonomy of use cases and metrics for measuring model performance, we can add and adapt use cases as new ones prove important over time. The taxonomy provides a broad view of what we are measuring and what we might be missing. It also tracks a range of alternative metrics (efficiency, bias, disinformation), which helps paint a more complete picture of model capabilities.
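To see why a taxonomy helps, here is a minimal sketch, with illustrative scenario and metric names rather than HELM's actual schema: use cases crossed with several metrics, with room to add new scenarios as they become important and to spot the ones we are not yet measuring.

```python
# A minimal sketch of the idea behind a HELM-style taxonomy: use cases
# (scenarios) crossed with multiple metrics. Names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Scenario:
    name: str                       # e.g. "question answering"
    description: str
    metrics: list = field(default_factory=lambda: [
        "accuracy", "efficiency", "bias", "disinformation"])

@dataclass
class Taxonomy:
    scenarios: dict = field(default_factory=dict)

    def add_scenario(self, scenario: Scenario):
        # New use cases can be slotted in as they become important.
        self.scenarios[scenario.name] = scenario

    def coverage_gaps(self, observed_use_cases):
        # A broad taxonomy also shows what we are *not* yet measuring.
        return [u for u in observed_use_cases if u not in self.scenarios]

taxonomy = Taxonomy()
taxonomy.add_scenario(Scenario("question answering", "Open-domain factual QA"))
taxonomy.add_scenario(Scenario("summarization", "Condensing long documents"))

print(taxonomy.coverage_gaps(["question answering", "code generation"]))
# -> ['code generation']  # a use case we are not yet evaluating
```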
Superhuman capabilities
BIG-Bench and HELM offer crucial insights into model capabilities; however, as AI is anticipated to undergo exponential growth and potentially surpass human intelligence, these benchmarks may soon become less relevant. In the face of such rapid advancement, it might become challenging, or even futile, to directly evaluate the capabilities of these highly advanced models.