🔥 What to expect when you’re expecting GPT-5
Imagining how the new models would change our daily work
An interview with Eric Schmidt in Noema magazine got me thinking about where we are heading with AI. In the interview, he makes this prediction:
The key thing that’s going on now is we’re moving very quickly through the capability ladder steps. There are roughly three things going on now that are going to profoundly change the world very quickly. And when I say very quickly, the cycle is roughly a new model every year to 18 months. So, let’s say in three or four years.
This week, Leopold Aschenbrenner, who was fired from OpenAI’s safety team for allegedly leaking material, published a 165-page blog post on AI and the decade ahead1. The post was received with both excitement and dismissal by various social media factions. (Examples of criticism are this one and this one by EV reader Vishal Gulati.)
In this short essay, I won’t critique the full document. Aschenbrenner goes into lots of detail about geopolitics, security, energy use, and more. Instead, I want to explore the implications of his argument about the trajectory of model capabilities.
Aschenbrenner says:
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured.
it is strikingly plausible that by 2027, models will be able to do the work of an AI researcher/engineer.
Aschenbrenner’s view is in the same ballpark as Schmidt’s, and it roughly lines up with what Microsoft’s CTO, Kevin Scott, has said. Scott talks about just how good GPT-5 promises to be:
Some of the early things that I’m seeing right now with the new models [GPT-5] is maybe this could be the thing that could pass your qualifying exams when you’re a PhD student. Everybody’s likely going to be impressed by some of the reasoning breakthroughs that will happen.
Scott, Aschenbrenner, and Schmidt argue that these increased capabilities will come from scaling: throwing more computing power and data at the models. These bigger models are better: more capable of generalising, better at working with text, video, images and other types of data, more capable of holding context over long periods, more factual, and more precise. This idea, the scaling laws, is a widely held perspective that I’ve heard from other AI builders in the US and China.
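To make the scaling-laws idea concrete, here is a minimal sketch using the loss formula and fitted constants published in the Chinchilla paper (Hoffmann et al., 2022). The model sizes in the loop are hypothetical and every number is illustrative, not a prediction about GPT-5.

```python
# Chinchilla-style scaling law: loss falls predictably as parameters (N) and
# training tokens (D) grow. Constants are the published Chinchilla fit;
# treat every number here as illustrative.

def predicted_loss(n_params: float, n_tokens: float) -> float:
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Hypothetical models, each trained with roughly 10x the compute of the last.
for n, d in [(1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]:
    print(f"N={n:.0e} params, D={d:.0e} tokens -> predicted loss ~{predicted_loss(n, d):.2f}")
```

The point of the sketch is simply that each jump buys a smaller but still meaningful drop in loss, which is the quantitative backbone of the “bigger is better” claim.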
For this discussion, let us accept the following ideas:
GPT-4 is more capable than GPT-3, which was more capable than GPT-2.
Scaling, as described by the scaling laws, accounts for those differences.
The scaling laws will hold for at least 2-3 more generations.
This means that subsequent models will show step-change improvements in capability: GPT-5 will be to GPT-4 as GPT-4 is to GPT-3.
So, I want you to hold these claims as axiomatic for this essay and discussion.
We could argue about how long a “generation” is. Aschenbrenner says a couple of years, Schmidt says four years.
Each new generation of models is exponentially more demanding to build than the previous one. They require an order of magnitude more computing operations, which means many more chips and much more power. The supply of chips, data centres, power and financial capital could bottleneck the most advanced frontier models over the next four to five years; it takes time to build these physical things. So I reckon a ‘generation’ is more likely to look like Schmidt’s four years.
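For a rough sense of why the physical build-out matters, here is some back-of-the-envelope arithmetic. The GPT-4 training-compute baseline is a widely cited external estimate, not an official figure, and the 10x-per-generation multiplier is simply the “another zero” assumption from above.

```python
# Back-of-the-envelope: if each generation needs roughly 10x the training
# compute of the last, requirements compound quickly. Baseline is a rough
# public estimate of GPT-4's training compute (~2e25 FLOP), used only for scale.

baseline_flop = 2e25       # assumed GPT-4-class baseline (external estimate)
per_generation = 10        # "another zero" each generation

for gen in range(1, 4):
    total = baseline_flop * per_generation**gen
    print(f"{gen} generation(s) on: ~{total:.0e} FLOP ({per_generation**gen}x baseline)")
```

Three generations on is a thousand times today’s assumed baseline, which is why chips, power and capital, rather than ideas, start to look like the binding constraint.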
Let’s assume these scaling laws hold and that the timelines are roughly right.2 What kind of real-world impact would such a leap in performance have?
Faster, better, longer
We need to move from the technical aspects of these systems to what they can actually do for us. Scientific benchmarks aren’t much help here because they are often divorced from how you and I use these tools in our daily life.
I find one of Sam Altman’s frameworks quite clarifying. It’s not scientific but it’s compelling. He talks about “the five-second tasks, the five-minute tasks, the five-hour tasks, maybe even the five-day tasks” that AI could do.
We could apply this to our experiences with LLMs: GPT-3 or GPT-3.5 was good at five-second tasks; GPT-4 is good at five-minute tasks, and perhaps GPT-5 will be good at five-hour tasks (or at least 50-minute tasks!).
Let’s break that down:
GPT-3 - good at tasks that take a human five seconds, like parsing a sentence or producing a sentence fragment. On anything that takes longer, it performs poorly and is prone to errors and confabulations.
GPT-4 - good at tasks that would take us five minutes, like summarising a short document, researching a topic and producing a summary, or coming up with a problem statement and a set of ideas. It struggles with tasks that might take longer than five minutes, and it still makes things up, often in pernicious ways.
This is only a rough framing, of course. I’ve used Claude 3 (which is basically GPT-4 quality) to read and analyse my book. That’s not a five-minute task, but it does it very well, and I know the subject matter.
But what comes next? Well, in Sam’s telling of the story, we could expect GPT-5 to handle five-hour tasks. This is congruent with Eric Schmidt’s argument that in the next five years, these machines will be able to undertake tasks that have 1,000 discrete steps. This is pretty substantial.
What is a five-hour task? It might be, for example, writing an essay like this, with citations and images. Or conducting a review of competitors and putting together a summary report. Or performing a detailed literature search in a narrow domain. Or completing an entire quarterly VAT filing, finding the invoices in an inbox and submitting them to tax prep software.3
We should also expect to see these models—while still unreliable—become substantially more reliable than previous versions. GPT-4 isn’t just better at longer tasks than GPT-3; it is also more factual.
The question I’ll pose to all of you is this: what does it mean to have a piece of software that can do a task that would take a well-trained human - say MSc-level (although Kevin Scott says PhD-level) - five hours, and can do it in any domain? That piece of software doesn’t cost much… Perhaps it’s free, $20 a month, or a little more.
And once you access a GPT-5-class model, you can use dozens or more of those PhD-level software assistants. A business could potentially run hundreds of thousands or millions. A state, billions.
Take my own experience
I analysed my usage of LLMs, which spans Claude, GPT-4, Perplexity, You.com, Elicit, a bunch of summarisation tools, mobile apps and access to the Gemini, ChatGPT and Claude APIs via various services. Excluding API access, yesterday I launched 23 instances of various AI tools, covering more than 80,000 words. This included the transcript of a four-hour podcast, which I wanted to query, and a bunch of business and research questions.
Today was quite fragmented. I was dealing with clients and business issues, and I didn’t have much time to get really deep into research. When I do, I might easily run up much larger numbers, with deep, extended queries that sometimes break the context window of Claude 3 (around 150,000 words or so). I’ve started to use agentic workflows in Wordware (where I am an investor) to automate several LLMs working together on a research or analysis task.
But if you look at the 23 instances above, we aren’t talking about five-minute tasks here. We’re talking about bundles of tasks that would take a typical person well over five minutes. Transcribe a four-hour podcast, summarise it, and pull out the information I need from it? That would take closer to five days than five hours.
My experience is that a world in which we individually use more and more agents/LLMs/call-them-what-you-will is more likely than a world in which we don’t.
But that is the first order. The second-order effect is that my ambitions grow. Knowing I have access to these tools expands my willingness to use them.
I recently gave a speech at AI For Good at the ITU in Geneva. I was opening this big international conference, speaking just after Antonio Guterres and Doreen Bogdan-Martin, who runs the agency. It was high stakes. In this case, I used GPT-4 and Claude to analyse a range of famous speeches from Kennedy, Churchill, and others and draw out the commonalities between them. They then critiqued each other’s analytical frameworks. Then, I asked them to build a scorecard for evaluating great speeches. (Once again, they did the task, reviewed each other’s work, and synthesised the results.) I then ran my speech through that template, and each LLM scored me against transcripts of other speeches given at the ITU. I instructed the LLMs to give me specific feedback. Some of this feedback I actioned, redrafting by hand; some I didn’t. And I rinsed and repeated until my speech scored 30/30 on the criteria. It was a much better speech, though perhaps not totally Churchillian.
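For readers who want to see the shape of that workflow, here is a minimal sketch of the draft-score-redraft loop. The `call_llm` function is a placeholder for whichever model API you use (GPT-4, Claude, and so on); the 30-point scale, the prompts and the rubric handling are illustrative assumptions rather than my exact setup.

```python
# A minimal sketch of the draft -> score -> feedback -> redraft loop.
# `call_llm` is a placeholder, not a real API client; the prompts and the
# 30-point target are illustrative only.

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to your chosen LLM and return its reply."""
    raise NotImplementedError

def score_speech(draft: str, rubric: str) -> int:
    reply = call_llm(
        "Score this speech out of 30 against the rubric. Reply with one integer.\n"
        f"Rubric:\n{rubric}\n\nSpeech:\n{draft}"
    )
    return int(reply.strip())

def improve_speech(draft: str, rubric: str, target: int = 30, max_rounds: int = 5) -> str:
    for _ in range(max_rounds):
        if score_speech(draft, rubric) >= target:
            break
        feedback = call_llm(
            f"Give specific, actionable feedback on this speech against the rubric:\n"
            f"{rubric}\n\nSpeech:\n{draft}"
        )
        # In my case some of the redrafting happened by hand; here the model does it all.
        draft = call_llm(
            f"Rewrite the speech to address this feedback:\n{feedback}\n\nSpeech:\n{draft}"
        )
    return draft
```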
On the point of growing ambition: before these assistants, I would simply have aimed to write as good a speech as I could. With them, my ambition was higher. By using GPT-4-quality systems, my expectations of myself have changed as well.
But of course, we need to think about:
GPT-5 and beyond
The fleets of AI assistants supporting us
The ripple effects of AI abundance
Not malfunction. Number 5 is alive
So, what will happen with a model, let’s call it GPT-5, that is more capable than GPT-4?