The Financial Times quoted me last week calling GPT-5 “evolutionary rather than revolutionary.” Then it twisted the knife: “Release of eagerly awaited system upgrade has been met with a mixed response, with some users calling the gains ‘modest.’”
“Modest” is one of those quietly loaded words. The Oxford English Dictionary defines it as ‘relatively moderate, limited or small’. ‘Relatively’ does a lot of work there. Compared with GPT-4 in March 2023, GPT-5 is a huge leap. Compared with bleeding-edge models released just months ago, it feels incremental. And compared with the sci-fi hopes, it’s restrained.
The funny thing is, GPT‑5 does what no model before it could, yet in the same breath makes its own shortcomings impossible to ignore.
Over the past week of using GPT‑5, I’ve been tracking these tensions. Today, I’ll break down the five paradoxes that define GPT-5’s release and help explain why so many people find it confusing.
These five paradoxes show how GPT-5 can be the most capable model so far and still earn that stubborn label: ‘modest’.
1. The moving-goalposts paradox
The smarter AI gets at our chosen benchmarks, the less we treat those benchmarks as proof of intelligence.
We measure machine intelligence through goalposts – tests, benchmarks and milestones that promise to tell us when a system has crossed from ‘mere software’ into something more. Sometimes these are symbolic challenges, like beating a human at chess or passing the Turing test. Other times they are technical benchmarks: scoring highly on standardised exams, solving logic puzzles or writing code1.
These goalposts serve two purposes: they give researchers something to aim for and they give the rest of us a way to judge whether progress is real. But they are not fixed. The moment AI reaches one goalpost, we often decide it was never a real measure of intelligence after all.
The first goalpost to shift was the Turing test.
Proposed in 1950 by Alan Turing, the “imitation game” offered a practical way to sidestep the slippery question “can machines think?”. Instead of debating definitions, Turing suggested testing whether a machine could respond in conversation so convincingly that an evaluator could not reliably tell it from a human.
I propose to consider the question, ‘Can machines think?’ This should begin with definitions of the meaning of the terms ‘machine’ and ‘think.’ The definitions might be framed so as to reflect so far as possible the normal use of the words, but this attitude is dangerous. If the meaning of the words ‘machine’ and ‘think’ are to be found by examining how they are commonly used, it is difficult to escape the conclusion that the meaning and the answer to the question ‘Can machines think?’ is to be sought in a statistical survey such as a Gallup poll. But this is absurd. Instead of attempting such a definition I shall replace the question by another, which is closely related to it and is expressed in relatively unambiguous words. The new form of the problem can be described in terms of a game which we call the ‘imitation game.’

For decades the test stood as the symbolic summit of AI achievement. Then, in June 2014 – four years before GPT‑1 – a chatbot named Eugene Goostman became the first to “pass”. Disguised as a 13‑year‑old Ukrainian boy, it fooled 33% of judges in a five‑minute exchange. Its strategy was theatrical misdirection: deflect tricky questions, lean on broken English and exploit the forgiving expectations we have of a teenager. As one critic observed at the time:

The winners aren’t genuinely intelligent; instead, they tend to be more like parlor tricks, and they’re almost inherently deceitful. If a person asks a machine “How tall are you?” and the machine wants to win the Turing test, it has no choice but to confabulate. It has turned out, in fact, that the winners tend to use bluster and misdirection far more than anything approximating true intelligence.
Earlier this year, a paper claimed that GPT‑4.5 passed a more rigorous, three‑party Turing test, with judges rating it human in 73% of five‑minute conversations. Whether this counts as a pass is still contested. Defenders note that Turing’s test measured substitutability – how well a machine can stand in for a human – not genuine understanding, and by that yardstick GPT‑4.5 cleared the bar. Critics counter that short exchanges are too forgiving and that a meaningful pass would require longer, open‑ended dialogue.
But if we say that AI has passed the Turing test, what does that even mean? The victory feels hollow. Once systems beat the Turing test, we moved the bar: from conversation to formal benchmarks. LLMs went on to crush many of these, too. Yet the same pattern holds: the smarter the system, the less its achievements feel like proof of intelligence.
Here the conversation shifts from tests we can name to a target we cannot agree on. Part of the problem is definitional: there is no consensus on what artificial general intelligence is. Is it matching human cognition across all domains, or being a flexible, self‑improving agent? Intelligence resists collapsing into a single score. Decades of IQ debates show that. Is AGI a universal problem‑solver, an architecture mirroring human thought, or a form of consciousness? With such a hazy target, success will always feel provisional.
Sam Altman now calls AGI ‘not a super useful term.’ I’ve long found the term problematic; it’s not an accurate descriptor of what LLMs are or of how useful they are. Suppose a system were truly “intelligent” in the human sense. Couldn’t we train it only on knowledge up to Isaac Newton and watch it rediscover everything humanity has learned in the 300 years since? By that standard, GPT‑5 is nowhere close – and I did not expect it to be. Its goal was not raw knowledge accumulation, which arguably defined GPT-4’s leap. GPT-5’s focus was on action: better tool use and more agentic reasoning.
GPT-5 performs better on some benchmarks measuring agentic tasks and is on a par with others.2 And compared with the last generation, the raw jumps are striking: on GPQA Diamond (advanced science questions), GPT-4 scored 38.8%, GPT-5 scored 85.7%;3 on ARC-AGI-1, GPT-4o managed 4.5%, GPT-5 hit 65.7%.
Yet the wow factor is muted. Most people are not measuring GPT‑5 against GPT‑4 from March 2023. They are stacking it against o3 from just a few months ago. Frontier models arrive in rapid succession, the baseline shifts at speed and each breakthrough lands half‑forgotten. In that light, even a giant’s stride can feel like treading water.
2. The reliability paradox
As systems grow more reliable, their rare failures become less predictable and more jarring. Trust can stagnate – or even decline – despite falling error rates.
On paper, GPT‑5 should be more reliable than previous LLMs. OpenAI’s launch benchmarks suggest it hallucinates far less than o3, especially on conceptual and object‑level reasoning.
In my own use, hallucinations feel rarer. GPT‑5 Thinking aced a 51‑item, nine‑data‑point analysis I gave it and added a derivative analysis I had not asked for. Claude Opus 4.1, by contrast, miscounted the items and gave weaker recommendations. GPT‑5’s output took me 30 minutes to verify in Excel – not because it was wrong, but because the data format was awkward. Across simpler tasks, this is the pattern: more accurate, more often.
The problem is what happens when it is wrong. During a recent trip to Tokyo, I asked GPT‑5 to name the city’s oldest Italian restaurant while standing under it. It named a different place, yet when I prompted it harder it knew the full history of the restaurant I was actually in. The same kind of jarring mistake popped up in OpenAI’s live demo, where GPT‑5 botched its explanation of the Bernoulli effect. These errors are not frequent, but they are unpredictable, and that makes them dangerous.
Psychologists call this automation complacency: the more reliable a system is, the less closely we watch it, and the more likely rare errors are to slip through. With GPT‑4‑level error rates, I stayed alert for slip‑ups; with GPT‑5, I can feel myself letting my guard down. The brain’s ‘error detection’ system habituates, so vigilance drops.
This risk compounds in agentic workflows. Even with a 1% hallucination rate, a 25-step autonomous process has roughly a 22% chance of at least one major error. For enterprise use, that is still too high. Last week, AWS released Automated Reasoning Checks, a formal-verification safeguard that encodes domain rules into logic and mathematically tests AI outputs against them. They tout “up to 99% accuracy.” This will help, but it’s not the last word.
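For readers who want the arithmetic behind that roughly 22% figure: it is simply the complement of every step succeeding. Here is a minimal sketch in Python, using the illustrative numbers above (1% per step, 25 steps) and the simplifying assumption that each step fails independently:

```python
# Chance that a multi-step agentic run hits at least one error,
# assuming each step fails independently with the same probability.
def chance_of_any_error(per_step_error_rate: float, steps: int) -> float:
    return 1 - (1 - per_step_error_rate) ** steps

# Illustrative numbers from the paragraph above: 1% per step, 25 steps.
print(f"{chance_of_any_error(0.01, 25):.1%}")  # -> 22.2%
```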
Nevertheless, when mistakes are rarer yet less predictable, perceived reliability does not climb as much as the benchmarks suggest. That is why GPT‑5’s improved accuracy can still feel like a modest leap. The progress is real, but it does not fully translate into user confidence.
3. The benevolent-control paradox
The more capable the assistant, the more its “helpful” defaults shape our choices – turning empowerment into subtle control.