Why o3 missed what readers caught instantly
The jagged frontier of AI reveals itself when frontier models fail at basic fact-checking.
In August 1997, Microsoft Word urged a friend to replace the phrase “we will not issue a credit note” with the polar opposite, an auto-confabulation that could have unleashed a costly promise.
Fast-forward to 17 July 2025: our AI-powered fact-checker read a sentence claiming Senator Dave McCormick was a mere “hopeful candidate.” It labelled that claim as correct. Two eras, two smarter-than-us machines, one constant flaw: when software speaks with misplaced certainty, humans nod. Let’s unpack why.
We use an LLM-powered fact-checker to screen each edition. This fact-checker uses o3 (which has access to web search) to decompose the draft into discrete claims and check them against external sources. This system runs alongside human checkers.
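For readers who want to picture the mechanics, here is a minimal sketch of such a decompose-then-verify pass. It assumes a generic call_llm helper standing in for whatever search-enabled model client you use; the function names and prompt wording are illustrative, not our production pipeline.

```python
# Hypothetical sketch of a decompose-then-verify fact-checking pass.
# `call_llm` is a stand-in for whatever model/API wrapper you use
# (e.g. an o3 call with web search enabled); it is not a real library call.

import json

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your LLM of choice and return its text reply."""
    raise NotImplementedError("wire this up to your own model client")

def extract_claims(draft: str) -> list[str]:
    # Step 1: ask the model to break the draft into discrete, checkable claims.
    prompt = (
        "List every discrete factual claim in the text below, one per line, "
        "including names, titles and institutional roles.\n\n" + draft
    )
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

def verify_claim(claim: str) -> dict:
    # Step 2: check each claim against external sources and return a verdict.
    prompt = (
        "Using web search, verify the following claim. "
        'Reply as JSON: {"claim": ..., "verdict": "supported|contradicted|unclear", '
        '"evidence": ...}\n\nClaim: ' + claim
    )
    return json.loads(call_llm(prompt))

def fact_check(draft: str) -> list[dict]:
    return [verify_claim(c) for c in extract_claims(draft)]
```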
For this item, the mistake slipped through: McCormick was described as a Senate hopeful rather than a sitting senator.
And to be fair, the sentence didn’t seem outrageous at a glance. The central point was about the scale of US investment in clean energy: hundreds of billions in potential funding. The institutional detail (Senator or Senate candidate) seemed secondary. But that’s exactly the problem. The model, and to some extent our human reviewers, prioritized the big thematic facts and let the specifics slide.
The final catch didnāt come from an LLM. It came from eagle-eyed readers who had the context and were quick to spot the mistake.
Once we realised what had happened, we tried to diagnose the problem. We ran the section through a number of different LLMs (including o3, o3 Pro, Perplexity and Grok). None of them spotted the mistake.
We refined the prompts based on feedback from the models, but the problem persisted, even when we explicitly instructed the model to verify people’s roles. Here was iteration three:
In that run, the LLM noticed the discrepancy between our original text and its own finding. The process continued until we found a cumbersome prompt that identified the mistake.
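To make the idea concrete, here is a hypothetical example of the sort of explicit role-verification instruction we mean; the wording is our illustration, not the actual “iteration three” prompt.

```python
# Hypothetical example of an explicit role-verification instruction added to
# the fact-checking prompt. The wording is illustrative only.

ROLE_CHECK_INSTRUCTION = (
    "For every named person in the text, state their current role or title, "
    "cite a source, and flag any mismatch between that role and the role the "
    "text assigns them (e.g. 'Senate candidate' vs 'sitting senator')."
)

def build_fact_check_prompt(draft: str) -> str:
    # Prepend the role check so the model treats titles as first-class claims,
    # not background detail.
    return ROLE_CHECK_INSTRUCTION + "\n\nText to check:\n" + draft
```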
Weirdly, we ran the text and our most basic prompt through the Dia Browser. Dia draws from a range of different LLMs, and it found the problem immediately. In other words, the most basic tools outperformed the most advanced ones. This is a textbook illustration of AI’s jagged frontier, which describes how AI excels at some cognitive tasks while failing unexpectedly at others, with no smooth boundary between the two.
This was an instructive failure. Here’s what it taught us.
1. Our hybrid human-AI workflow needs a rethink
Our current editorial process uses LLMs as a first line of review, before human editors step in. The assumption is that the models will catch the obvious mistakes and that our team will catch the subtle ones.
But this case reveals a deeper flaw: the models didn’t catch what should have been obvious because the main point of the story was America’s new industrial policy. A good human subeditor would have caught (or at least checked) the claims about McCormick, who isn’t as well known as Trump. That’s the trust trap. Silence masquerades as certainty. When an LLM returns no objections, our cognitive guard drops; we mistake the absence of alarms for evidence, when it may simply reflect ignorance. How do you avoid it?
Obviously, our human processes need a review, and they will become more onerous. Equally, our automated fact-checking may need to become a multi-step or parallel pipeline, with different systems evaluating different classes of claims. I already do this earlier in my research process: I tend to use a couple of different LLMs to start framing an issue, and use their points of concordance and disagreement as jumping-off points for further research.
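As a rough sketch of what that kind of parallel cross-check could look like, assuming placeholder model wrappers rather than any specific vendor’s API:

```python
# Hypothetical parallel cross-check: send the same claim to several models
# and surface disagreements for a human editor. The model wrappers are
# placeholders; swap in whatever clients you actually use.

from concurrent.futures import ThreadPoolExecutor

def check_with_model(model_name: str, claim: str) -> str:
    """Placeholder: return 'supported', 'contradicted' or 'unclear' from one model."""
    raise NotImplementedError("wire this up to the relevant model client")

MODELS = ["model_a", "model_b", "model_c"]  # e.g. different vendors or tools

def cross_check(claim: str) -> dict:
    with ThreadPoolExecutor() as pool:
        verdicts = dict(zip(MODELS, pool.map(lambda m: check_with_model(m, claim), MODELS)))
    # Any disagreement (or any 'unclear') routes the claim to a human reviewer.
    needs_human = len(set(verdicts.values())) > 1 or "unclear" in verdicts.values()
    return {"claim": claim, "verdicts": verdicts, "needs_human": needs_human}
```

The point of the design is that any disagreement goes to a person, keeping the models in the role of triage rather than final judge.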
2. As AI gets embedded into workflows, these risks scale