🧠 We are all spiky
We've done this trick once, turning spiky humans into civilisation. We're about to do it again.
"I subscribe because you are naming fractures that matter." – Lars, a paying member
Andrej Karpathy argues we've got one common AI metaphor wrong. We talk about training AI as if we're training animals, building instincts and, ultimately, shaping behaviour. No, he says in his year-end review:
We're not "evolving/growing animals", we are "summoning ghosts". Everything about the LLM stack is different (neural architecture, training data, training algorithms, and especially optimization pressure) so it should be no surprise that we are getting very different entities in the intelligence space, which are inappropriate to think about through an animal lens.
He goes on to say that the "ghosts"
are at the same time a genius polymath and a confused and cognitively challenged grade schooler, seconds away from getting tricked by a jailbreak to exfiltrate your data.
We're familiar with that jaggedness. Ethan Mollick and colleagues ran a field experiment two years ago with management consultants using GPT-4. Inside the model's competence zone, consultants worked roughly 25% faster and produced outputs rated about 40% better. But on a task just outside that frontier, they became far more likely to get the answer wrong. Same tool, opposite effects.
This jaggedness has more recently surfaced in Salesforce's Agentforce itself, with the senior VP of product marketing noting that they "had more trust in the LLM a year ago". Agentforce now uses more deterministic approaches to "eliminate the inherent randomness" of LLMs and make sure critical processes work as intended every time.
Andrej's observation cuts deeper than it may initially appear. The capability gaps he describes are not growing pains of early AI. They are signatures of how training works. When a frontier lab builds a model, it rewards what it can measure: tasks with clear benchmarks, skills that climb leaderboards and get breathlessly tweeted with screenshots of whatever table they top.
The result is a landscape of jagged peaks and sudden valleys. GPT-4 can pass the bar exam but struggles to count the letters in "strawberry." And yes, the better models can now count the 'r's in strawberry, but even the best have their baffling failure modes. I call them out regularly, and recently Claude Opus 4.5 face-planted so badly that I had to warn it that Dario might need to get personally involved:
What gets measured gets optimised; what doesn't, doesn't. Or, as Charlie Munger put it: show me the incentive and I'll show you the outcome.
The benchmarks a system trains on sculpt its mind. This is what happened in the shift to Reinforcement Learning from Verifiable Rewards (RLVR): training against objective rewards that, in principle, should be more resistant to reward hacking. Andrej refers to this as the fourth stage of training. Change the pressure, change the shape.
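To make "verifiable" concrete, here is a minimal sketch of the kind of reward signal RLVR relies on. The function name and the exact-match check are my illustration, not any lab's actual pipeline:

```python
# Illustrative sketch of a verifiable reward of the sort RLVR-style training uses.
# The checker is deterministic: no human rater, no learned reward model.

def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the known-correct one, else 0.0."""
    normalise = lambda s: s.strip().lower()
    return 1.0 if normalise(model_answer) == normalise(ground_truth) else 0.0

print(verifiable_reward("  42 ", "42"))        # 1.0
print(verifiable_reward("forty-two", "42"))    # 0.0
```

Only domains where an answer can be checked this cheaply (maths, code, logic puzzles) generate the signal, which is exactly where the spikes appear.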
The essay has raised an awkward question for me. Are these "ghosts" something alien, or a mirror? Human minds, after all, are also optimised: by evolution, by exam syllabi, by the metrics our institutions reward, by our deep family backgrounds. We have our own jagged profiles, our own blind spots hiding behind our idiosyncratic peaks.
So, are these the "ghosts" Andrej describes, or something more familiar? Our extended family?
Psychometricians have argued for a century about whether there is a general factor of intelligence, which they call g, or whether intelligence is a bundle of capabilities. The durable finding is that both are true. Cognitive tests correlate positively, and the common factor they share can explain up to two-thirds of the variation in academic and job performance.
But plenty of specialised strengths and weaknesses remain beneath that statistical umbrella. Humans are not uniformly smart. We are unevenly capable in ways we can all recognise: brilliant at matching faces, hopeless at intuiting exponential growth.
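A toy numerical sketch of that "both are true" finding, with an invented correlation matrix standing in for a real test battery: a single dominant factor soaks up most of the shared variance, yet plenty of test-specific spread survives alongside it.

```python
# Toy illustration of how a general factor emerges from positively correlated
# test scores. The correlation matrix is invented for illustration only.
import numpy as np

# Four hypothetical cognitive tests, all positively correlated.
R = np.array([
    [1.0, 0.6, 0.5, 0.4],
    [0.6, 1.0, 0.5, 0.4],
    [0.5, 0.5, 1.0, 0.3],
    [0.4, 0.4, 0.3, 1.0],
])

eigenvalues = np.linalg.eigvalsh(R)[::-1]   # largest first
share = eigenvalues[0] / eigenvalues.sum()  # variance carried by the first factor

print(f"First factor explains {share:.0%} of the variance")
# One dominant factor plus persistent test-specific spread: g and spikiness at once.
```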
The oddity is not that AI is spiky; it's that we expected it wouldn't be.
We carried over the mental model of intelligence as a single dial; turn it up, and everything rises together. But intelligence, human or machine, is not a scalar. Evolution optimised life for survival over deep time, in embodied creatures navigating physical, increasingly crowded and ultimately social worlds. The result was a broadly general substrate that nonetheless retains individuated expression.
Today's LLMs are selected under different constraints: imitating human text, then concentrated reward in domains where answers can be cheaply verified. Different selection pressure; different spikes. But spikes nonetheless.
Spikiness & AGI
What does spikiness imply for AGI as a single system that supposedly does everything?
My bet remains that we won't experience AGI as a monolithic artefact arriving on some future Tuesday. It will feel more like a change in weather: it's getting warmer, and we gradually wear fewer layers. Our baseline expectation of what "intelligent behaviour" looks like will change. It will be an ambient condition.
The beauty is that we already know how to organise such spiky minds. We're the living proof.

Human society is one coordination machine for spiky minds. It's the surgeon with no bedside manner. The mathematician baffled by small talk. The sales rep who can close a deal on far better terms than the Excel model ever allowed.


