🧠 We are all spiky
We've done this trick once, turning spiky humans into civilisation. We're about to do it again.
"I subscribe because you are naming fractures that matter." – Lars, a paying member
Andrej Karpathy argues we've got one common AI metaphor wrong. We talk about training AI as if we're training animals, building instincts and, ultimately, shaping behaviour. No, he says in his year-end review:
We're not "evolving/growing animals", we are "summoning ghosts". Everything about the LLM stack is different (neural architecture, training data, training algorithms, and especially optimization pressure) so it should be no surprise that we are getting very different entities in the intelligence space, which are inappropriate to think about through an animal lens.
He goes on to say that the "ghosts"
are at the same time a genius polymath and a confused and cognitively challenged grade schooler, seconds away from getting tricked by a jailbreak to exfiltrate your data.
We're familiar with that jaggedness. Ethan Mollick and colleagues ran a field experiment two years ago with management consultants using GPT-4. Inside the model's competence zone, consultants worked roughly 25% faster and produced outputs rated about 40% better. But on a task just outside that frontier, they became far more likely to get the answer wrong. Same tool, opposite effects.
This jaggedness has more recently surfaced in Salesforce's Agentforce itself, with the senior VP of product marketing noting that they "had more trust in the LLM a year ago". Agentforce now uses more deterministic approaches to "eliminate the inherent randomness" of LLMs and make sure critical processes work as intended every time.
Andrej's observation cuts deeper than it may initially appear. The capability gaps he describes are not growing pains of early AI. They are signatures of how training works. When a frontier lab builds a model, it rewards what it can measure: tasks with clear benchmarks, skills that climb leaderboards and get breathlessly tweeted with screenshots of whatever table they top.
The result is a landscape of jagged peaks and sudden valleys. GPT-4 can pass the bar exam but struggles to count the letters in "strawberry." And yes, the better models can now count the 'r's in strawberry, but even the best have their baffling failure modes. I call them out regularly, and recently Claude Opus 4.5 face-planted so badly that I had to warn it that Dario might need to get personally involved:
What gets measured gets optimised; what doesn't, doesn't. Or, as Charlie Munger put it: show me the incentive and I'll show you the outcome.
The benchmarks a system trains on sculpt its mind. This is what happened in the shift to Reinforcement Learning from Verifiable Rewards (RLVR): training against objective rewards that, in principle, should be more resistant to reward hacking. Andrej refers to this as the fourth stage of training. Change the pressure, change the shape.
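To make "verifiable" concrete, here is a minimal sketch of the kind of reward signal RLVR relies on. The function name and the exact-match check are my illustration, not any lab's actual pipeline:

```python
# Illustrative sketch of a verifiable reward of the sort RLVR-style training uses.
# The checker is deterministic: no human rater, no learned reward model.

def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the known-correct one, else 0.0."""
    normalise = lambda s: s.strip().lower()
    return 1.0 if normalise(model_answer) == normalise(ground_truth) else 0.0

print(verifiable_reward("  42 ", "42"))        # 1.0
print(verifiable_reward("forty-two", "42"))    # 0.0
```

Only domains where an answer can be checked this cheaply (maths, code, logic puzzles) generate the signal, which is exactly where the spikes appear.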
The essay has raised an awkward question for me. Are these "ghosts" something alien, or a mirror? Human minds, after all, are also optimised: by evolution, by exam syllabi, by the metrics our institutions reward, by our deep family backgrounds. We have our own jagged profiles, our own blind spots hiding behind our idiosyncratic peaks.
So, are these the "ghosts" Andrej describes, or something more familiar? Our extended family?
Psychometricians have argued for a century about whether there is a general factor of intelligence, which they call g, or whether intelligence is a bundle of capabilities. The durable finding is that both are true. Cognitive tests correlate positively, and the common factor they share can explain up to two-thirds of the variation in academic and job performance.
But plenty of specialised strengths and weaknesses remain beneath that statistical umbrella. Humans are not uniformly smart. We are unevenly capable in ways we can all recognise: brilliant at matching faces, hopeless at intuiting exponential growth.
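A toy numerical sketch of that "both are true" finding, with an invented correlation matrix standing in for a real test battery: a single dominant factor soaks up most of the shared variance, yet plenty of test-specific spread survives alongside it.

```python
# Toy illustration of how a general factor emerges from positively correlated
# test scores. The correlation matrix is invented for illustration only.
import numpy as np

# Four hypothetical cognitive tests, all positively correlated.
R = np.array([
    [1.0, 0.6, 0.5, 0.4],
    [0.6, 1.0, 0.5, 0.4],
    [0.5, 0.5, 1.0, 0.3],
    [0.4, 0.4, 0.3, 1.0],
])

eigenvalues = np.linalg.eigvalsh(R)[::-1]   # largest first
share = eigenvalues[0] / eigenvalues.sum()  # variance carried by the first factor

print(f"First factor explains {share:.0%} of the variance")
# One dominant factor plus persistent test-specific spread: g and spikiness at once.
```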
The oddity is not that AI is spiky; it's that we expected it wouldn't be.
We carried over the mental model of intelligence as a single dial; turn it up, and everything rises together. But intelligence, human or machine, is not a scalar. Evolution optimised life for survival over deep time, in embodied creatures navigating physical, increasingly crowded and ultimately social worlds. The result was a broadly general substrate that nonetheless retains individuated expression.
Today's LLMs are selected under different constraints: imitating human text, then concentrated reward in domains where answers can be cheaply verified. Different selection pressure; different spikes. But spikes nonetheless.
Spikiness & AGI
What does spikiness imply for AGI as a single system that supposedly does everything?
My bet remains that we won't experience AGI as a monolithic artefact arriving on some future Tuesday. It will feel more like a change in weather: it's getting warmer, and we gradually wear fewer layers. Our baseline expectation of what "intelligent behaviour" looks like will change. It will be an ambient condition.
The beauty is that we already know how to organise such spiky minds. We're the living proof.

Human society is one coordination machine for spiky minds. It's the surgeon with no bedside manner. The mathematician baffled by small talk. The sales rep who can close a deal on far better terms than the Excel model ever allowed.


