In the corridors of Silicon Valley’s most secretive AI labs, a quiet revolution is unfolding. While headlines scream of stalled progress, insiders know something the market hasn’t caught up to yet: the $1 trillion bet on AI isn’t failing; it’s transforming.
The real story? AI’s most powerful players are quietly shifting away from the purely brute-force scaling that defined the last decade. Instead, they’re pursuing a breakthrough approach that could continue to deliver.
The strategy of scaling, training ever larger models on ever more data, is yielding diminishing returns. Here’s The Information:
Some researchers at [OpenAI] believe Orion, [the latest model], isn’t reliably better than its predecessor in handling certain tasks, according to the employees. Orion performs better at language tasks but may not outperform previous models at tasks such as coding.
Google’s upcoming iteration of its Gemini software is said not to meet internal expectations. Even Ilya Sutskever, OpenAI’s former chief scientist, has remarked that scaling has plateaued:
The 2010s were the age of scaling; now we’re back in the age of wonder and discovery once again.
If this is true, does it mean the trillion-dollar bet on bigger and bigger AI systems is coming off the rails?
The short answer is no.
Here is how to think about it.
Expectations are high, outlook confused
The leap from GPT-1 to GPT-2 was remarkable, and the step from GPT-2 to GPT-3 was substantial. If you work at one of the frontier labs, you are working toward what they call artificial general intelligence on a timeline that has collapsed. A decade ago, the median AI researcher thought AGI was decades away. Today, the bosses of the major labs believe AGI could be reached within a few years: Sam Altman and Dario Amodei tout 2025 and 2027 respectively as the years when it will arrive.
The language of AGI, from the perspective of a frontier lab, is that of a god-like technology: solve it, and you solve everything else. That is a project with a huge reward. Such expectations, coupled with industrial secrecy, incentives to continuously advertise progress and the limitations of our measures of progress (which we’ve discussed before), create a lot of confusion. On the other hand, the media and AI critics may have their own reasons for wanting or predicting a slowdown, whether from legitimate concerns about AI’s rapid progress or a desire to see ambitious tech industry predictions fail.
It’s certainly the case that existing large language models are so complicated that they do behave unpredictably. As one observer points out:
researchers who release the models aren’t unbiased scientists, so they don’t test on every single benchmark ever. They are more likely to choose benchmarks that show improvement in the announcement. But I’ve noticed online that private benchmarks often diverge from public ones… I think it’s likely that over the past year private benchmarks show more variable performance… compared to the public benchmarks everyone uses.
In many discussions with AI researchers in Silicon Valley last week, I heard the constant refrain that people building foundation models couldn’t get their hands on enough compute to explore fruitful avenues.
What is a hurdle for a frontier lab may be irrelevant for those of us using the technology over the next couple of years.
The new scaling
Technological progress rarely advances monotonically, that is, in a smooth trajectory. Instead, you tend to reach plateaus that require new strategies to navigate. Up close, these plateaus look significant, like the chart below.
As I pointed out in July, with a bit more historical distance, waves of innovation overlap and give us a smooth curve of increasing progress. Bumps are to be expected, and they disappoint roadmaps and market expectations if communicated poorly. But bumps also force builders to come up with new approaches.
As Ilya Sutskever said two months ago:
Everyone just says scaling hypothesis. Everyone neglects to ask, what are we scaling?
We now have a clue what this new scaling could look like. Instead of funnelling all computational resources into training larger models, companies are dedicating more compute to the inference phase, the period when the model is actively being used. If you have used ChatGPT’s o1 model, this is the pause after you ask a question, while the model works through it step by step.
This approach gives models like o1 more “thinking time”. For example, it manages to create a chain of plausible events linking the disappearance of lizards to the US official language becoming Spanish (see here for the full chat).
Rather than producing an answer in a single pass, the model reflects on various scenarios before determining the most plausible sequence of events. It generates and evaluates multiple solution attempts, critiques its own reasoning and synthesises improved solutions. This method has a scaling-law-type effect: the more compute allocated during inference, the better the performance.
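To make the idea concrete, here is a minimal sketch of best-of-N sampling with self-critique, one simple way to spend extra compute at inference time. The functions `generate_attempt` and `critique` are hypothetical placeholders standing in for calls to a language model; this illustrates the general technique, not how o1 actually works under the hood.

```python
# Minimal sketch of inference-time scaling via best-of-N sampling with
# self-critique. generate_attempt() and critique() are hypothetical
# stand-ins for model calls, not any lab's actual API.
import random
from dataclasses import dataclass

@dataclass
class Attempt:
    reasoning: str
    answer: str
    score: float  # self-assessed plausibility, 0.0 to 1.0

def generate_attempt(question: str) -> Attempt:
    """Placeholder: one chain-of-thought attempt at the question."""
    reasoning = f"Step-by-step reasoning about: {question}"
    return Attempt(reasoning, answer="candidate answer", score=random.random())

def critique(attempt: Attempt) -> float:
    """Placeholder: the model re-reads its own reasoning and re-scores it."""
    return (attempt.score + random.random()) / 2

def answer_with_inference_compute(question: str, n_attempts: int) -> Attempt:
    """More attempts means more inference compute and a better expected answer."""
    attempts = [generate_attempt(question) for _ in range(n_attempts)]
    for a in attempts:
        a.score = critique(a)                    # critique its own reasoning
    return max(attempts, key=lambda a: a.score)  # keep the most plausible attempt

best = answer_with_inference_compute("Why did the lizards disappear?", n_attempts=8)
print(best.answer, best.score)
```

The point of the toy loop is the shape of the trade-off: doubling `n_attempts` roughly doubles inference-time compute, and the best answer found can only improve, which is the scaling-law-type effect described above in miniature.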