🔮🇨🇳 Inside the Chinese AI labs where America’s AI controls created its toughest competition
We put a number on China’s efficiency moat
We spent a week on the ground in China, visiting AI and robotics labs and seeing how things operate firsthand. We traveled through Beijing, Hangzhou and Shanghai to meet representatives from 14 labs, including DeepSeek, MoonshotAI, MiniMax, Z.ai, ByteDance, 01.AI, Alibaba, Ant Group, Xiaomi, AInnovation, Galbot, Unitree, ModelScope, and RWKV; joined by our friends Kevin Xu, Lily Ottinger, Florian Brand, afra, Kai Williams, Lingua Sinica, Jasmine Sun, Caithrin and Nathan Lambert who made it all happen.
We participated in dozens of hours of discussions with researchers, founders, product leaders, and business owners across the infrastructure, hardware, models and application layers. Every lab is obsessed with ByteDance’s Doubao, and respectful of DeepSeek’s scientific process. Claude is the model of choice for coding, universally rated as the best thing out there. The researchers we met were humble, welcoming, focused purely on technical priorities and building the next big model. Researchers were also very young: at one lab in particular, the average age was 25.
But chip constraints are real. Everyone wanted Nvidia chips, and the constraint was showing up in longer and longer pre-training runs and iterative cycles.
We arrived in China and really wanted to understand how severely export controls were biting and how much they were harming AI development. The stock of AI compute in China is running two to three years behind that of the US. In the short term, it’s clear that the controls are making it harder.
But, as we discovered, it’s not as obvious over the long term.
The export controls have become capability-generating – labs in China are forced to be ruthlessly efficient. Despite the three-year compute handicap, Chinese open-source models are only six to eight months behind the US frontier.
By listening to the researchers and digging into the data, we have estimated quantitatively just how significant that efficiency capability is. We reckon Chinese labs are extracting 4-7x as much intelligence per unit of compute as naive scaling predictions would suggest.
On the day a sitting US president lands in Beijing for the first time in nearly a decade, we decided to publish our research into how and why the constraints have inadvertently created the conditions for the most formidable competitors to develop exactly the capabilities that will matter most in the coming years.
The compute gap, the US lead
In every meeting with every Chinese lab, we heard a common refrain: we do not have enough compute. Less compute means fewer experiments and smaller models. This is a genuine constraint on research, development and deployment of AI.
That isn’t surprising. American researchers complain, too. So do the business people: Microsoft and Anthropic, for example, have explicitly stated that the lack of compute capacity has cost them meaningful revenue.
But the compute constraint in China is different. It isn’t just that there is less capital around – China’s AI startups raised $12.4 billion in 2025 compared to $285 billion in the US. It is that export controls on chips, initiated by Joe Biden in October 2022 and subsequently relaxed and tightened at various times by President Trump, have all but choked off the supply of advanced chips to the Chinese market. The interesting thing to us was seeing firsthand how local labs have responded.
Let’s first step back for a moment to put the compute gap into context.
US labs trumpet securing large amounts of compute. In recent weeks and months, Anthropic alone has signed deals totaling over 10 gigawatts of capacity – with Amazon, Google, Microsoft, Nvidia and SpaceX. OpenAI committed to 10 gigawatts of Nvidia systems last September, backed by up to $100 billion in Nvidia investment. These orders are now exclusively for the latest, most powerful silicon: Nvidia’s Blackwell series (B200, B300, GB200) shipping today and the next-generation Vera Rubin platform arriving later this year, and increasingly Google TPUs and others.
These large orders simply are not options for Chinese hyperscalers and labs. Supply isn’t entirely dry: Chinese customers are still getting their hands on Nvidia’s H100s, B200s and B300s. These are coming through from Singapore, by and large, via shell companies where shipments are relabelled as tea or toys. But quantities are at least an order of magnitude below those of their US rivals.
It is these top-tier American chips, especially the most recent vintages, that count. A single GB300 NVL72 rack (72 of Nvidia’s latest GPUs operating as one system) delivers 30x faster real-time inference than the equivalent H100 cluster from three years earlier, with 3.6x more memory per chip and 25x lower energy per inference. US labs are now ordering these systems by the gigawatt. Chinese labs cannot.
Chinese tech firms, notably Huawei, have made strides in building chips suited for AI. But even Huawei’s latest, the Ascend 950PR, launched in March, is roughly on par with the H100, released in 2022. The systems are shipping in far smaller volumes. Nvidia is estimated to have shipped 7 million Hopper and Blackwell GPUs through October 2025 alone and the rate is increasing. Huawei plans to ship 750,000 Ascend 950PR chips this year, which is still around a tenth of what Nvidia shipped last year.
The result is that the US has a staggering lead in deployed AI compute capacity.
The lead is widening, not shrinking. In 2023, the US AI sector had triple the amount of deployable compute that China had, almost all of which would have been focused on training AI models. By the start of this year, that gap was closer to eightfold.
Put differently, by the end of 2025, Chinese labs could likely access roughly the same scale of compute that the US enjoyed two years earlier.
The difference is how that compute is used. In 2023, most American capacity was tied up in training, not serving customers. By contrast, in 2025, China’s compute stack, augmented by data centers in Malaysia and Singapore, was doing double duty – supporting model training and serving hundreds of millions of consumers, and a rapidly growing base of enterprises, through apps like WeChat, Doubao and Alipay.
It’s important to separate compute capacity for training AI models from serving customers. China has a huge AI market. Doubao alone reaches 100 million daily active users. Token volumes are equally vast. By February 2026, we estimate Chinese token volumes had reached ~9 quadrillion tokens a month – compared to ~4 quadrillion across the main US/Western providers.
Alongside datacenters in Malaysia and Singapore, a large part of Chinese compute infrastructure is going to serve customers through inference. If half of the compute is used to serve those customers, that reduces the available compute for training models. We might conclude, with low confidence, that by the end of 2025, Chinese labs had as much compute available for model training as American labs did in mid-2023.
By that logic, the performance of models from Chinese labs should be at least two years behind that of American models if labs in both countries are using the same approach – more computing and more data to build better models.¹ The framework treats capability as a function of compute, holding training efficiency roughly constant.
But we aren’t seeing a 2-3 year gap.
The headline is that Chinese models are three to six months behind the US on benchmark performance, according to DeepSeek, and 8 months according to the Center for AI Standards and Innovation, a US government agency.
In fact, Chinese labs appear to be keeping pace with US labs, or perhaps even narrowing the gap in some ways. The question for us then became: are the capability headlines wrong, or is something closing the gap that the compute numbers don’t capture?
There is an additional wrinkle: market structure. In the US, five key frontier labs dominate training compute: OpenAI, Anthropic, Google DeepMind, Meta and xAI. In China, a thousand flowers are blooming. The big tech firms are developing their own frontier models:
And even larger firms are coming in because they have specific data and expertise, such as Ant Financial, with its Ling series, or Meituan, known for its on-the-go delivery-retail platform, which has also entered the LLM development market.
The impact of so many firms training their own models is that the pool of compute is being divided still further.
The efficiency moat strikes back
These labs are clearly finding efficiencies in training performant models. The efficiencies are also being passed through to the models’ inference, that is, to when they are being used to serve customers: they are much cheaper than roughly equivalent American models.
One shouldn’t put too much weight on AI benchmarks. They can be gamed, and they might not easily reflect how a model “feels” or works in practice. But they are one reference point, however inadequate. DeepSeek’s V4 Pro, their flagship model, is comparable in some ways to Anthropic’s Claude Opus 4.6. Opus 4.6 was released in Feb 2026, and is not Anthropic’s latest model. Cost-wise, though, you can see the difference. DeepSeek charges $0.43 per million input tokens and $0.87 per million output tokens. Opus 4.6 is 11 times more costly on input and 28 times more expensive on output.
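To make those multiples concrete, here is a minimal sketch that prices a hypothetical workload at both rates. The DeepSeek prices and the 11x and 28x multiples come from the text above; the Opus prices are only implied by those multiples rather than quoted, and the workload size and `input_share` (the fraction of tokens that are input) are assumptions chosen for illustration.

```python
# Prices quoted in the text, in US dollars per million tokens.
DEEPSEEK_INPUT = 0.43
DEEPSEEK_OUTPUT = 0.87

# Multiples quoted in the text for Opus 4.6 relative to DeepSeek.
OPUS_INPUT_MULTIPLE = 11
OPUS_OUTPUT_MULTIPLE = 28


def bill(input_tokens, output_tokens, input_price, output_price):
    """Return the cost in dollars for a given number of input and output tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def compare(total_tokens, input_share):
    """Price one workload both ways.

    input_share is a hypothetical parameter: the fraction of tokens that are input.
    The Opus prices are implied by the quoted multiples, not taken from a price list.
    """
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    deepseek = bill(input_tokens, output_tokens, DEEPSEEK_INPUT, DEEPSEEK_OUTPUT)
    opus = bill(
        input_tokens,
        output_tokens,
        DEEPSEEK_INPUT * OPUS_INPUT_MULTIPLE,
        DEEPSEEK_OUTPUT * OPUS_OUTPUT_MULTIPLE,
    )
    return deepseek, opus


if __name__ == "__main__":
    # Assumed workload: 100 million tokens, 90% of them input.
    deepseek, opus = compare(total_tokens=100_000_000, input_share=0.9)
    print(f"DeepSeek: ${deepseek:,.2f}")
    print(f"Opus (implied): ${opus:,.2f}")
    print(f"Blended ratio: {opus / deepseek:.1f}x")
```

Whatever the mix of input and output tokens, the blended ratio falls between the two quoted multiples; the more output-heavy the workload, the closer it sits to 28x.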
These are not promotional one-offs. Across the Chinese frontier, Kimi K2.6 sits at $0.95 input (among the cheapest models in the global top 10 by GPQA Diamond), and Alibaba’s Qwen models are priced in a similar band. The price of inference is a function of three factors: the actual cost of serving the model, its compute complexity and energy costs, and the margin the provider is willing to give up.
The margins appear largely healthy. Z.ai serves its GLM-5 model at $1.00 per million input tokens, which is 3x cheaper than Claude Sonnet 4.6 on input, and 5x cheaper on output. Despite this, it boasts a 50% gross margin, and MiniMax enterprise margins sit at 70%, though we don’t know if this holds across the board. DeepSeek, for its part, ran for years on internal funding alone, only turning to outside capital this month.
Finally, that efficiency shows up in how easily these models run on consumer hardware like laptops and phones. The leading local models in the world are almost all Chinese open-source, aggressively distilled down to smaller, lighter variants. A 5GB Qwen3-8B model runs on my Mac, as does DeepSeek R1’s 7B distilled variant, which has been pulled 85 million times on Ollama, the second-most-downloaded local model in the world. Cursor even built its Composer 2 model on top of MoonshotAI’s Kimi K2.5 model. The only US-based open-source model we run locally is Google’s Gemma 4.
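A rough way to see why a model of that size can fit on a laptop is to count the bytes needed for the weights alone. The sketch below assumes that “8B” means roughly 8 billion parameters and that a local build stores weights at about 4 bits each; both are assumptions, and the function ignores everything other than weights (attention cache, activations, file overhead), so it gives a floor rather than an exact file size.

```python
def weight_memory_gb(parameters, bits_per_weight):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return parameters * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    parameters = 8e9  # assumed: 8 billion parameters
    for bits in (32, 16, 8, 4):
        print(f"{bits:>2} bits per weight: {weight_memory_gb(parameters, bits):.0f} GB")
```

At 16 bits per weight the same model needs about four times the memory it needs at 4 bits, which is the difference between a workstation GPU and an ordinary laptop.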
How did they do it?
Our in-depth conversations with the labs point to three key features that enable this sustained efficiency.
The first is the ecosystem itself. Inside the labs, people described how “卷”, juǎn, the ecosystem is. Juǎn literally means “to curl inwards”; it’s slang for an arms race so intense you are running hard just to stay in place.
In Chinese internet culture, this falls under a broader concept sometimes called “内卷”, nèijuǎn, or involution: self-reinforcing competition with diminishing returns. The labs are competing ferociously. That is hardly surprising in a system where industrial policy actively cultivates competition between cities and where outpacing Shanghai or Hangzhou matters more than catching San Francisco or King’s Cross.
This competition creates a culture that demands long hours and hard work to achieve results. In many cases, the key research is led by PhD students, while they are pursuing their studies. In Shanghai, engineers took us clubbing and left the venue at 1am to go back to work. At another lab, Lily Ottinger stumbled on camping beds for researchers who didn’t go home. It’s not the kind of thing that goes in a handbook.
The Chinese ecosystem remains fairly open: labs still publish regularly and mostly release open-source models, and researchers move from lab to lab. Know-how spreads horizontally across players.
The clearest example of this open-source feedback loop is Multi-head Latent Attention (MLA), an architectural technique used inside LLMs. DeepSeek introduced it in V2 in May 2024 to compress the attention cache, reducing memory use by more than 93% while improving model quality. Within a year or so, it was inside Moonshot’s Kimi K2, Z.ai’s GLM-5, Ant Group’s Ling 2.5, and every DeepSeek model since. Xiaomi’s MiMo V2-Pro model is another excellent example. It builds directly on two efficiency techniques first pioneered by DeepSeek in 2024, then shared openly and adopted by others.
The biggest exception to this open culture is the mammoth ByteDance, which has unparalleled capital, compute and distribution scale. In fact, in our conversations, ByteDance was clearly the 800lb gorilla, discussed in a particular tone best described as the most positive intersection between hushed, envious and concerned.
Our hosts walked us through nearly every major architectural and training innovation reshaping frontier AI – DAPO, MLA, GRPO, mixture-of-experts optimization – each a technique for extracting more capability from leaner compute budgets. One lab adopts only architectural changes that deliver 20% or better efficiency improvements; any experiments that deliver 10% or less are abandoned.
In architecture, MLA compresses the attention cache by 93%, meaning each conversation occupies less than a tenth of the GPU memory it otherwise would. This parsimony means a given chip can handle many more parallel conversations. Similarly, DeepSeek V4 Flash uses a quantization technique that mixes FP4 and FP8.² Compared to its previous generation, V4 Flash scores higher on benchmarks, stretches the context window to eight times its former length, and still uses only 10% of the memory and 27% of the FLOPs for inference.
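The link between a smaller attention cache and more parallel conversations is simple arithmetic, sketched below. Only the 93% reduction comes from the text; the cache budget and the per-conversation cache size are hypothetical figures picked for illustration, and the sketch assumes the cache is the only thing competing for that memory.

```python
def compressed_cache_gb(baseline_gb, reduction):
    """Cache size per conversation after compressing it by `reduction` (0.93 = 93%)."""
    return baseline_gb * (1 - reduction)


def concurrent_conversations(cache_budget_gb, cache_per_conversation_gb):
    """Whole conversations that fit in the memory set aside for the cache."""
    return int(cache_budget_gb // cache_per_conversation_gb)


if __name__ == "__main__":
    budget_gb = 40.0    # hypothetical: GPU memory left over for the cache
    baseline_gb = 2.0   # hypothetical: uncompressed cache per conversation

    before = concurrent_conversations(budget_gb, baseline_gb)
    after = concurrent_conversations(
        budget_gb, compressed_cache_gb(baseline_gb, reduction=0.93)
    )
    print(f"Conversations per chip before compression: {before}")
    print(f"Conversations per chip after compression:  {after}")
```

A 93% reduction leaves 7% of the original cache, so under these assumptions the same chip holds about fourteen times as many conversations.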
Has this had a real effect? Today, we can show it has. To quantify it, we have used an efficiency multiplier: a calculation we’ve put together to compare how capable the models are with how capable they ought to be given their compute constraints.
We found that Chinese labs are extracting 4-7x more intelligence per unit of compute than naive scaling predictions would suggest. In terms of time saved, this looks like 2-3 years of efficiency gains.
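One way to relate a multiplier to a number of years is to assume that effective compute compounds at a steady annual factor. The sketch below is an illustration of that conversion under that assumption, not the calculation behind the figures above; `annual_factor` is a hypothetical parameter.

```python
import math


def years_equivalent(multiplier, annual_factor):
    """Years of progress that equal `multiplier`, if gains compound at `annual_factor` per year."""
    return math.log(multiplier) / math.log(annual_factor)


def implied_annual_factor(multiplier, years):
    """Annual compounding factor that turns `years` of progress into `multiplier`."""
    return multiplier ** (1 / years)


if __name__ == "__main__":
    # Assumed: effective compute doubles every year.
    for multiplier in (4, 7):
        years = years_equivalent(multiplier, annual_factor=2)
        print(f"{multiplier}x is about {years:.1f} years at 2x per year")

    # Working backwards from the pairing of 4-7x with 2-3 years.
    print(f"4x over 2 years implies {implied_annual_factor(4, 2):.2f}x per year")
    print(f"7x over 3 years implies {implied_annual_factor(7, 3):.2f}x per year")
```

Under a doubling-per-year assumption, a 4-7x multiplier maps to a little over two to a little under three years, which is consistent with the range quoted above.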
The future shape of the AI economy
The shape of AI compute workloads is changing: less training, much more inference. We’re moving from development to deployment. Development won’t end, and the labs will continue to train increasingly capable models, but inference capabilities and inference capacity will become more important. One conclusion we drew from our discussions was that the cultural, technical, and management affordances we saw in Chinese labs may well benefit them as we transition to this new phase.
Agents everywhere
The scale of this shift is already visible. Jensen Huang told GTC, Nvidia’s annual jamboree of self-celebration, in March that compute demand has grown a million-fold in two years and that agentic AI is about to trigger another exponential leap. At those levels, cost and efficiency will matter.
We’ve seen a vast expansion, a couple of orders of magnitude, in token use per user as we have moved to agents. Agents will get more reliable, and this expansion will continue. It could amount to a billion-fold increase in inference demand over the next few years. You can try our model here.
At a personal level, I’ve switched my latest agent – R Grouchy – to MiMo Pro V2.5 specifically because the token costs are a fraction of Claude’s. With my agent token usage ballooning past 100 million tokens daily, this makes a difference. At the scale agents operate, even small cost differences compound into meaningful budget gaps.
Heading to the edge
Over the coming few years, more and more AI will be consumed on the edge, that is, on devices closer to where the intelligence is needed. Think about robots, autonomous cars, devices in your home, or just your phone. Xiaomi already has an empire of 750 million devices, from thermostats to cars, that are starting to integrate AI into daily life.
Obviously, you can’t run a titanic frontier model on an edge device. Today, it takes about 6 to 8 months for frontier model performance to become available in open-source and open-weight models, and from there, roughly another 6-12 months for techniques like distillation to shrink the models down enough to run on consumer devices, like laptops and phones.
Most of the models that do this are Chinese open-source models: Qwen, DeepSeek and MiMo. Models built for scarce compute are already shaped for this emerging environment. And, at some point, those edge devices may be robots. Chinese firms are already shipping: Galbot’s humanoids are running autonomously in warehouses and in specially purposed pharmacies.
The price is right
Even rich American companies like Uber are becoming sensitive to the cost of running AI. Uber’s CTO Praveen Neppalli Naga revealed that Uber exhausted its full 2026 AI budget by April, thanks to its 5,000 engineers getting Claude-pilled. CFOs are increasingly sensitive to token bills; infinite is not an option.
Internationally, this creates a huge opportunity. Z.ai’s third-largest market is Indonesia, a price-sensitive market. MiniMax generates over 70% of its revenue outside China. Beyond the US and Western Europe are millions of businesses and hundreds of millions of consumers for whom price may yet prove decisive.
From a moat to a system
I went to China eager to listen and learn. Five years ago, I’d argued that it was
highly unlikely that the US can stop China developing a semiconductor industry…this might result in the development of two different technology ecosystems.
This was before ChatGPT, of course. Today is similar but different. While Chinese semiconductor independence is increasing, there is a related split developing higher up the stack in three ways.
First, the efficiency moat
Chinese labs may be building a competitive moat, one that is designed around these principles of efficiency. Good moats last. We hypothesise that this is an efficiency moat, where they can consistently train competitive models at much lower cost than rivals and then serve these models at a much lower cost per token.
The practices of parsimony are leading to leaner but competitive models.
Could this lead to long-term advantages? Possibly, if the market evolves toward more inference on heterogeneous infrastructure and a more diverse set of customers.
Second, a new managerial theory is emerging
We might also be witnessing the emergence of a new managerial theory for running AI labs. The best parallel I can think of is the Toyota Production System (TPS), designed by Taiichi Ohno. Like Chinese AI labs, Toyota had little choice but to build its products a different way. Steel was expensive and capital was scarce. Toyota was a cash-poor endeavour. Using TPS, Japanese firms got to a cost structure Detroit couldn’t match, while progressively improving the quality of their products. Even as Japanese automakers saw their coffers swell with profits, the efficiency and cost advantages remained.
An analogous process may be emerging among Chinese AI labs. A lab can’t run all three workstreams in parallel at full capacity. Instead, it must allocate compute to whichever team is most ready to use it productively at a given moment—and that allocation discipline ultimately becomes a capability. The US frontier labs, operating on effectively uncapped Nvidia silicon, find their forcing function much further out.
I don’t want to belabor the analogy. What is happening in China is not identical to the TPS in many ways, nor is it as mature. But we should not ignore the structural implications. If inference is where value will accrue next, and if one approach yields much more efficient inference, then the export controls will have achieved the opposite of their intent.
This is a different ecosystem
Before I travelled to China, I was keenly looking for signs of a clean hardware split – the US on Nvidia, China increasingly locked onto Huawei. The reality is messier and more interesting than that. The full ecosystem feels distinct: the hardware, the model-building culture, the internal capabilities, the hiring priorities, and the way decisions get made.
Nvidia remains the dominant compute choice in China, for now. But that is shifting, and will shift faster as workloads tip toward inference rather than training and as China develops more homegrown chips. More important than the hardware question, though, is the question of how the culture and its relationship to talent are evolving. Culture is what has shaped Silicon Valley over the past 50 years. And China’s ecosystem is finding its own path; it is not merely a slower, constrained copy of Silicon Valley’s.
The party noticed
During the course of our trip, Qiushi, the CCP’s official journal of policy theory, published a report with CCID, an industrial policy think tank nestled in China’s Ministry of Industry and Information Technology. This was a useful proxy for what Beijing wants its bureaucrats, researchers and industrial champions to believe about AI.
China, the report argued, is no longer catching up. It is now in the “first echelon”, leading in some areas, lagging in others. Open-source, domestic compute, high-quality data, and rapid industrial deployment were approved as the route to develop a new national production system.
It openly acknowledges the damage that export controls have caused, forcing domestic teams to slow development due to “computing-power hunger.” To alleviate this, it recommends “deep coupling among computing-power providers, model providers, and industry users.” We saw some evidence of this as many of the labs are helping Huawei design the next generation of Ascend chips.
The elastic effect
Export controls were designed to freeze China out of frontier AI by choking off access to high-end compute. It is undeniable that they have had a serious impact. But the constraints have fostered new capabilities built around efficiency.
The labs we visited fostered a culture of compounding research. They’re open, selective about what works, and are optimising relentlessly in a way that fits the shape of the future. It’s hard to know whether they would give up those practices if a spigot of GPUs magically opened up for them. Of course, they’d likely lean into that new capacity, but I would wager they wouldn’t give up the unique characteristics that have kept them competitive until now.
Read the accounts of our fellow travelers on the same trip:
And a special thanks to Caithrin for organizing the trip!
1. This scaling approach is known as Chinchilla scaling: labs use the same formulas and the same logic to set parameter counts and training-data ratios.
2. FP4 means each model weight is stored in just 4 bits instead of the usual 16 or 32 – a 4-8x memory reduction. The problem with quantization is that it compresses data and can make models less accurate. DeepSeek addresses this by mixing FP4 and FP8 at different layers, which is why inference FLOPs drop to 27% of the previous generation without a meaningful quality loss.