📈 Why AI bills rise as costs fall
Agents eat tokens at rates that are impossible to forecast
Hi all,
Last week, we explored what tokenmaxxing means for CFOs and how firms can buffer unexpected AI costs.
Today we go further: why AI bills are so hard to forecast, and what will change once we crack that problem.
The token explosion paradox
We estimate that the number of tokens processed per quarter has grown by around 17,000x over four years.1
Token prices have collapsed during this time. Demand for machine intelligence is highly elastic, meaning that as prices fall, consumption rises proportionally more than prices decline, so total spend goes up.
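To make the elasticity point concrete, here is a minimal sketch that assumes a constant-elasticity demand curve (quantity proportional to price raised to minus the elasticity). The function name and the example numbers are illustrative assumptions, not figures from our estimates.

```python
def total_spend_change(price_change: float, elasticity: float) -> float:
    """Fractional change in total spend after a price change.

    Assumes constant-elasticity demand: quantity ~ price ** (-elasticity).
    price_change is fractional, e.g. -0.5 for a 50% price cut.
    """
    price_ratio = 1.0 + price_change
    quantity_ratio = price_ratio ** (-elasticity)
    # Spend is price times quantity, so the spend ratio is the product.
    return price_ratio * quantity_ratio - 1.0


# Hypothetical inputs: a 50% price cut under three elasticities.
for elasticity in (0.5, 1.0, 1.5):
    change = total_spend_change(-0.5, elasticity)
    print(f"elasticity {elasticity}: spend changes by {change:+.1%}")
```

Under this assumption, total spend only rises after a price cut when elasticity is above 1, which is the regime the token market appears to be in.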
One reason is that cheaper tokens have made agents economically viable. At the same time, agents use tokens at rates that are orders of magnitude higher than those of chatbots for single-turn queries. That shows up as the total tokens processed per output token—advanced models do a lot of processing below the surface that a user doesn’t see.
A lot of this growth is driven by China’s domestic demand and its model providers, especially ByteDance and Alibaba.
The cost of the ghost token
When you use an AI agent, the final result you get is really just a summary of all the work the agent has undertaken. There may be dozens of tool calls to browse the web or load up a file to check and validate the work it has done. All of these steps consume tokens: they become hidden multipliers.
The first of these is token amplification.2 A coding agent that operates over 10 turns might need to re-read its full context every turn. That repetitive reading of context could use as many as 55x more tokens than a single-turn query for the same task.
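A rough way to see how re-reading compounds is sketched below. It assumes every turn re-reads the whole accumulated context and that each turn adds a fixed number of tokens. It ignores prompt caching and output tokens, and the sizes are hypothetical, so it will not reproduce the 55x figure exactly.

```python
def agent_input_tokens(turns: int, base_context: int, added_per_turn: int) -> int:
    """Total input tokens when each turn re-reads the full context so far."""
    total = 0
    context = base_context
    for _ in range(turns):
        total += context            # the model re-reads everything accumulated
        context += added_per_turn   # tool output and replies grow the context
    return total


def amplification(turns: int, base_context: int, added_per_turn: int) -> float:
    """Input tokens relative to a single-turn query over the base context."""
    return agent_input_tokens(turns, base_context, added_per_turn) / base_context


# Hypothetical sizes: 2,000-token starting context, 1,500 tokens added per turn.
print(amplification(10, 2_000, 1_500))  # 43.75
```

Because the context grows every turn, total input tokens grow roughly with the square of the number of turns rather than linearly.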
Actual active inference is probably only 15-20% of the total token consumption. The rest is invisible work that you, as a user, and possibly the company paying for it all, haven’t modelled.
The long tail of tool calls
Agents make anywhere between five and twenty-five tool calls per task. And each call adds more context, tokens and API costs. It also increases the likelihood that the model will need to retry the task to get it right.
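The retry effect can be sketched with a simple model. It assumes each tool call fails independently with the same probability, that any failure forces the whole task to be retried, and that every attempt runs all of its calls. These are simplifying assumptions for illustration, not measured behaviour.

```python
def expected_tool_calls(calls_per_attempt: int, per_call_failure: float) -> float:
    """Expected total tool calls until one attempt succeeds end to end."""
    # An attempt succeeds only if every one of its calls succeeds.
    p_success = (1.0 - per_call_failure) ** calls_per_attempt
    # The number of attempts is geometric, so the expected count is 1 / p_success.
    return calls_per_attempt / p_success


# Hypothetical 2% failure rate per call, at both ends of the range above.
for calls in (5, 25):
    print(calls, round(expected_tool_calls(calls, 0.02), 1))
```

In this model a task with 25 calls costs more than five times as much as a task with five calls, because longer chains are also more likely to be retried.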
The price of safety
Governance and safety costs3 are between 20% and 40% of total spend. Amazon Bedrock charges $0.15 per 1,000 text units for content filters, and Azure charges $0.38 per 1,000 text units. Added to model inference, those charges could amount to an extra 15-30% on every call.
For firms that use a frontier model as a judge to evaluate another LLM’s output, the evaluation step will cost almost as much as running the main model again. One study found this approach to be 18x more expensive than using smaller, purpose-built guardrail models.
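The per-unit filter prices quoted above can be turned into a per-call overhead with a few lines of arithmetic. This sketch assumes one text unit covers up to 1,000 characters. The 4,000-character call and the $0.004 inference cost are hypothetical placeholders, so check your provider's pricing page before relying on any of it.

```python
import math


def filter_cost_usd(num_chars: int, price_per_1000_units: float,
                    chars_per_unit: int = 1000) -> float:
    """Content-filter charge for one call, billed in whole text units."""
    # chars_per_unit=1000 is an assumption about how a text unit is defined.
    units = math.ceil(num_chars / chars_per_unit)
    return units * price_per_1000_units / 1000


def overhead_share(filter_cost: float, inference_cost: float) -> float:
    """Filter cost as a fraction of the inference cost for the same call."""
    return filter_cost / inference_cost


# Hypothetical call: 4,000 characters filtered, $0.004 of model inference.
for name, price in (("Bedrock", 0.15), ("Azure", 0.38)):
    cost = filter_cost_usd(4_000, price)
    print(f"{name}: ${cost:.5f} per call, {overhead_share(cost, 0.004):.0%} overhead")
```

The overhead share depends heavily on how cheap the underlying model is: the cheaper the inference, the larger the filter charge looms.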
Forecasting is off
What makes this particularly hard to manage is that the tools themselves cannot yet forecast their own costs. A new paper found that frontier models fail to accurately predict their token use.4
Runs on the same task can vary by as much as 30x in total tokens. So even those of us who have developed a sense for token consumption run into surprises. In the study, 6.7% of tasks lasting less than 15 minutes cost more than the average 1-hour task.
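If you log predicted and actual token counts for your own agent runs, two quick checks are the correlation between the two and the spread between the cheapest and most expensive run of the same task. The sketch below uses made-up numbers and plain Python. It is not the paper's methodology.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def run_spread(token_counts: list[int]) -> float:
    """Ratio of the most expensive run to the cheapest run of one task."""
    return max(token_counts) / min(token_counts)


# Hypothetical logs: predicted vs. actual tokens, and three runs of one task.
predicted = [50_000, 80_000, 120_000, 60_000, 200_000]
actual = [90_000, 40_000, 700_000, 65_000, 150_000]
print(round(pearson_r(predicted, actual), 2))
print(run_spread([40_000, 250_000, 1_200_000]))  # 30.0
```

A low correlation on your own logs is a signal to budget on observed distributions rather than on the model's own estimates.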
The cost management ladder
So, what does this mean for companies? We expect firms to climb a maturity ladder: first working out how to track token spend, then moving to more sophisticated observability and monitoring of model calls.
Firms will benchmark their internal processes against their previous experience and, when the data is available, against those of other companies. The ambition will be to optimize price-to-outcome ratios. This will all help clarify the unit economics for specific commercial outcomes. The value of observability and monitoring will increase, as evidenced by the share price rise of Datadog, a leading observability platform.
Long-term, attributing costs to specific teams, workflows, and customers will help finance teams budget for AI spending and decide how to route and tune model use.
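As a minimal sketch of what attribution could look like, assume every model call is logged with team, workflow and customer tags alongside its token counts. The `ModelCall` record, the tag names and the per-million-token prices below are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCall:
    team: str
    workflow: str
    customer: str
    input_tokens: int
    output_tokens: int


def call_cost_usd(call: ModelCall, input_price_per_m: float,
                  output_price_per_m: float) -> float:
    """Cost of one call given prices per million input and output tokens."""
    return (call.input_tokens * input_price_per_m
            + call.output_tokens * output_price_per_m) / 1_000_000


def spend_by(calls: list[ModelCall], key: str, input_price_per_m: float,
             output_price_per_m: float) -> dict[str, float]:
    """Total spend grouped by one tag: 'team', 'workflow' or 'customer'."""
    totals: dict[str, float] = defaultdict(float)
    for call in calls:
        totals[getattr(call, key)] += call_cost_usd(
            call, input_price_per_m, output_price_per_m)
    return dict(totals)


# Hypothetical log and prices ($3 per million input, $15 per million output).
calls = [
    ModelCall("support", "triage", "acme", 1_000_000, 0),
    ModelCall("support", "reply", "acme", 0, 1_000_000),
    ModelCall("eng", "review", "globex", 2_000_000, 0),
]
print(spend_by(calls, "team", 3.0, 15.0))
```

The same log can be regrouped by workflow or customer without extra instrumentation, which is what makes tagging at call time worth the effort.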
Taken together, the firms with the most mature approach to their AI deployments should start to understand the true unit economics of their outputs in the coming quarters.
1. This is the total volume of tokens (input + output) processed by LLMs across all providers. Our methodology uses a bottom-up, per-provider approach built up across 50+ providers. Cross-checks: revenue÷price, GPU utilisation, API call volume.
2. Token amplification is anything an agent does that inflates the number of tokens used per human interaction.
3. Content and policy filters.
4. The correlation between predicted and actual token consumption is weak to moderate, with a maximum of r = 0.39.

This all makes sense but it is also going to be a passing phase. AI will get more efficient at using tokens and the level of use will become more predictable as we learn.
This is the part CFOs will feel before most AI teams have language for it.
The price per token can keep falling while total AI spend keeps climbing, because agentic work is mostly hidden context loading, tool calls, retries, checks and evaluation. The unit of cost is no longer “one prompt.” It is a workflow.
The next layer after observability is reuse.
If every agent has to re-read the same repo, rediscover the same policy, repeat the same failed path, and relearn the same human correction, monitoring only tells you how expensive your forgetting is.
The companies that get this right will track token spend by team, workflow and customer, yes. But they will also capture which context, corrections and decisions actually improved outcomes, then make those reusable by the next run.
That is where the unit economics start to change: fewer ghost tokens, fewer repeated tool calls, fewer cold starts.
This is exactly why we are building Memco as shared memory for agentic work. One agent learns. The next one should not pay to rediscover it.