In advanced economies, knowledge workers represent roughly a quarter of the labor force and generate trillions in output. Even modest gains in their productivity, powered by advancements in AI, could trigger enormous economic returns.
But those gains won't simply come from making models bigger. They will come from teaching these models to reason systematically, an approach made possible by reinforcement learning. And it's already happening, big time.
Last week, Alibaba unveiled QwQ-32B, a model punching well above its weight. Though it is significantly smaller than competitors like DeepSeek-R1, early benchmarks show comparable performance: the latest evidence that reinforcement learning is changing the AI landscape.
Even the Qwen team noted, "Scaling RL has the potential to enhance model performance beyond conventional pre-training and post-training methods."
This marks a crucial turning point. We're no longer talking about AI systems parroting patterns from massive datasets. Instead, we're witnessing the rise of models capable of structured, methodical reasoning, a shift that could redefine knowledge work and directly impact your role, your industry, and even your future.
Here's how we got here.
Move 37
In the spring of 2016, the Go community was stunned. AlphaGo's Move 37 in its match against world champion Lee Sedol initially appeared to be a blunder. Even Sedol left the room, rattled by the machine's improbable move.
Many observers wrote it off as an odd mistake. Yet once the match ended, experts realized that centuries of human strategy had just been outmaneuvered by a distinctly machine-led perspective.
The secret behind this move was reinforcement learning.
Instead of relying on pre-coded strategies, RL systems learn optimal play through trial and error: they earn rewards for successful moves and penalties for poor ones. Over countless iterations, AlphaGo refined its gameplay well beyond conventional patterns.
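To make that concrete, here is a minimal sketch of the trial-and-error loop in Python: a toy Q-learning agent in a five-square world that learns, purely from +1 and -1 signals, to walk towards the rewarding end. (AlphaGo's actual training combined deep neural networks, self-play and tree search; treat this only as the skeleton of the idea, with all names and numbers illustrative.)

```python
import random

# A tiny one-dimensional world: start in the middle, reach the right end (+1),
# fall off the left end (-1). The agent learns only from these reward signals.
N_STATES = 5              # positions 0..4; 4 is the goal, 0 is the penalty state
ACTIONS = [-1, +1]        # step left or step right
EPSILON = 0.1             # how often to explore a random move
ALPHA, GAMMA = 0.5, 0.9   # learning rate and discount factor

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Environment dynamics: returns (next_state, reward, done)."""
    nxt = state + action
    if nxt <= 0:
        return 0, -1.0, True               # penalty for falling off the left edge
    if nxt >= N_STATES - 1:
        return N_STATES - 1, +1.0, True    # reward for reaching the goal
    return nxt, 0.0, False

for episode in range(500):
    state, done = N_STATES // 2, False
    while not done:
        # Trial and error: usually exploit the best-known move, sometimes explore.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward, done = step(state, action)
        # Nudge the value estimate towards reward + discounted future value.
        best_next = 0.0 if done else max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = nxt

# After training, the learned policy prefers +1 (move right) in every interior state.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(1, N_STATES - 1)})
```

Nothing in the loop tells the agent which direction is "good"; the preference for moving right emerges entirely from accumulated rewards, which is, at vastly greater scale, what happened inside AlphaGo.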
This moment marked the shift from pattern matching to systematic problem-solving. Until then, many had assumed computers were just replaying memorized moves. But here was AlphaGo inventing a move grandmasters never anticipated, crossing a threshold from rote calculation to meaningful innovation.
The drama seen on the Go board is now unfolding in the realm of human language. Reinforcement learning excelled in precise, well-defined environments like board games, where outcomes are clear-cut: did the AI improve the board position? Did it win or lose? Those simple yes/no signals propelled AlphaGo to superhuman levels of play.
Outside of games, however, RL faced obstacles. Many real-world tasks lack clean "victory conditions," so it was harder to apply the same reward-based methods.
Meanwhile, by 2019, OpenAI's GPT-2 drew massive attention with its ability to produce coherent text. Other large language models quickly followed, and RL took a backseat as researchers rushed to harness the power of large-scale text prediction.
But RL never disappeared; it was just regrouping.
Now, it has returned to the spotlight to combine language generation with the rigorous, step-by-step logic that made Move 37 possible in Go.
RL be back
With the release of o1-preview at the end of last year, OpenAI introduced "reasoning models" that "think" before answering a user's query.
This thinking process was trained using RL, which improved the model's performance on STEM and coding tasks: problems that require multi-step logic rather than simple word prediction.
A good example of RL's strengths comes from a coding challenge set by Andrej Karpathy. He asked a model for a Settlers of Catan-style hex grid on a webpage, with each tile numbered and the entire grid size adjustable via a slider. In other words, a single HTML page that can dynamically generate and label an ever-changing hexagonal layout.
An RL-trained reasoning model systematically breaks the request into smaller tasks:
Define the problem (creating an interactive hex-grid webpage)
Plan the technical approach (selecting appropriate tools like HTML, CSS, JavaScript and SVG)
Solve geometric and numbering logic (calculating coordinates, hex shapes, and numbering the tiles)
Implement interactivity and finalize the solution (adding a slider for adjusting board size, centring the grid, and producing usable code)
By handling each sub-problem in sequence, the RL-based model can produce a working solution that genuinely meets the prompt.
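To give a flavour of step three, here is a small Python sketch of the coordinate and numbering logic a Catan-style board needs. It is not the model's actual output (that would be an HTML page with JavaScript and SVG); it just uses standard axial hex coordinates, and the function name and tile size are illustrative.

```python
import math

def hex_centres(radius, size=30):
    """Centres and labels for a Catan-style hexagonal board.
    `radius` is the number of rings around the centre tile; `size` is the
    hex circumradius in pixels. Uses standard axial coordinates, pointy-top."""
    tiles, n = [], 1
    for q in range(-radius, radius + 1):
        for r in range(max(-radius, -q - radius), min(radius, -q + radius) + 1):
            # Axial (q, r) -> pixel coordinates for pointy-top hexagons.
            x = size * math.sqrt(3) * (q + r / 2)
            y = size * 1.5 * r
            tiles.append({"number": n, "x": round(x, 1), "y": round(y, 1)})
            n += 1
    return tiles

board = hex_centres(2)           # radius 2 gives the classic 19-tile layout
print(len(board), board[:3])
```

Getting this intermediate geometry right is exactly the kind of sub-problem a reasoning model has to work through before the grid on the page can look correct.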
In contrast, a standard large language model like GPT-4 might produce code that looks promising but doesn't fully solve the geometry or interaction details.
It's not just for coding either. OpenAI's Deep research also employed RL training to conduct multi-step investigations rather than produce one-shot answers. It's not perfect, but it offers an early idea of RL's potential to make AI perform economically significant tasks.
From language to reasoning
Reinforcement learning seemed a natural way to improve large language models, yet applying it to language was far from straightforward. Models like GPT-3 learn to predict the next word by identifying patterns in massive amounts of text; in GPT-3's case, about 200 billion words.
But RL hinges on accumulating rewards, such as scoring a point in a video game or capturing a piece in chess, and it was unclear how to define a similar "reward" for language. In a game of chess, the payoff is obvious when a piece is taken. But what is the correct answer for completing the sentence "I love it when it…"?
Initially, RL was used to fine-tune LLMs with human feedback, ensuring they responded politely and helpfully. This was pioneered by Christiano et al. (2017) and later popularised by OpenAI's InstructGPT. In essence, reinforcement learning from human feedback (RLHF) made models more user-friendly rather than sharpening their underlying reasoning skills.
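At the heart of RLHF sits a reward model trained on pairs of responses that humans have ranked. A minimal sketch of that pairwise objective, with made-up scores standing in for a real reward model, looks like this:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise objective for a reward model: push the score of the response
    humans preferred above the score of the one they rejected."""
    return -math.log(1 / (1 + math.exp(-(reward_chosen - reward_rejected))))

# Hypothetical scores a reward model assigns to two candidate replies.
polite_reply_score, curt_reply_score = 1.8, 0.3
print(preference_loss(polite_reply_score, curt_reply_score))   # small loss: ranking matches the human
print(preference_loss(curt_reply_score, polite_reply_score))   # large loss: ranking contradicts the human
```

The language model is then fine-tuned (in InstructGPT's case with the PPO algorithm) to produce answers the reward model scores highly, which is why RLHF tends to buy politeness and helpfulness rather than deeper reasoning.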
That began to change when researchers started applying RL to tasks with clear success criteria, such as writing bug-free code or solving multi-step math puzzles. Here, language outputs can be measured objectively: a theorem can be proven or disproven by a logic checker, and code either passes its unit tests or fails. With reliable ways to check correctness, models can be rewarded (or penalised) based on how accurately they solve the problem, rather than on how "nice" they sound.
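A toy version of such a verifiable reward, assuming for the sake of illustration that the model is asked to write a Python function called `solve`, might look like this:

```python
def verifiable_reward(candidate_source: str, test_cases) -> float:
    """Binary reward for a code-generation task: 1.0 if the candidate passes
    every unit test, 0.0 otherwise. No human judgement of tone is involved."""
    namespace = {}
    try:
        exec(candidate_source, namespace)        # define the candidate function
        solve = namespace["solve"]               # naming convention assumed for this sketch
        return 1.0 if all(solve(x) == y for x, y in test_cases) else 0.0
    except Exception:
        return 0.0                               # crashes and wrong answers score zero

# A model-generated attempt at "return the square of x", plus its test suite.
attempt = "def solve(x):\n    return x * x\n"
tests = [(2, 4), (-3, 9), (0, 0)]
print(verifiable_reward(attempt, tests))         # 1.0 -> this sample gets reinforced
```

The crucial property is that the signal is objective: the reward comes from the test suite, not from anyone's impression of how good the answer sounds.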
Over the past six months, AI labs including OpenAI, Google, Anthropic, xAI and leading Chinese research teams have introduced advanced models that leverage RL for what is called chain-of-thought reasoning. These systems begin with a powerful language model and then use RL to refine their ability to think step-by-step. Complex tasks, like coding hex grids or solving physics equations, are broken into logical steps, each of which is tested for accuracy or consistency, often via automated verifiers or test suites. Correct steps are rewarded, gradually improving the model's reasoning process.
A good illustration of this method is the DeepSeek-R1 reasoning model. According to its research paper, DeepSeek first created a precursor model, R1-Zero, which used RL exclusively to attempt various tasks. This produced an extensive dataset of over 600,000 reasoning examples. After filtering and refining this data, DeepSeek trained the final R1 model, blending robust problem-solving skills with coherent language generation. The resulting reasoning model doesn't just predict the next word; it learns to systematically think through, plan and revise its approach to a problem. In this way it can tackle tasks that would flummox traditional LLMs.

This training process sometimes leads to surprisingly human-like leaps. During DeepSeek-R1-Zero development, logs showed a moment when the model paused mid-thought, noticed a contradiction and revised its approach, without human direction to "stop, reflect, and backtrack."
R1-Zero spontaneously developed self-correction. Over repeated trials, the system learned to double-check its logic, much like someone spotting a misplaced puzzle piece. It was as though the model discovered how to assess its own reasoning.
Educational theorists have long emphasised trial-and-error plus reflective feedback. Jean Piaget's theory of cognitive development showed how new information forces children to adapt. R1-Zero's cycle of propose-reward-revise closely echoed these ideas, hinting at metacognition, though it lacks human-like consciousness. Its "thoughts" are digital tokens, its "reward" a numeric measure, yet the parallels to human learning are there.

Similarly, OpenAI's o3 gained attention because it no longer needed pre-programmed shortcuts. Earlier models, like o1-ioi, relied on carefully designed rules to solve coding tasks, but o3 took a different approach. By expanding its reinforcement learning stage, it learned how to independently generate, test and improve its solutions. In fact, o3 became skilled enough to perform detailed checks on its solutions, sometimes even testing multiple options to see which worked best. It's now so capable that it ranks among the top 200 competitors on Codeforces, a prestigious competitive programming site.
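OpenAI has not published o3's internals, but the "test multiple options and keep what works" behaviour resembles sampling several candidate solutions and letting an automated checker pick a winner. A rough sketch, with a stand-in playing the role of the model:

```python
import random

def passes_tests(candidate, tests):
    """True if the candidate function gets every test case right."""
    try:
        return all(candidate(x) == y for x, y in tests)
    except Exception:
        return False

def best_of_n(sample_solution, tests, n=8):
    """Sample several solutions and keep the first one the checker accepts."""
    for _ in range(n):
        candidate = sample_solution()
        if passes_tests(candidate, tests):
            return candidate
    return None

# Stand-in for a model: some samples are buggy, some are correct.
def fake_model_sample():
    return random.choice([
        lambda x: x + x,      # buggy attempt
        lambda x: x * x,      # correct attempt
    ])

tests = [(2, 4), (3, 9), (5, 25)]
print("found a passing solution:", best_of_n(fake_model_sample, tests) is not None)
```

In training, the same checker can double as the reward signal; at inference time, it simply filters out the duds.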
But what if we take it further? Imagine your task involves finding an obscure spreadsheet buried somewhere in your company's Google Drive, running an analysis on it and sending the insights to senior management.
Reasoning models already do well at breaking down problems and planning logical steps, but what if we gave them tools? What if we turned them into fully autonomous agents capable of independently searching, analysing, and delivering results?
Well, it turns out you can use RL to build functioning agents.
Unlocking agents
Consider OpenAI's Deep research. Unlike earlier AI models that required explicit instructions or carefully handcrafted rules, Deep research was trained using end-to-end reinforcement learning. It autonomously crafts research strategies and dynamically adapts them based on real-time search results, much like a human analyst exploring uncharted territory.
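OpenAI has not detailed Deep research's architecture, but agents of this kind generally run a plan-act-observe loop around a reasoning model and a handful of tools. Here is a generic sketch with stand-in functions; a real agent would call an actual model and search API, and end-to-end RL would shape how the model chooses its next move.

```python
def call_model(history):
    """Stand-in for a reasoning model deciding its next move. Here it searches
    once and then finishes; a real agent would decide this dynamically."""
    already_searched = any(turn["role"] == "tool" for turn in history)
    if not already_searched:
        return {"action": "search", "query": history[0]["content"]}
    return {"action": "finish", "answer": "Summary of: " + history[-1]["content"]}

def web_search(query):
    """Stand-in for a search tool; a real agent would call an actual search API."""
    return f"(top results for '{query}')"

def research_agent(task, max_steps=10):
    """The plan-act-observe loop that end-to-end RL is used to shape."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = call_model(history)                  # plan the next move
        if decision["action"] == "finish":
            return decision["answer"]                   # report back to the user
        observation = web_search(decision["query"])     # act on the world
        history.append({"role": "tool", "content": observation})  # feed results back in
    return "Ran out of steps without a confident answer."

print(research_agent("Find last quarter's sales spreadsheet and summarise it"))
```

End-to-end training means the whole loop, not just individual answers, gets optimised against the final outcome.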
Josh Tobin of OpenAI describes this as a critical turning point:
If you're able to optimize directly for the outcome you're looking for, the results are going to be much, much better than if you try to glue together models not optimized end-to-end.
Tobin emphasizes that reinforcement learning is likely essential. The model independently plans, executes and refines its approach without scripted directions or frequent human oversight.
Imagine AI agents independently backtesting investment strategies in finance, verifying arguments against legal precedents, refining medical diagnoses through simulations, or automating complex engineering validations. In Tobin's words:
There's nothing stopping that recipe from scaling to increasingly complex tasks… It will surprise people.
Agents empowered with autonomous reasoning and multi-step planning could reshape entire industries: fundamentally, not incrementally.
The economic implications
There is an economic earthquake waiting to happen. GPT-4o managed just 30% on SWE-Bench's real-world coding tasks, while Claude 3.7 Sonnet, a reasoning model, achieved 70%. That 40-point gap is already reflected on the bleeding edge of our economy: 25% of the latest Y Combinator startups generate 95% of their code with AI.
Why? Because AI-powered development leads to blazing-fast minimum viable products. Even the now-primitive, 2023-era GitHub Copilot (think the "Windows 95" of coding AIs) helped developers finish tasks 50% faster. I can only guess how much faster it is now with products like Cursor and Windsurf, and the number would be so wild I wouldn't dare write it down without peer-reviewed evidence.
Software engineering might be the most obvious early frontier. Developers sit right next to AI's beating heart, but other analytical fields are following closely behind.