In advanced economies, knowledge workers represent roughly a quarter of the labor force and generate trillions in output. Even modest gains in their productivity, powered by advancements in AI, could trigger enormous economic returns.
But those gains won't come simply from making models bigger. They'll come from teaching these models to reason systematically, an approach made possible by reinforcement learning (RL). And it's already happening, big time.
Last week, Alibaba unveiled QwQ-32B, a model punching well above its weight. Despite being significantly smaller than competitors like DeepSeek-R1, early benchmarks show comparable performance: the latest evidence that reinforcement learning is changing the AI landscape.
Even the Qwen team noted, "Scaling RL has the potential to enhance model performance beyond conventional pre-training and post-training methods."
This marks a crucial turning point. We're no longer talking about AI systems parroting patterns from massive datasets. Instead, we're witnessing the rise of models capable of structured, methodical reasoning, a shift that could redefine knowledge work and directly impact your role, your industry, and even your future.
Here's how we got here.
Move 37
In the spring of 2016, the Go community was stunned. AlphaGo's Move 37 in its match against world champion Lee Sedol initially appeared to be a blunder. Even Sedol left the room, rattled by the machine's improbable move.
At first, many observers wrote off AlphaGo's Move 37 as an odd mistake. Yet once the match ended, experts realized that centuries of human strategy had just been outmaneuvered by a distinctly machine-led perspective.
The secret behind this move was reinforcement learning.
Instead of relying on pre-coded strategies, RL systems learn optimal play through trial and error: they earn rewards for successful moves and penalties for poor ones. Over countless iterations, AlphaGo refined its gameplay well beyond conventional patterns.
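To make that loop concrete, here is a minimal sketch of reinforcement learning in Python. It is a toy, nothing like AlphaGo's actual setup of deep networks plus tree search: a tabular agent in a six-cell corridor that starts with no strategy at all and, through rewards alone, learns to walk toward the goal.

```python
import random

# Toy environment: a corridor of 6 cells; the agent starts at cell 0
# and earns a reward of +1 only when it reaches the goal at cell 5.
N_STATES, GOAL = 6, 5
ACTIONS = [-1, +1]          # step left or right

def step(state, action):
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

# Tabular Q-learning: estimate the value of each (state, action) pair
# purely from trial and error -- no strategy is pre-coded.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for episode in range(500):
    state, done = 0, False
    while not done:
        # Explore occasionally, otherwise exploit the best known move.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Nudge the estimate toward reward + discounted future value.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# After training, the learned policy marches straight toward the goal.
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)])
```

The point is the shape of the process: act, receive a reward signal, update, repeat, until behaviour that was never explicitly programmed emerges.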
This moment marked the shift from pattern matching to systematic problem-solving. Until then, many had assumed computers were just replaying memorized moves. But here was AlphaGo inventing a move grandmasters never anticipated, crossing a threshold from rote calculation to meaningful innovation.
The drama seen on the Go board is now unfolding in the realm of human language. Reinforcement learning excelled in precise, well-defined environments like board games, where outcomes are clear-cut: did the AI improve the board position? Did it win or lose? Those simple yes/no signals propelled AlphaGo to superhuman levels of play.
Outside of games, however, RL faced obstacles. Many real-world tasks lack clean "victory conditions," so it was harder to apply the same reward-based methods.
Meanwhile, by 2019, OpenAI's GPT-2 drew massive attention with its ability to produce coherent text. Other large language models quickly followed, and RL suddenly took a backseat as researchers rushed to harness the power of large-scale text prediction.
But RL never disappeared; it was just regrouping.
Now, it has returned to the spotlight to combine language generation with the rigorous, step-by-step logic that made Move 37 possible in Go.
RL be back
With the release of o1-preview at the end of last year, OpenAI introduced "reasoning models" that "think" before answering a user's query.
This thinking process was trained using RL, which improved the model's performance on STEM and coding tasks: problem-solving that requires multi-step logic rather than simple word prediction.
A good example of RL's strengths comes from a coding challenge set by Andrej Karpathy. He asked a model for a Settlers of Catan-style hex grid on a webpage, with each tile numbered and the entire grid size adjustable via a slider. In other words, a single HTML page that can dynamically generate and label an ever-changing hexagonal layout.
An RL-trained reasoning model systematically breaks the request into smaller tasks:
Define the problem (creating an interactive hex-grid webpage)
Plan the technical approach (selecting appropriate tools like HTML, CSS, JavaScript and SVG)
Solve geometric and numbering logic (calculating coordinates, hex shapes, and numbering the tiles)
Implement interactivity and finalize the solution (adding a slider for adjusting board size, centring the grid, and producing usable code)
By handling each sub-problem in sequence, the RL-based model can produce a working solution that genuinely meets the prompt.
In contrast, a standard large language model like GPT-4 might produce code that looks promising but doesn't fully solve the geometry or interaction details.
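To see why step 3 is the crux, here is a hand-written Python sketch of just that geometric core. It is not Karpathy's solution or any model's actual output, and on a real webpage the same maths would live in JavaScript, but it shows the coordinate and numbering logic a reasoning model has to get right: axial hex coordinates converted to pixel centres, corners computed per tile, and the whole board emitted as SVG.

```python
import math

def hex_board_svg(radius=2, size=40):
    """Emit an SVG string for a Catan-style hex board.

    radius: how many rings of tiles around the centre (2 gives 19 tiles).
    size:   distance from a hexagon's centre to any corner, in pixels.
    """
    tiles = []
    # Axial coordinates (q, r) of every tile within `radius` rings.
    for q in range(-radius, radius + 1):
        for r in range(max(-radius, -q - radius), min(radius, -q + radius) + 1):
            # Pointy-top layout: convert axial coords to pixel centres.
            x = size * math.sqrt(3) * (q + r / 2)
            y = size * 1.5 * r
            tiles.append((x, y))

    # Number tiles in the order generated and build the SVG markup.
    half = radius * 2 * size + size  # rough half-width of the canvas
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" '
             f'viewBox="{-half} {-half} {2*half} {2*half}">']
    for n, (cx, cy) in enumerate(tiles, start=1):
        corners = []
        for i in range(6):
            angle = math.radians(60 * i - 30)  # pointy-top corner angles
            corners.append(f"{cx + size*math.cos(angle):.1f},"
                           f"{cy + size*math.sin(angle):.1f}")
        parts.append(f'<polygon points="{" ".join(corners)}" '
                     f'fill="#eee" stroke="#333"/>')
        parts.append(f'<text x="{cx:.1f}" y="{cy:.1f}" '
                     f'text-anchor="middle">{n}</text>')
    parts.append("</svg>")
    return "\n".join(parts)

# The slider on the page would simply re-call this with a new radius
# and swap in the freshly generated SVG.
print(hex_board_svg(radius=2))
```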
It's not just for coding either. OpenAI's Deep research also employed RL training to conduct multi-step investigations instead of producing one-shot answers. It's not perfect, but it offers an early idea of RL's potential to make AI perform economically significant tasks.
From language to reasoning
Reinforcement learning seemed a natural way to improve large language models, yet applying it to language was far from straightforward. Models like GPT-3 learn to predict the next word by identifying patterns in massive amounts of text; in GPT-3's case, about 200 billion words.
But RL hinges on accumulating rewards, such as scoring a point in a video game or capturing a piece in chess, and it was unclear how to define a similar "reward" for language. In a game of chess, the payoff is obvious when a piece is taken. But what is the correct answer for completing the sentence "I love it when it…"?
Initially, RL was used to fine-tune LLMs with human feedback, ensuring they responded politely and helpfully. This was pioneered by Christiano et al. (2017) and later popularised by OpenAI's InstructGPT. In essence, reinforcement learning from human feedback (RLHF) made models more user-friendly rather than sharpening their underlying reasoning skills.
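In outline, RLHF works something like the toy sketch below. Everything here is illustrative: the preference pairs are made up and the "features" are crude stand-ins for a learned representation, but the core idea is the same: fit a reward model so that human-preferred replies score higher, then use that score to steer the model.

```python
import math, random

def features(reply):
    # Stand-in for a learned representation: a couple of crude signals.
    return [1.0, len(reply.split()) / 20.0, 1.0 if "sorry" in reply.lower() else 0.0]

def reward(weights, reply):
    return sum(w * f for w, f in zip(weights, features(reply)))

# Hypothetical human preference data: (preferred reply, rejected reply).
preferences = [
    ("Sure, here's a step-by-step explanation of how DNS works.", "idk google it"),
    ("I can help with that. First, check the error log.", "That question is dumb."),
]

# Fit the reward so preferred replies score higher (a Bradley-Terry objective).
weights = [0.0, 0.0, 0.0]
lr = 0.1
for _ in range(200):
    chosen, rejected = random.choice(preferences)
    # Probability the reward model assigns to the human's choice.
    margin = reward(weights, chosen) - reward(weights, rejected)
    p = 1.0 / (1.0 + math.exp(-margin))
    # Gradient ascent on the log-likelihood of the human preference.
    for i, (fc, fr) in enumerate(zip(features(chosen), features(rejected))):
        weights[i] += lr * (1.0 - p) * (fc - fr)

# In full RLHF, this learned reward would drive a policy-gradient update
# (e.g. PPO) of the language model itself; here we simply rerank outputs.
candidates = ["That question is dumb.", "Happy to help! Could you share the traceback?"]
print(max(candidates, key=lambda c: reward(weights, c)))
```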
That began to change when researchers started applying RL to tasks with clear success criteria, such as writing bug-free code or solving multi-step math puzzles. Here, language outputs can be measured objectively: a theorem can be proven or disproven by a logic checker, and code either passes its unit tests or fails. With reliable ways to check correctness, models can be rewarded (or penalised) based on how accurately they solve the problem, rather than on how "nice" they sound.
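A verifiable reward of this kind can be surprisingly simple. The sketch below is an assumption about the general recipe rather than any lab's exact implementation: the generated program is executed against unit tests, and the reward is binary.

```python
# Reward for code generation: run the model's answer against unit tests.
def reward_for_code(generated_code: str, tests: list[tuple[tuple, object]]) -> float:
    namespace = {}
    try:
        exec(generated_code, namespace)          # define the candidate function
        solve = namespace["solve"]               # the prompt asks for `solve`
        passed = all(solve(*args) == expected for args, expected in tests)
        return 1.0 if passed else 0.0
    except Exception:
        return 0.0                               # crashes earn no reward

# Example task: "return the n-th Fibonacci number".
tests = [((0,), 0), ((1,), 1), ((7,), 13)]

good = "def solve(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a"
bad = "def solve(n):\n    return n * 2"

print(reward_for_code(good, tests), reward_for_code(bad, tests))  # 1.0 0.0
```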
Over the past six months, AI labs including OpenAI, Google, Anthropic, xAI and leading Chinese research teams have introduced advanced models that leverage RL for what is called chain-of-thought reasoning. These systems begin with a powerful language model and then use RL to refine their ability to think step-by-step. Complex tasks, such as coding hex grids or solving physics equations, are broken into logical steps, each of which is tested for accuracy or consistency, often via automated verifiers or test suites. Correct steps are rewarded, gradually improving the model's reasoning process.
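Step-level checking can be sketched in the same spirit. The verifier below is deliberately crude, recognising only simple arithmetic claims where real systems lean on proof checkers, compilers and test suites, but it shows the shape: score each step of the chain of thought, not just the final answer.

```python
import re

# Each line of a chain of thought containing an arithmetic claim such as
# "12 * 7 = 84" is checked; the reward is the fraction of steps that hold up.
STEP = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

def step_reward(chain_of_thought: str) -> float:
    checked, correct = 0, 0
    for line in chain_of_thought.splitlines():
        m = STEP.search(line)
        if not m:
            continue                      # not a verifiable step; skip it
        a, op, b, claimed = m.groups()
        checked += 1
        if OPS[op](int(a), int(b)) == int(claimed):
            correct += 1
    return correct / checked if checked else 0.0

trace = """Each box holds 12 apples and there are 7 boxes: 12 * 7 = 84.
We give away 20: 84 - 20 = 64.
So 64 apples remain."""
print(step_reward(trace))   # 1.0 -- both intermediate steps check out
```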
A good illustration of this method is the DeepSeek-R1 reasoning model. According to their research paper, DeepSeek first created a precursor model, R1-Zero, which used RL exclusively to attempt various tasks. This produced an extensive dataset of over 600,000 reasoning examples. After filtering and refining this data, DeepSeek trained the final R1 model, blending robust problem-solving skills with coherent language generation. The resulting reasoning model doesn't just predict the next word; it learns to systematically think through, plan and revise its approach to a problem. This way, it can tackle tasks that would flummox traditional LLMs.
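Schematically, that filter-then-finetune recipe looks like the toy below. It is purely illustrative, with a fake "model" and two arithmetic problems standing in for DeepSeek's 600,000-plus examples: sample several attempts per problem, keep only the traces a checker accepts, and use the survivors as supervised training data.

```python
import random

problems = [("What is 17 + 25?", 42), ("What is 9 * 8?", 72)]

def noisy_attempt(question, true_answer):
    """Stand-in for an RL-trained model's sampled reasoning trace."""
    answer = true_answer if random.random() < 0.6 else true_answer + random.choice([-2, 1, 3])
    return {"text": f"Let me work through it step by step... the answer is {answer}.",
            "final_answer": answer}

dataset = []
for question, truth in problems:
    for _ in range(8):                                # several samples per problem
        trace = noisy_attempt(question, truth)
        if trace["final_answer"] == truth:            # verifier: keep correct traces only
            dataset.append({"prompt": question, "completion": trace["text"]})

# These curated (prompt, completion) pairs would then be used to fine-tune
# the final model with ordinary supervised learning.
print(f"kept {len(dataset)} of {len(problems) * 8} sampled traces")
```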

This training process sometimes leads to surprisingly human-like leaps. During DeepSeek-R1-Zero development, logs showed a moment when the model paused mid-thought, noticed a contradiction and revised its approach, without human direction to "stop, reflect, and backtrack."
R1-Zero spontaneously developed self-correction. Over repeated trials, the system learned to double-check its logic, much like someone spotting a misplaced puzzle piece. It was as though the model discovered how to assess its own reasoning.
Educational theorists have long emphasised trial-and-error plus reflective feedback. Jean Piaget's theory of cognitive development showed how new information forces children to adapt. R1-Zero's cycle of propose, reward, revise closely echoed these ideas, hinting at metacognition, though it lacks human-like consciousness. Its "thoughts" are digital tokens, its "reward" a numeric measure, yet the parallels to human learning are there.

Similarly, OpenAI's o3 gained attention because it no longer needed pre-programmed shortcuts. Earlier models, like o1-ioi, relied on carefully designed rules to solve coding tasks, but o3 took a different approach. By expanding its reinforcement learning stage, it learned how to independently generate, test and improve its solutions. In fact, o3 became skilled enough to perform detailed checks on its solutions, sometimes even testing multiple options to see which worked best. It's now so capable that it ranks among the top 200 competitors on CodeForces, a prestigious competitive programming site.
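We don't know o3's internals, but the behaviour described, generating several candidates, testing each and keeping the best, has a simple general shape. A toy version, with a trivial task standing in for a competitive-programming problem:

```python
def best_of_n(candidates, tests):
    """Run every candidate against the tests and keep the top scorer."""
    def score(solution):
        try:
            return sum(solution(x) == expected for x, expected in tests)
        except Exception:
            return -1                      # crashing candidates rank last
    return max(candidates, key=score)

# Task: "return the square of x". Two sampled candidates, one subtly wrong.
candidate_a = lambda x: x * x
candidate_b = lambda x: x ** 2 if x >= 0 else -(x ** 2)   # mishandles negatives

tests = [(3, 9), (-4, 16), (0, 0)]
winner = best_of_n([candidate_b, candidate_a], tests)
print(winner(5))   # 25 -- the fully correct candidate was selected
```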
But what if we take it further? Imagine your task involves finding an obscure spreadsheet buried somewhere in your company's Google Drive, running an analysis on it and sending the insights to senior management.
Reasoning models already do well at breaking down problems and planning logical steps, but what if we gave them tools? What if we turned them into fully autonomous agents capable of independently searching, analysing, and delivering results?
Well, it turns out you can use RL to make functioning agents.
Unlocking agents
Consider OpenAI's Deep research. Unlike earlier AI models that required explicit instructions or carefully handcrafted rules, Deep research was trained using end-to-end reinforcement learning. It autonomously crafts research strategies and dynamically adapts based on real-time search results, much like a human analyst exploring uncharted territory.
Josh Tobin of OpenAI describes this as a critical turning point:
If you're able to optimize directly for the outcome you're looking for, the results are going to be much, much better than if you try to glue together models not optimized end-to-end.
Tobin emphasizes that reinforcement learning is likely essential. The model independently plans, executes and refines its approach without scripted directions or frequent human oversight.
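The loop behind such an agent can be sketched as follows. This is not OpenAI's implementation; `llm_decide`, `web_search` and `read_page` are hypothetical stand-ins for the policy and its tools, and in training the whole loop would be scored end-to-end on the quality of the final report.

```python
def web_search(query):                      # stub tool
    return [f"https://example.com/report-on-{query.replace(' ', '-')}"]

def read_page(url):                         # stub tool
    return f"(contents of {url})"

def llm_decide(goal, notes):
    """Stand-in for the RL-trained policy choosing the next action."""
    if not notes:
        return ("search", goal)                       # nothing gathered yet: search
    if notes[-1]["action"] == "search":
        return ("read", notes[-1]["result"][0])       # open the top hit
    return ("finish", None)                           # enough gathered: write up

def research_agent(goal, max_steps=10):
    notes = []
    for _ in range(max_steps):
        action, arg = llm_decide(goal, notes)
        if action == "search":
            notes.append({"action": "search", "result": web_search(arg)})
        elif action == "read":
            notes.append({"action": "read", "result": read_page(arg)})
        else:                               # "finish": synthesise the findings
            return f"Report on '{goal}', drawing on {len(notes)} observations."
    return "Ran out of steps."

print(research_agent("european battery supply chains"))
```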
Imagine AI agents independently backtesting investment strategies in finance, verifying arguments against legal precedents, refining medical diagnoses through simulations, or automating complex engineering validations. In Josh Tobin's words:
There's nothing stopping that recipe from scaling to increasingly complex tasks… It will surprise people.
Agents empowered with autonomous reasoning and multi-step planning could reshape entire industries: fundamentally, not incrementally.
The economic implications
There is an economic earthquake waiting to happen. GPT-4o managed just 30% on SWE-bench's real-world coding tasks, while Claude 3.7 Sonnet, a reasoning model, achieved 70%. That 40-point gap is already reflected on the bleeding edge of our economy: 25% of the latest Y Combinator startups generate 95% of their code with AI.
Why? Because AI-powered development leads to blazing-fast minimum viable products. Even the now-primitive 2023-era GitHub Copilot (the "Windows 95" of coding AIs) helped developers finish tasks 50% faster. I can only guess how much faster it is now with products like Cursor and Windsurf, and the number would be so wild I wouldn't dare write it down without peer-reviewed evidence.
Software engineering might be the most obvious early frontier. Developers sit right next to AI's beating heart, but other analytical fields are following closely behind.