š® Autoresearch and the experimental society
On bringing the scientific method into everyday knowledge work
Science is the most reliable method humanity has found for producing knowledge. It has also, for most of history, been expensive to run.
Andrej Karpathy released 600 lines of Python code a few weeks ago that started to change that. His autoresearch (see EV#565) runs an autonomous experimental loop in which a human sets a strategic direction, defines what good looks like and the agent iterates towards success within the guardrails. In Andrejās initial experiment, it trained a GPT-2-level model over two days, 11% faster and found 20 genuine improvements.
Shortly after the release, Shopifyās CEO Toby Lütke used autoresearch on his companyās internal model, qmd; it ran 37 experiments overnight, so Toby woke up to a 0.8-billion-parameter model outscoring his previous 1.6-billion-parameter version by 19%. Toby is not a machine learning engineer.
Autoresearch is powerful because it solves two problems at once. One, it automates part of the knowledge-production process. And two, it solves the agent control problem, meaning it keeps agents on task. AI often drifts if you give it an open-ended brief or optimize for the wrong thing. To my great joy, autoresearch prevents this by design. The human decides where the car is going; autoresearch keeps its hands on the wheel.
I spent the last month adapting autoresearch for knowledge work beyond machine learning with the goal of spinning up a system that can run structured, low-cost experiments on the kinds of decisions most teams make every week. Iām calling this version AutoBeta, and Iām making the full playbook/skill available to paying members below.
Letās go!
The measurement problem
When I first saw autoresearch, my immediate reaction was that it didnāt have to be just about machine learning. The loop is generic ā hypothesize, test, score, iterate. So I cloned it and started running it on other bits of my work.
It didnāt go exactly as I expected. The outputs looked fine but I couldnāt tell whether they were getting better. Unlike ML where the agent has a built-in feedback signal from each training run, knowledge work was missing it. A pricing decision doesnāt validate in five minutes; and the paragraphs I write donāt tell me if the argument is getting better or just changing, most of the time.
This is what makes applying autoresearch to knowledge work genuinely hard. The loop needs something to optimize against, and in knowledge work, that something doesnāt exist naturally.
So I constructed a version of autoresearch, called AutoBeta, that works across a wide range of business problems. It is not as technically robust as Karpathyās, but it has the same design principles: I set the objective and the constraints, the experimentation takes place within the loop.
The one thing I changed was the score. I created āan oracleā, a panel of synthetic judges scoring each output against pre-defined criteria, collapsed into a single number the loop can optimize for.

