Hi all, it’s Azeem.
There’s some evidence suggesting that AI is making workers more productive. But what if we’re getting it all wrong?
Just because engineers who use AI seem more productive doesn’t mean AI made them more productive. It might just be that the best engineers are the ones most eager to adopt new tools.
This is an old problem in economics. For decades, researchers struggled to separate true cause-and-effect from mere correlation. Then came the “credibility revolution,” a shift in how economists think about causality.
Andrew McAfee has unpacked this concept to show how advanced statistics can distinguish genuine productivity gains from selection bias. It’s a rigorous, eye-opening approach that underlies Workhelix, the startup he cofounded with Erik Brynjolfsson and Daniel Rock.

This essay first appeared on Andrew’s Substack, The Geek Way (which is also the title of his most recent book), and he kindly agreed to share it with Exponential View. I hope it resonates with you all as it did with me.
Read on!
Azeem
AI Revolution, Meet the Credibility Revolution
By Andrew McAfee
We received a first batch of data from a customer a couple of weeks ago and got to work on it.
The data were from a retailer who, like everyone else these days, is interested in putting generative AI to work. An obvious place to start that journey is with software development. Software used to be really bad at writing software, but thanks to GenAI the situation has changed deeply and quickly. Lots of companies are now using GenAI to write code, and some are apparently using GenAI to write all of their code. As a recent headline in TechCrunch put it, “A quarter of startups in [incubator Y Combinator’s] current cohort have codebases that are almost entirely AI-generated.”
Our customer took the plunge in the second half of 2024, making a popular GenAI coding assistant available to its software engineers. The company was naturally interested in whether or not this was a good thing to have done.
Was the GenAI coding assistant a good investment? Did it make engineers more productive? Did it affect all engineers equally? Did it affect all relevant performance measures equally?
One way of determining the ROI of AI
These are all important questions, but they’re surprisingly difficult to answer outside a lab. Inside a lab, you’d tackle them through a series of experiments. For example, you’d randomly divide the engineers into two groups, give the GenAI to one group but not the other, give both groups a set of coding exercises, and then compare the results. Then do more experiments in the same vein.
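To make the logic of that ideal experiment concrete, here is a minimal toy simulation in Python (the number of engineers, the scores, and the effect size are all made up purely for illustration). Because assignment is random, a simple difference in group means recovers the treatment effect:

```python
import numpy as np

# A toy simulation of the ideal (but impractical) lab experiment: random assignment
# plus a simple difference in mean exercise scores. All numbers here are invented.
rng = np.random.default_rng(0)

n = 200                                # hypothetical number of engineers
assigned_genai = rng.random(n) < 0.5   # random assignment: roughly half get the assistant

# Simulated exercise scores: same skill distribution in both groups (thanks to
# randomization), plus a bump for the treated group if the tool actually helps.
true_effect = 5.0
scores = rng.normal(loc=60, scale=10, size=n) + true_effect * assigned_genai

estimated_effect = scores[assigned_genai].mean() - scores[~assigned_genai].mean()
print(f"Estimated treatment effect: {estimated_effect:.1f} points (true value: {true_effect})")
```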
This approach would eventually give our customer many of the answers they were looking for, but I have yet to come across a company that has been willing to tell its entire software development team to stop their work, head off to a lab, and undergo experiments for 6-12 months. There’s a strong feeling among executives that while doing so might allow them to learn a lot about the efficacy of GenAI for coding, it would also drive them out of business. We can bemoan how the grubby realities of capitalism impede the pursuit of knowledge, but here we are.
What companies do instead is leave their engineers in their jobs and measure their performance in situ. They know that this is suboptimal, and that it’s hard to separate signal from noise and causation from correlation in the messy real world. But when you’re trying to run a business, them’s the breaks. You have to work with what you have.1
A natural way to begin that work is with some comparisons. Are engineers who use AI more productive than those who don’t? Two factors made this comparison possible for our customer. First, not all of its engineers started using the GenAI coding assistant once it became available, so there was some variation to exploit.
Second, there are some readily available measures of coding productivity. It’s true that software development productivity is notoriously hard to measure. We’ve been arguing about it as long as we’ve been writing software. But lots of software development shops today keep track of pull requests, merges, and the time elapsed between the two (“time to merge,” or TTM). In other words, they keep track of how many new chunks of coding work someone started in a given time period, how many completed chunks they turned in, and how many hours elapsed between starting and finishing a chunk.
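For the mechanically inclined, here is a rough sketch of that bookkeeping in Python with pandas. The table layout and column names are hypothetical, not our customer's actual schema; the point is only that the per-engineer metrics and the naive group comparison are a few lines of aggregation:

```python
import pandas as pd

# Hypothetical pull-request log: one row per PR, with open/merge timestamps and a
# flag for whether the engineer had adopted the GenAI coding assistant.
prs = pd.DataFrame({
    "engineer":   ["a", "a", "b", "b", "c"],
    "uses_genai": [True, True, False, False, True],
    "opened_at":  pd.to_datetime(["2024-09-01", "2024-09-03", "2024-09-02", "2024-09-10", "2024-09-05"]),
    "merged_at":  pd.to_datetime(["2024-09-02", "2024-09-06", "2024-09-08", None, "2024-09-06"]),
})

# Time to merge, in hours (PRs that never merged stay NaN).
prs["ttm_hours"] = (prs["merged_at"] - prs["opened_at"]).dt.total_seconds() / 3600

per_engineer = prs.groupby(["engineer", "uses_genai"]).agg(
    pr_count=("opened_at", "count"),       # chunks of work started
    merge_count=("merged_at", "count"),    # chunks completed (unmerged PRs excluded)
    avg_ttm_hours=("ttm_hours", "mean"),   # average time to merge
).reset_index()

# The naive comparison: average the per-engineer metrics by GenAI use.
naive_comparison = per_engineer.groupby("uses_genai")[["pr_count", "merge_count", "avg_ttm_hours"]].mean()
print(naive_comparison)
```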
So great, let’s start there. Here’s what the data from our customer showed about the differences in pull request counts, merge counts, and average TTM between engineers who used GenAI and those who didn’t:
This initial comparison is encouraging! More work is needed and yadda yadda, but this looks like a big 40+% performance boost in pull request and merge rates from using AI.
Why that was the wrong way
The existence of this post is a strong signal that there is a problem with that conclusion. And there is. More broadly, there are serious problems with lots of the analyses businesses have been conducting on messy real-world data to understand causality — to understand if one thing (like adopting GenAI for software development) caused another (like higher software development productivity) to happen.
Economist Edward Leamer discussed these problems in his infamous 1983 paper “Let’s Take the Con Out of Econometrics,” published in the March edition of the American Economic Review. They don’t write ‘em like that any more, especially in places like the AER, which seem to have adopted in recent decades a misguided policy of not letting their authors cook.2 Back in ‘83 the AER let Leamer roast his own discipline, calling it out for sloppy thinking and shoddy methodologies.
“Let’s Take the Con Out of Econometrics” joined a wave of papers pointing out that empirical economics was not doing a great job at using data to figure out what caused what — “causal inference,” in other words. Causal inference is one of the main reasons to do empirical economics, so these papers were highlighting something of a problem for the field.
A good way to get causal inference wrong is to start with the wrong assumptions. To see this, let’s return to our customer. In their situation, a lot of companies would assess the ROI of their AI by comparing the productivity of GenAI-using software engineers with the productivity of non-users. In other words, they’d do the comparisons graphed above.
But this causal inference strategy rests on a big, bad assumption. That assumption is that the only relevant difference between the two groups of engineers is that the people in one group decided to use GenAI, while the people in the other didn’t. In other words, the assumption is that the way engineers wound up with GenAI at the company was as if they had been randomly assigned to get GenAI or not.
If that assumption holds, then the simple comparisons shown above actually do tell you a lot about causality. If you take a bunch of overweight people and randomly assign half of them to get Ozempic (putting them in what’s called the treatment group), then find that on average those people lost a great deal more weight over time than people in the other group (the control group) did, you’ve got good reason to infer that the treatment worked — it caused the weight loss — and that you’ve got a blockbuster drug on your hands.3
So, in our customer’s case, was assignment into the treatment group effectively random? Nope. Not even close. To see this, let’s look at performance differences between the two groups not just during the time after GenAI was introduced, but also before:
So much for that assumption. As a group, the engineers who started using GenAI when it became available were very different from the ones who didn’t. To make matters worse, the two groups were different in exactly the area that we care about: productivity. On average, the engineers who reached for the newly available GenAI had already been doing significantly more PRs and merges than those who didn’t, before GenAI ever appeared on the scene.
Why were these engineers more productive? And why were they more likely to start using GenAI — why did they self-select into using the snazzy new technology? We don’t know yet. These are interesting and important questions, and we want to investigate them. At this point, all we know with confidence is that the GenAI-using engineers were more productive both before and after GenAI showed up.
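If you want to run this diagnostic on your own data, it amounts to repeating the comparison on the period before the tool existed. A minimal sketch, again with invented names and numbers:

```python
import pandas as pd

# Hypothetical engineer-by-period observations: monthly PR counts, whether the
# engineer later adopted the GenAI assistant, and whether the row is from before
# or after the assistant became available.
obs = pd.DataFrame({
    "engineer": ["a", "b", "c", "d", "a", "b", "c", "d"],
    "adopter":  [True, True, False, False, True, True, False, False],
    "period":   ["pre", "pre", "pre", "pre", "post", "post", "post", "post"],
    "prs":      [12, 10, 7, 8, 17, 15, 8, 8],
})

# The check: compare eventual adopters and non-adopters in the *pre* period only.
pre_gap = obs.loc[obs["period"] == "pre"].groupby("adopter")["prs"].mean()
print(pre_gap)  # a sizable gap here means assignment was not as-if random
```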
We also don’t yet know the answer to the question we started with: did GenAI make software engineers at our customer more productive? Given that we can’t stick them all in a lab and do experiments on them, and given that they did not randomly (and conveniently for us) assign themselves to using GenAI or not, can we get any reliable insight into this question with the data we have and the situation we have? Can we do solid causal inference?
We can. We can make progress on taking the con out of econometrics in this case.
Here’s a better way
The problem confronting our customer is one of the classic problems in econometrics: separating treatment effects from selection effects. Treatment effects are what we’re actually interested in: how much did GenAI affect engineers’ performance? (Just like Novo Nordisk and the FDA are interested in the treatment effect of Ozempic on obesity.) But as we’ve just seen, there’s also a big selection effect at our customer: skilled engineers disproportionately self-selected into using GenAI. So when we compare the performance of GenAI users to non-users, what we’re picking up, at least in part, is the difference in skill between those two groups.
But all is not lost, and the situation is not hopeless. We have tools to disentangle selection and treatment effects. One that’s used a lot in modern econometrics is the counterfactual — a statistical what-if scenario that lets us compare what actually happened to an alternative. I think of counterfactuals as nearby parallel universes where things are only slightly different.
In this case, the parallel universe we want to peer into is one where everything is exactly the same as in this one, except that our customer never adopts GenAI. We’re going to gather the identical data in that universe that we did in this one and compare the two datasets. We’re especially interested in the performance of the GenAI-adopting engineers in that other universe vs. in this one.
I hope you’re now saying to your screen, “Wait. Didn’t you just tell me that those engineers didn’t ever get GenAI in that other universe?” That’s right, and that’s exactly the point. In our universe, we’ve got a group of skilled engineers who started using GenAI at one point in time.4 In the counterfactual universe, we’ve identified exactly that same group of people, and we can watch their performance over exactly the same timespan, but with no GenAI use at any point. We can therefore infer that any differences observed between these two groups have to be caused by AI use, since that’s the only difference between them.
I won’t go into the details about how to create that counterfactual parallel universe. It takes a fair amount of economics and statistics education and experience, a bunch of data, and a few assumptions. But it’s possible, and it allows us to peer into many parallel universes we’re interested in. So what do we see when we peer into the one we’re interested in here? Let’s take a look:
In each graph, data from our parallel universe (the counterfactual) are shown in the left-hand column. The little barbell stuck on top of that column is the confidence interval for our estimate of what happened in the parallel universe. Even with great data and our most sophisticated statistical bag of tricks, we don’t have a perfect view into that universe, so we need to express our uncertainty about what we see there. Hence the confidence interval, which widens a point estimate — a single number — into a range. In this case, it’s a 90% confidence interval, meaning that we’re 90% sure that what’s really happening in the parallel universe falls within the range indicated by the top and bottom of the barbell.
The comparisons shown in the graphs above are the comparisons we’re really interested in. They show the estimated effect of using GenAI (the treatment effect) as opposed to showing the combined effects of using GenAI and being the kind of engineer who’s an early adopter of GenAI (the treatment effect plus the selection effect).
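For readers who want a rough feel for how an estimate like that can be produced, here is a deliberately simplified sketch of one common credibility-revolution design, difference-in-differences, plus a bootstrap 90% confidence interval. It is an illustration under toy assumptions, not necessarily the method used in this analysis, and every name and number in it is invented:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical engineer-by-period panel of monthly merge counts. Adopters start out
# more productive (a selection effect of +3), everyone drifts up a little over time,
# and the tool itself adds a true treatment effect of +1.5 merges per month.
n = 60  # engineers per group
panel = pd.DataFrame({
    "adopter": [True] * n * 2 + [False] * n * 2,
    "post":    ([False] * n + [True] * n) * 2,
})
panel["merges"] = (
    8.0
    + 3.0 * panel["adopter"]
    + 0.5 * panel["post"]
    + 1.5 * (panel["adopter"] & panel["post"])
    + rng.normal(0, 2, len(panel))
)

def did_estimate(df: pd.DataFrame) -> float:
    """(Adopters' post-minus-pre change) minus (non-adopters' post-minus-pre change)."""
    m = df.groupby(["adopter", "post"])["merges"].mean()
    return (m.loc[(True, True)] - m.loc[(True, False)]) - (m.loc[(False, True)] - m.loc[(False, False)])

point = did_estimate(panel)

# A simple bootstrap for the 90% confidence interval: resample rows with replacement
# and re-estimate. (A real analysis would cluster by engineer; this is a toy.)
boot = [did_estimate(panel.sample(frac=1.0, replace=True, random_state=s)) for s in range(500)]
lo, hi = np.percentile(boot, [5, 95])
print(f"Estimated treatment effect: {point:.2f} merges/month (90% CI: {lo:.2f} to {hi:.2f})")
```

In this toy, the naive post-period comparison between adopters and non-adopters shows a gap of roughly 4.5 merges per month, while the difference-in-differences estimate lands close to the true 1.5; the gap between those two numbers is the selection effect.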
Look at the first two columns above, and compare them to the same columns in this post’s top graph. The difference between the two visualizations tells us something both intuitive and important. What’s intuitive is that treatment effects alone are smaller than treatment effects and selection effects combined (as is often the case). The top graph shows massive increases in PRs and merges; the lower graph — which isolates the effect of GenAI use — shows much smaller increases.
What’s important is the size of the treatment effect. It’s critical to isolate this effect for obvious reasons: knowing how much GenAI affects productivity helps with planning, hiring, and training, and also informs how much you’re willing to pay for the GenAI (which, in turn, informs your negotiations with the vendor of the software).5
It would be great to present to your boss or your board that your brilliant decision to adopt GenAI led to a 40+% productivity boost in less than half a year. It would be lousy (in all kinds of ways) for you to make that presentation when the actual productivity boost was something like 12%. So if conducting or presenting AI ROI analyses is part of your job, you should work with us. You should work with Workhelix, because we’re bringing the credibility revolution from economics to the enterprise.
You can't fool children of the revolution
Selection effects, self-selection, treatment effects, counterfactuals, and confidence intervals: we’ve already made a good start on understanding the vocabulary of the “credibility revolution” in economics. This movement, which started in the late 20th century and blossomed in the 21st, is a big deal - big enough to have merited two batches of Nobel Prizes in recent years.6
The credibility revolution is a sea change in our ability to understand causality. It’s a big bag of research designs and statistical methods that when judiciously applied lets us make confident statements about what caused what. Within economics, it’s been applied to a huge range of important questions:
Does going to Harvard make you rich? Not usually, no. Most of the income acceleration we see from Harvard graduates is a classic selection effect. These folks were going to be rich no matter where they went to school.7
How bad is air pollution for our health? Really bad. Even though our air is much cleaner than it was a generation ago, research on the effects of modern rich-world pollution levels consistently shows how harmful they are.
Are police body cams effective? Generally yes, especially when “officers’ discretion to turn cameras on or off is minimized.”
Do experts or algorithms make better decisions? Algorithms, usually. In fact, humans override algorithms too often, leading to worse outcomes than if we had just relied on the algorithms alone.
It is a very, very small leap from the questions above to the questions below, which are top of mind for a lot of executives and boards these days:
Do GenAI coding assistants make software engineers more productive? At our customer, they did.
Can a materials science lab become more innovative by giving its scientists access to a “large materials model”? Yep.
Do customers like it when GenAI starts suggesting responses to customer service reps? Do the reps themselves like it? Yes and apparently so, since churn goes down.
What’s the ROI of internal LLMs? Stay tuned on this one… ;)
Over the past year, I’ve asked lots of executives at lots of companies if they’ve ever heard of the credibility revolution. The only “yes” I’ve received wasn’t a surprise, since it came from a CEO who I knew had a PhD in economics. My straw poll indicates that the credibility revolution hasn’t yet made it from economics to the enterprise. Which opens up a huge opportunity.
Every ROI analysis is an exercise in causal inference. It’s an attempt to determine the changes caused by an investment. But today, most companies’ ROI analyses don’t use the tools of the credibility revolution; they instead rely on simple aggregates, averages, and comparisons.
As I hope I’ve shown above, it’s easy for pre-cred rev analyses to be way off. In our customer’s case, a comparison that didn’t separate treatment and selection effects would have massively overstated GenAI’s benefits. And it could have been worse. It’s pretty common for pre-cred rev analyses to “get the sign wrong” — to conclude that a treatment had a negative impact when it actually had a positive one (or vice versa).
This is an excerpt of a longer post Andrew published here.
He’s shown us how advanced econometrics and careful causal inference can help separate hype from actual impact. As we move further into the generative AI era, I suspect this “credibility revolution” will become a superpower for leaders who want real answers—rather than best guesses—about ROI.
Let us know what resonates for you, and please share any examples from your own experiments. And if you have questions for Andrew about his approach, comment below.
Of course, you can and should conduct experiments while still conducting business as usual. We’ll talk more here later about how to do this so that you both a) learn something valuable and b) don’t screw up the business.
One lone sentence from Leamer suffices to convey the losses of both style and insight brought on by this policy: “methodology, like sex, is better demonstrated than discussed, though often better anticipated than experienced.”
Your confidence goes up a lot if the experiment is double-blind: if neither the patients nor the people giving them the drug know whether the patients are getting an actual medicine or a placebo.
This selection effect is interesting, isn’t it? The fact that skilled engineers at this organization were heavier and quicker users of GenAI is intriguing evidence that the tool is in fact a productivity booster. After all, experienced professionals rarely use useless tools when they don’t have to.
Isn’t it also interesting that in this case GenAI did not (yet) cause a big reduction in TTM? Fruitful area for some further investigation, I’d say. I’m curious to learn what’s keeping these times from going down. I bet most customers would be, too.
David Card, Josh Angrist, and Guido Imbens won the Nobel in 2021 for their contributions to observational causal inference - using all kinds of clever techniques to look at historical data and figure out what caused what. Abhijit Banerjee, Esther Duflo, and Michael Kremer won in 2019 for their experimental work, which popularized the use of randomized controlled trials for causal inference. The trio were especially interested in determining which interventions actually reduce poverty.
There are exceptions here. For some demographic groups, it looks like there is a treatment effect from attending Harvard, probably because of the network of people you form there. As a former Harvard professor, it pains me to acknowledge that the treatment effect of the instruction students get from us faculty is… hard to see in the data.