🔮 AI’s diminishing marginal returns; Yellen’s time machine; TikTok boom; cycling, beauty & vintage phones++ #469
Hi, I’m Azeem Azhar. In this week’s edition, we explore why data is not all we need to take AI to the next level of competency.
And in the rest of today’s issue:
Need to know: Winner takes all
The UK’s Competition Authority uncovers over 90 interconnected investments that could allow tech giants to shape AI in their own interests.
Today in data: India’s power
India is building the world’s largest renewable energy park, with capacity to power 18 million homes.
Opinion: GenAI is scaling at work
Four key challenges will define how organisations leverage AI in 2024.
We ask you: How are we doing?
Respond to our nine-question survey for a chance to win a book bundle of my five favourite books of all time.
Sunday chart: Data is all you need. Or is it?
AI progress over the past few years has been defined by LLMs and their incredible use of compute and data. This led to a consensus, which has delivered results so far, that more data means more progress — prompting AI companies to hoover up data in all sorts of ways (ethically and unethically), as reported in The New York Times and Reuters this week.
Yet it appears that this chase, at least in relation to LLMs, may be futile. Udandarao et al. show that linear improvements in model performance, particularly on rarer concepts, require exponentially more data. At some point, chasing model gains through scale alone becomes economically infeasible. Diminishing marginal returns strike again (see the sketch below).
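To make that arithmetic concrete, here is a minimal sketch in Python. The log-linear relationship and its coefficients are illustrative assumptions, not figures from the paper; the point is only that under such a law, every fixed step in performance costs a constant multiple of data.

```python
# Illustrative assumption: performance = a + b * log10(n_examples).
# The coefficients below are invented for the example.
a, b = 0.20, 0.08

def examples_needed(target: float) -> float:
    """Invert the assumed scaling law: data required for a target score."""
    return 10 ** ((target - a) / b)

# Each +0.08 step in performance costs 10x more data.
for perf in (0.60, 0.68, 0.76, 0.84):
    print(f"performance {perf:.2f} -> ~{examples_needed(perf):,.0f} examples")
```

Linear gains, exponential bills: that is the shape of the wall Udandarao et al. describe.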
So what could be next? Yann LeCun, a long-time critic of LLMs as the path to human-level intelligence, is an advocate for the approach he calls “objective-driven AI”, built to fulfil specific goals set by humans. LeCun argues that systems that learn about the physical world through sensors and video data rather than relying on pure text can build a “world model” to predict outcomes and plan tasks — and overcome the limitations of LLMs that lack real-world understanding.
These approaches will still likely require lots of data. LeCun says a four-year-old child has already seen 50 times more data than the largest LLM. The crucial difference lies in the nature of data — children learn in the tangible, physical world, whereas machines are trained primarily on text. The architecture processing this data (the brain) remains an enigma but is far removed from the current capabilities of LLMs, despite their sophistication.
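LeCun’s 50-times figure is a back-of-envelope estimate. Here is a sketch of the arithmetic as he has presented it publicly; the bandwidth and token counts are his rough numbers, not measurements:

```python
# Back-of-envelope after LeCun's public estimate; all numbers are rough.
llm_bytes = 1e13 * 2                 # ~1e13 training tokens at ~2 bytes each
child_bytes = 16_000 * 3_600 * 20e6  # ~16,000 waking hours by age four,
                                     # optic nerve carrying ~20 MB/s

print(f"LLM:   {llm_bytes:.1e} bytes")            # 2.0e+13
print(f"Child: {child_bytes:.1e} bytes")          # 1.2e+15
print(f"Ratio: ~{child_bytes / llm_bytes:.0f}x")  # ~58x, i.e. roughly 50x
```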
New architectural breakthroughs are likely needed to build AI systems capable of robust causal reasoning, planning and world understanding. We will need more than just data to get there.
See also: Felin & Holweg argue that despite AI’s impressive performance in cognitive tasks, human cognition’s unique ability to theorise, identify new data and problem-solve through causal logic makes it irreplaceable in strategic decision-making.
Are you looking for AI expertise you can trust?
Linear thinking no longer applies in our increasingly complex world. You need to think like an exponentialist. I can help elevate your organisation’s thinking and equip you with the frameworks and methodologies to prepare for what the near future of AI and exponential technologies holds.
Key reads
Big tech’s favourite ABBA song. The UK’s Competition and Markets Authority (CMA) is sounding the alarm over big tech’s tightening grip on the AI market. The CMA has uncovered over 90 interconnected investments that could allow tech giants to shape AI in their own interests, potentially stifling competition. The risk is obvious given how previous digital markets have turned out. EV reader Tim O’Reilly and colleagues argue that Amazon extracts “rents” (super-normal profits) from users by pushing paid sponsors to the top of search results even when they are not the best quality or the best value: Amazon’s sponsored product ads are on average 17% more expensive and 33% lower quality than the top organic search results. The relentless focus on profits has compromised quality in digital markets, and we must remain vigilant to ensure AI does not succumb to the same monopolistic “winner takes it all” outcomes that skewed markets, competition and access in the tech industry.
Crippling progress. King Cnut’s flattering advisors told him he could turn back the tide. Now Biden’s treasury vizier, Janet Yellen, is trying to turn back 200 years of economic wisdom, argues David Fickling. Yellen has vowed to protect the US from Chinese clean-energy “overproduction”, a mercantilist approach at odds with David Ricardo’s long-standing theory of comparative advantage. It’s probably bad economics. But it could be great politics. Laggard nations have often protected nascent industries from advanced foreign competition, and given how far the US has fallen behind in scaling climate-tech manufacturing, perhaps the suggestion has legs. On a global scale, though, this approach does not make sense. As Fickling points out, clean-energy manufacturing capacity is not excessive; production falls far short of what is needed to address climate change. The US should be embracing Chinese exports where it can. Dirt-cheap solar panels are great news for the US economy, since an electrified economy will be a better one. The US can then concentrate (as trade theory says it should) on doing the things it does better than others.
Summarise this paper, please. This paper presents FABLES, the first large-scale human evaluation of faithfulness and content selection in book-length summarisation using LLMs. The authors collected 3,158 claim-level faithfulness annotations from LLM-generated summaries of 26 recently published books. They found that Claude 3 Opus was the most faithful summariser, followed by GPT-4 Turbo. However, the authors also discovered that LLM auto-raters struggled to reliably detect unfaithful claims, even when prompted with the full book text. Most unfaithful claims related to states and events, often requiring reasoning over extended contexts to detect. The authors also identified common content-selection errors in the LLM summaries, such as omissions of key events, attributes and characters, and over-emphasis of content from the end of books.
The summary you just read in the paragraph above was generated by Claude 3. An F-measure is a way of understanding how well a process performs in terms of precision and recall.1 Claude 3 achieved an F-measure of 0.475. A 2020 study looked at how well humans could classify scientific research abstracts (not an exact match to the FABLES challenge). It found that undergraduate students had an average F-measure of roughly 0.45 on that task, although the very best achieved roughly 0.7. So, while concerns over the faithfulness of LLM summaries are warranted, they don’t diminish their relative utility compared with the alternatives. In FABLES, each human annotator producing robust, evidenced summaries took more than 11 hours per book. Sometimes we won’t be able to afford 11 hours. Sometimes graduate-quality summaries will satisfice. Perhaps other strategies, such as chunking long texts or ensembling summaries from several models, could provide even more reliable outputs from LLM-enabled systems; a sketch of the chunking idea follows below. When absolute precision and perfect recall are required, for example in legal contexts, it might not be wise to rely on a single LLM or, indeed, a single human.
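As promised, a minimal sketch of the chunk-then-combine approach. The summarise function is a stand-in for any LLM call, and the chunk size and merge step are illustrative assumptions, not the FABLES authors’ method:

```python
def summarise(text: str) -> str:
    """Stand-in for an LLM summarisation call; here it keeps the first sentence."""
    return text.split(". ")[0].strip() + "."

def chunk(text: str, max_chars: int = 12_000) -> list[str]:
    """Split a long text into roughly chapter-sized pieces."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def summarise_book(book: str) -> str:
    """Map-reduce style: summarise each chunk, then summarise the summaries."""
    partials = [summarise(piece) for piece in chunk(book)]
    return summarise("\n\n".join(partials))
```

Ensembling works at one remove: generate several independent summaries from different models and have a final pass reconcile them.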
Newsreel
The European Court of Human Rights rules that Switzerland violated human rights by failing to adequately combat climate change.
Nobel-winning physicist Peter Higgs, who theorised the existence of the elementary particle that gives mass to other fundamental particles, has passed away aged 94.
It’s been a week of model releases. Google’s Gemini 1.5 Pro, OpenAI’s GPT-4 Turbo with Vision, Mistral’s Mixtral 8x22B and Stability’s 12B parameter Stable LM 2. Meta is planning to release its next open-source model Llama 3 next month.
British research agency ARIA unveils a plan for AI “gatekeepers” to provide quantitative safety guarantees for high-risk AI applications.
Big tech companies including Microsoft, Google, IBM, Cisco and others are forming a group to study how artificial intelligence might impact technology jobs.
Data
TikTok owner ByteDance saw profits jump 60% to over $40 billion in 2023, surpassing rivals Tencent and Alibaba, as revenue neared $120 billion. ByteDance is growing faster than Meta with a similar level of revenue.
X’s active US user base has declined 23% year-on-year, to 27 million.
The world’s largest renewable energy park is in India and will have 30 GW of clean-energy capacity.
The total solar eclipse earlier this week caused significant drops in internet traffic, with some areas experiencing decreases of up to 60%.
Between 1970 and 2022, growing human demand for food, fuel and cleared land reduced the combined populations of other vertebrate species by 69%.
Short morsels to appear smart at dinner parties
🚴🏽‍♀️ Cyclists now outnumber motorists in Paris.
🦾 Researchers trained a model to control a humanoid robot by predicting its next move, much as LLMs predict the next word. After training on 27 hours of data, the robot could generalise to commands it hadn’t been taught. via EV reader
📣 Reddit, now publicly traded, is community-sourcing questions from its users for its upcoming earnings call.
💨 E-rickshaw companies are leading the electrification of transport in India, leaving traditional EVs in the dust.
😜 Research finds little evidence that professional success is linked to “beauty”.
🤙🏽 The dumbphone business is thriving.
🕺🏽 A brilliant history of the origins of disco. It sheds light on the persistence of the tribal affiliations we now see towards, and within, tech. (It’s a podcast.)
End note
I ditched Instagram a few weeks ago. It was proving to be too much of a time suck and feeding a little anxiety. I’ve lost unending pages of cats (and occasionally dogs) doing crazy things; any number of health coaches demonstrating hip-opening programmes or variations on the pull-up; and, worst of all, access to a flow of new music from DJs around the world. Yes, it’s problematic that 10 minutes can turn into 60. But even if you stick to 10, Instagram will create work for you. If I found something useful, like an improved form for stretching my calves, I would need to learn and practise that exercise. Was it worth introducing that new exercise rather than progressing with what I already do and working with my physio? My capacity to usefully absorb the firehose of interesting and useful things is limited by my being, you know, human.
So, I’ve lost the videos of Bengal kittens playing with each other. But I’ve gained lots of time, most of which has gone into reading essays and books. My Oura ring also tells me it’s paying dividends. My sleep scores have increased from the high 70s to the mid-90s since I took Instagram off the phone.
Cheers,
A
What you’re up to — community updates
Watch the recording of our community session on the state of crypto with co-founder and CEO of Superchain, James Corbett. The transcript is available here.
Workhelix, a task-based tool for identifying enterprise opportunities for genAI, has been co-launched by a member of the EV community.
Andrew Ng has joined the board of Amazon.
Cedric Maloux has written about his experience working with ChatGPT, his “genius stoned co-developer”.
Gideon Lichfield and Karen Hao have launched the AI Spotlight Series in collaboration with the Pulitzer Center to provide free AI-reporting training for 1,000 journalists globally over two years.
Joel Hellermark, founder and CEO of Sana, has published the company’s biggest engineering lessons from building Sana AI — an AI assistant for work powered by Sana’s state-of-the-art retrieval and agent solution R-4.
Husayn Kassai’s startup Onfido has been acquired, setting a record for the highest return on investment for an Oxford-student-led company.
Share your updates with EV readers by telling us what you’re up to here.
Precision measures the proportion of relevant results amongst all the results delivered by the system. Recall measures the proportion of relevant results retrieved amongst all the relevant results that exist. The F-measure, or F1 score, is the harmonic mean of precision and recall, and is a single number one can use to measure a system’s performance. When I used to build retrieval and classification systems, the F1 measure was a key indicator we would seek to improve. But you can already see the degree of nuance required: do you improve recall or precision when, as was usually the case, there is a trade-off between the two? Will you accept certain classes of errors more than others? Was it worse to have Type 1 errors or Type 2 errors? Could you justify the additional investment in improving F1 given the likely use of the system by customers?
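For completeness, a minimal sketch of the calculation; the formula is standard and the example numbers are invented:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A system that finds 60% of the relevant items (recall) and is right
# 40% of the time when it flags one (precision) scores:
print(round(f1(0.4, 0.6), 3))  # 0.48
```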