Exponential View

By Azeem Azhar, an expert on artificial intelligence and exponential technologies.
🔮 The paradox of GPT-5

2025-08-15 01:47:33

The Financial Times quoted me last week calling GPT-5 “evolutionary rather than revolutionary.” Then it twisted the knife: “Release of eagerly awaited system upgrade has been met with a mixed response, with some users calling the gains ‘modest.’”

“Modest” is one of those quietly loaded words. The Oxford English Dictionary defines it as ‘relatively moderate, limited or small’. ‘Relatively’ does a lot of work there. Compared with GPT-4 in March 2023, GPT-5 is a huge leap. Compared with bleeding-edge models released just months ago, it feels incremental. And compared with the sci-fi hopes, it’s restrained.

Let’s inject some sense and sensibility into this modesty debate.

The funny thing is, GPT‑5 does what no model before it could, yet in the same breath makes its own shortcomings impossible to ignore.

Over the past week of using GPT‑5, I’ve been tracking these tensions. Today, I’ll break down the five paradoxes that define GPT-5’s release and help explain why so many people find it confusing.

These five paradoxes show how this can be the most capable model so far, yet still earn that stubborn label: ‘modest’.


1. The moving-goalposts paradox

The smarter AI gets at our chosen benchmarks, the less we treat those benchmarks as proof of intelligence.

We measure machine intelligence through goalposts – tests, benchmarks and milestones that promise to tell us when a system has crossed from ‘mere software’ into something more. Sometimes these are symbolic challenges, like beating a human at chess or passing the Turing test. Other times they are technical benchmarks: scoring highly on standardised exams, solving logic puzzles or writing code1.

These goalposts serve two purposes: they give researchers something to aim for and they give the rest of us a way to judge whether progress is real. But they are not fixed. The moment AI reaches one goalpost, we often decide it was never a real measure of intelligence after all.

The first goalpost to shift was the Turing test.

Proposed in 1950 by Alan Turing, the “imitation game” offered a practical way to sidestep the slippery question “can machines think?”. Instead of debating definitions, Turing suggested testing whether a machine could respond in conversation so convincingly that an evaluator could not reliably tell it from a human.

I propose to consider the question, ‘Can machines think?’ This should begin with definitions of the meaning of the terms ‘machine’ and ‘think.’ The definitions might be framed so as to reflect so far as possible the normal use of the words, but this attitude is dangerous. If the meaning of the words ‘machine’ and ‘think’ are to be found by examining how they are commonly used, it is difficult to escape the conclusion that the meaning and the answer to the question ‘Can machines think?’ is to be sought in a statistical survey such as a Gallup poll. But this is absurd. Instead of attempting such a definition I shall replace the question by another, which is closely related to it and is expressed in relatively unambiguous words. The new form of the problem can be described in terms of a game which we call the ‘imitation game.’

Alan Turing. Source: computerhistory.org

For decades the test stood as the symbolic summit of AI achievement. Then, in June 2014 – four years before GPT‑1 – a chatbot named Eugene Goostman became the first to “pass”. Disguised as a 13‑year‑old Ukrainian boy, it fooled 33% of judges in a five‑minute exchange. Its strategy was theatrical misdirection: deflect tricky questions, lean on broken English and exploit the forgiving expectations we have of a teenager. As observed at the time:

The winners aren’t genuinely intelligent; instead, they tend to be more like parlor tricks, and they’re almost inherently deceitful. If a person asks a machine “How tall are you?” and the machine wants to win the Turing test, it has no choice but to confabulate. It has turned out, in fact, that the winners tend to use bluster and misdirection far more than anything approximating true intelligence.

Earlier this year, a paper claimed that GPT‑4.5 passed a more rigorous, three‑party Turing test, with judges rating it human in 73% of five‑minute conversations. Whether this counts as a pass is still contested. On the one hand, Turing’s test measured substitutability – how well a machine can stand in for a human – not genuine understanding. On the other, critics argued that short exchanges are too forgiving and that a meaningful pass would require longer, open‑ended dialogue.

But if we say that AI has passed the Turing test, what does that even mean? The victory feels hollow. Once systems beat the Turing test, we moved the bar: from conversation to formal benchmarks. LLMs went on to crush many of these, too. Yet the same pattern holds: the smarter the system, the less its achievements feel like proof of intelligence.

The benchmarks of the past.

Here the conversation shifts from tests we can name to a target we cannot agree on. Part of the problem is definitional: there is no consensus on what artificial general intelligence is. Is it matching human cognition across all domains, or being a flexible, self‑improving agent? Intelligence resists collapsing into a single score. Decades of IQ debates show that. Is AGI a universal problem‑solver, an architecture mirroring human thought, or a form of consciousness? With such a hazy target, success will always feel provisional.

Sam Altman now calls AGI ‘not a super useful term.’ I’ve long found the term problematic; it’s not an accurate descriptor of what LLMs are or their usefulness. Suppose a system were truly “intelligent” in the human sense. Couldn’t we train it only on knowledge up to Isaac Newton and watch it rediscover everything humanity has learned in the 300 years since? By that standard, GPT‑5 is nowhere close – and I did not expect it to be. Its goal was not raw knowledge accumulation, which arguably defined GPT-4’s leap. GPT-5’s focus was on action: better tool use and more agentic reasoning.

GPT-5 performs better on some benchmarks measuring agentic tasks and is on a par with others.2 And compared with the last generation, the raw jumps are striking: on GPQA Diamond (advanced science questions), GPT-4 scored 38.8%, GPT-5 scored 85.7%;3 on ARC-AGI-1, GPT-4o managed 4.5%, GPT-5 hit 65.7%.

Yet the wow factor is muted. Most people are not measuring GPT‑5 against GPT‑4 from March 2023. They are stacking it against o3 from just a few months ago. Frontier models arrive in rapid succession, the baseline shifts at speed and each breakthrough lands half‑forgotten. In that light, even a giant’s stride can feel like treading water.

2. The reliability paradox

As systems grow more reliable, their rare failures become less predictable and more jarring. Trust can stagnate – or even decline – despite falling error rates.

On paper, GPT‑5 should be more reliable than previous LLMs. OpenAI’s launch benchmarks suggest it hallucinates far less than o3, especially on conceptual and object‑level reasoning.

In my own use, hallucinations feel rarer. GPT‑5 Thinking aced a 51‑item, nine‑data‑point analysis I gave it and added a derivative analysis I had not asked for. Claude Opus 4.1, by contrast, miscounted the items and gave weaker recommendations. GPT‑5’s output took me 30 minutes to verify in Excel – not because it was wrong, but because the data format was awkward. Across simpler tasks, this is the pattern: more accurate, more often.

The problem is when it is wrong. During a recent trip to Tokyo, I asked GPT‑5 to name the city’s oldest Italian restaurant while standing under it. It named a different place, yet also knew the full history of the restaurant I was in when I prompted it harder. The same kind of jarring mistake popped up in OpenAI’s live demo, where GPT‑5 botched the Bernoulli effect. These errors are not frequent, but they are unpredictable, and that makes them dangerous.

Psychologists call this automation complacency: the more reliable a system is, the less closely we watch it, and the more likely rare errors are to slip through. With GPT‑4‑level error rates, I stayed alert for slip‑ups; with GPT‑5, I can feel myself letting my guard down. The brain’s ‘error detection’ system habituates, so vigilance drops.

This risk compounds in agentic workflows. Even with a 1% hallucination rate, a 25-step autonomous process has roughly a 22% chance of at least one major error. For enterprise use, that is still too high. Last week, AWS released Automated Reasoning Checks, a formal-verification safeguard that encodes domain rules into logic and mathematically tests AI outputs against them. They tout “up to 99% accuracy.” This will help, but it’s not the last word.
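To make that compounding concrete, here is a minimal sketch of the arithmetic – my own illustration, nothing to do with OpenAI's or AWS's tooling:

```python
# Probability that a multi-step agentic workflow hits at least one error,
# assuming independent steps and a fixed per-step error rate.

def workflow_error_probability(per_step_error_rate: float, steps: int) -> float:
    """Return P(at least one error) across `steps` independent steps."""
    return 1 - (1 - per_step_error_rate) ** steps

for steps in (5, 10, 25, 50):
    p = workflow_error_probability(0.01, steps)  # 1% hallucination rate per step
    print(f"{steps:>2} steps -> {p:.1%} chance of at least one error")
# 25 steps -> ~22.2%, the figure quoted above.
```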

Nevertheless, when mistakes are rarer yet less predictable, perceived reliability does not climb as much as the benchmarks suggest. That is why GPT‑5’s improved accuracy can still feel like a modest leap. The progress is real, but it does not fully translate into user confidence.

3. The benevolent-control paradox

The more capable the assistant, the more its “helpful” defaults shape our choices – turning empowerment into subtle control.

Read more

📈 Data to start your week

2025-08-11 21:09:40

Here’s your Monday round-up of data driving conversations this week — all in less than 250 words.



  1. Nvidia on top ↑ Nvidia now makes up nearly 8% of the S&P 500, the highest weighting of any stock in the index’s history.

  2. Market concentration ↑ The net income of the S&P 500’s ten largest companies has grown by ~180% since 2019, while the rest of the index grew by just ~45%.

  3. Cloud competition ↑ Microsoft Azure captured 44.5% of new cloud revenue in Q2, outpacing that of the current market leader AWS (30%).

Read more

🔮 Sunday edition #536: Agents over clicks. Youth on thin ice. When autocracy pays. Worlds you can walk through. A drug that slows the clock++

2025-08-10 10:09:58

“Always insightful and refreshingly free of an agenda other than the intellectual pursuit of knowledge in how tech shapes our world.” – Vincent, a paying member



Hi all,

Welcome to our Sunday edition, where we explore the latest developments, ideas, and questions shaping the exponential economy.

Enjoy the weekend reading!

Azeem


A new era of the Web

This week’s big news was the release of GPT-5 (my initial take here, quoted in the FT here), but something bigger is brewing beneath the surface: a fundamental transformation of how the Web itself operates.

A paper I read this week takes stock of the transition from the recommendation paradigm to the action paradigm – a shift toward an agentic Web with fundamentally different incentive structures.

Source: Yang, et al. “Agentic Web: Weaving the Next Web with AI Agents” 2025

Yang et al. outline three enablers of this change: intelligence, interaction and a nascent economy. The first two form a technical layer: agents that can reason and plan, and protocols that let them communicate. The intelligence pillar is progressing – models can handle longer tasks over time, although reliability remains a concern.

Source: METR

The interaction pillar has some firm foundations. Agents need a shared grammar to talk to websites, APIs and one another. Nascent protocols such as MCP (agents ⇄ tools) and A2A (agent ⇄ agent discovery) promise a common interaction layer. Security remains a worry (see the lethal trifecta), but protection is improving.
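To give a flavour of that shared grammar, here is a minimal sketch of what an MCP-style tool call can look like on the wire; the tool name and arguments are invented for illustration, and real MCP servers layer capability negotiation and schemas on top of this JSON-RPC core.

```python
import json

# A hypothetical MCP-style request: an agent asking a tool server to run a tool.
# MCP rides on JSON-RPC 2.0; the "office_space_search" tool and its arguments
# are invented purely for illustration.
tool_call = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "office_space_search",                         # assumed tool name
        "arguments": {"city": "Tokyo", "max_rent_usd": 20000},  # assumed schema
    },
}

print(json.dumps(tool_call, indent=2, ensure_ascii=False))
```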

The economy pillar is by far the least developed. Attention (and money) flowed through search links and social feeds in the previous era. Now, in some cases, AI answers are swallowing up as much as half that traffic. Publishers have responded by blocking crawlers or charging for access. Cloudflare introduced ‘pay-to-crawl’ gates. While it’s a good experiment, it has some flaws, as we discussed a few weeks ago:

A $0.01 crawl fee might sound small, but it’s ~20x more than the average revenue a human visit generates. It can get much worse: an AI might need 10 pages to answer a question, making it 200x more expensive. So realistically, will AI companies pay that much per page? Probably not. More likely, they’ll keep striking licensing deals or stick to scraping public-domain content.
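A quick back-of-the-envelope version of that arithmetic, using only the multiples quoted above (the per-visit revenue is implied by the ~20x figure, not a published number):

```python
# Back-of-the-envelope economics of pay-to-crawl, using the multiples above.
crawl_fee = 0.01                       # $ per page crawled
human_visit_revenue = crawl_fee / 20   # implied by the "~20x" comparison: ~$0.0005 per visit
pages_per_answer = 10                  # pages an AI might fetch to answer one question

cost_per_answer = crawl_fee * pages_per_answer
multiple_vs_human_visit = cost_per_answer / human_visit_revenue

print(f"Cost to answer one question: ${cost_per_answer:.2f}")
print(f"That is {multiple_vs_human_visit:.0f}x the revenue of one human visit")  # ~200x
```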

But the market will not settle until a standard primitive emerges. The paper envisions an Agent Attention market where the scarce resource is an agent’s choice of which API, tool, or external service to invoke when completing a task for a human user. And crucially…

Read more

🔮 Feel the AGI yet?

2025-08-08 20:28:05

OpenAI unveiled the much-delayed GPT-5 yesterday. I’m still away on a family holiday, but I wanted to share my early impressions here. They’ll evolve as we push the model.

Does the new release make us ‘feel the AGI’?

GPT-5 outperforms GPT-4 by a wide margin but shows only slight gains over o3, Anthropic’s Claude 4, and xAI’s Grok 4 on many tasks. In hands-on use, though, it behaves more agentically – taking initiative and stitching steps together more capably.

I’d characterize the new release as evolutionary rather than revolutionary. But we might still be surprised as new patterns emerge.

In today’s post, we’ll cover:

  • GPT-5 and the rise of personal micro-software

  • Hallucinations, factual accuracy and groundedness

  • What this release signals about OpenAI’s strategy and the competitive field

  • Thoughts on GPT-5 & AGI


But first, how GPT-5 came to be

To train the new model, OpenAI used o3 to generate training data, as Sebastien Bubeck explained:

We used OpenAI’s o3 to craft a high-quality synthetic curriculum to teach GPT-5 complex topics in a way that the raw web simply never could.

This could be repeated, using the previous generation to create synthetic training data for the next. So far, there is no hard limit to how many times this can be done. As long as each new teacher is stronger than the last and synthetic data is added to the original rather than replacing it, research suggests each generation of models should improve. Whether this can solve all the remaining problems – memory, hallucinations and context management – remains up in the air.
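Schematically, the loop looks something like the toy sketch below – the numbers are stand-ins for capability and the functions are placeholders, not OpenAI's actual pipeline:

```python
# Schematic of the bootstrapping loop: each generation's model writes a synthetic
# curriculum that helps train the next. Capabilities are stand-in numbers and the
# functions are placeholders – this is not OpenAI's pipeline.

def generate_synthetic_curriculum(teacher_capability: float) -> float:
    # The quality of the synthetic data is bounded by the teacher that wrote it.
    return 0.9 * teacher_capability

def train_on(original_data_quality: float, synthetic_quality: float) -> float:
    # Synthetic data is added to, not substituted for, the original corpus,
    # so the student learns from the best of both.
    return max(original_data_quality, synthetic_quality) + 0.2

original_web_quality = 1.0
model_capability = 1.0                # the first teacher (think: o3)
for generation in range(1, 5):        # GPT-5, GPT-6, ... in spirit
    curriculum = generate_synthetic_curriculum(model_capability)
    model_capability = train_on(original_web_quality, curriculum)
    print(f"Generation {generation}: capability ≈ {model_capability:.2f}")
# Capability rises each generation, with diminishing gains – the loop only
# works while each student ends up stronger than its teacher.
```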

An era of personal micro-software?

The AGI vibe comes from the agentic UX. It feels like the AI is doing things for you, rather than you telling it what to do. You give it a direction, and it’ll make most of the choices. This further lowers the barrier to creating the tools you need, without code or specialized knowledge. Take the Korean learning app my team built, inspired by the OpenAI demo: it asks for your preferences, then instantly produces a working flash-card app with real audio and a matching game. It’s the kind of personalized micro-software that anyone could soon create.

And personalized micro-software could be a really big deal. In just a couple of prompts, one user created a budgeting app personalized to his bank and preferred budgeting method.

I spoke about this recently with GitHub’s CEO Thomas Dohmke. He recounted Manus AI’s founders using their agent to spin up a single-purpose micro-app to scout Tokyo office space. GPT-5 could put that capability in everyone’s hands.

GPT-5 approaches tasks in a completely different way to previous systems. It pauses to think, uses a tool, reflects – perhaps tries a different tool – then acts again. It reasons through the process. Claude 4, by contrast, tends to move straight through, chaining tool calls together without that intermediate reflection.

As one observer notes:

GPT-5 doesn’t just use tools. It thinks with them. It builds with them.

Why choose?

Another major upgrade is that you no longer have to agonize over choosing models or micromanaging parts of the system to get the best output. GPT-5 decides how to answer, how much reasoning to do and which tools to use. This kind of model routing is already increasingly common in enterprises running fleets of LLMs. It significantly reduces cognitive load by shifting those decisions to the system.
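A stripped-down sketch of that routing idea – the signals, thresholds and model labels are all made up for illustration, not OpenAI's actual router:

```python
# A toy model router: decide how much reasoning (and which backend) a request gets.
# Signals, thresholds and model labels are illustrative, not OpenAI's actual router.

def estimate_difficulty(prompt: str) -> float:
    """Crude stand-in for a learned difficulty / intent classifier."""
    signals = ("prove", "step by step", "debug", "analyse", "plan")
    keyword_score = 0.25 * sum(word in prompt.lower() for word in signals)
    length_score = min(len(prompt), 2000) / 2000 * 0.5
    return min(1.0, keyword_score + length_score)

def route(prompt: str) -> str:
    difficulty = estimate_difficulty(prompt)
    if difficulty < 0.3:
        return "fast-small-model"              # cheap, low-latency answer
    if difficulty < 0.7:
        return "standard-model"                # default quality/cost trade-off
    return "deep-reasoning-model + tools"      # long reasoning, tool use allowed

print(route("What's the capital of Japan?"))                                     # fast-small-model
print(route("Plan and debug, step by step, a migration of our billing stack."))  # deep-reasoning-model + tools
```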

You no longer need to fuss over every detail in the prompt, either. You can give vague instructions and the system will figure out the details, often delivering more than you asked for. In the Korean app example, it added features I didn’t ask for and wouldn’t have thought of. Others noticed the same thing:

I used to prompt AI carefully… now I can just gesture vaguely at what I want.

While you can still steer the AI, OpenAI’s design makes deep, critical engagement feel optional – a potentially harmful direction. My son tested GPT-5 with a math problem. It got the right answer, as did GPT-4o and Gemini Pro 2.5 – but it hid its working. That’s not a habit to normalize. LLMs are more useful when they show their reasoning, so you can learn from them or check their work.

An MIT Media Lab pre-print found that participants who wrote essays with ChatGPT showed lower EEG activity in regions tied to executive control and produced more formulaic text. The default behavior might drift toward ever more uncritical prompting. We reflected on this recently, learning from our own mistakes.


A tighter grip on reality

For an agent to track and execute its plan, it must reason across thousands of words. GPT-5 leads by seven percentage points on a benchmark tracking this performance.1 Just as important, GPT-5 hallucinates roughly 6-8x less often than OpenAI o3, depending on the benchmark.2

However, new failure modes always have a way of appearing. A few moments in yesterday’s presentation rattled the confidence. One slide committed a glaring chart crime (apparently 52.8 > 69.1 = 30.8),3 and one workflow demo had GPT-5 return an incorrect theory on how plane wings generate lift. No doubt, it will not always cooperate. In our first attempt to build a web app with GPT-5, it kept insisting that we had to write the code ourselves. The issues we are used to will remain, but I expect to encounter them less often.

Spot the chart crime.

The benchmark we’re paying most attention to is METR’s ‘Measuring AI Ability to Complete Long Tasks’ benchmark. GPT-5 can now complete software tasks averaging 2 hours and 17 minutes in length with a 50% completion rate, up from 1 hour and 30 minutes for o3. But if you want an 80% completion rate, the maximum task length drops to tasks of about 25 minutes — only slightly longer than o3 and Claude Opus 4, which average around 20 minutes.
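One toy way to see why the 80% figure lags so far behind: assume success decays at a constant rate with task length (my simplification, not METR's methodology). Calibrated on the 2h17m 50% horizon, that would imply an 80% horizon of roughly 44 minutes, so the reported ~25 minutes means failures pile up even faster than constant decay would predict.

```python
import math

# Toy model: success probability decays exponentially with task length,
# S(t) = exp(-t / tau). An illustrative simplification, not METR's method.

t50 = 137                      # minutes: task length GPT-5 completes 50% of the time
tau = t50 / math.log(2)        # calibrate the decay constant from the 50% point

t80_implied = tau * math.log(1 / 0.8)
print(f"Implied 80% horizon under constant decay: {t80_implied:.0f} min")   # ~44 min
print("Reported 80% horizon: ~25 min – longer tasks fail faster than this toy model predicts")
```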

In other words, the ceiling has lifted more than the floor. If we want agents to take on longer, more complex tasks end-to-end, that floor will need to rise quickly (see EV#534 on this).

Still, the trajectory is promising. GPT-5 can stitch together small, vibe-coded software with striking accuracy and, dare I say, creativity. Watching it succeed at tasks it once fumbled is genuinely exciting.

Market share by giveaway

And that delight will soon reach everyone. About 700 million people use ChatGPT weekly, and all of them now have access to GPT-5. You don’t have to pay. There are rate limits, of course, but the offer feels extremely generous. It also means most consumers will be embedded in OpenAI’s models rather than its rivals’. This will entrench OpenAI’s consumer lead.

On the developer side, the API is competitively priced and undercuts other foundation-model providers. Claude 4 Opus, the closest rival in developer tasks, matches GPT-5 yet costs roughly 10x more. That’s a concern for Anthropic, given that 60% of its revenue comes from API products, half of which depend on developer tools like Cursor and GitHub Copilot. Even Cursor’s CEO says GPT-5 is the smartest coding assistant he has tested. Not good news for Anthropic.

More intriguing, from a strategy angle, is that OpenAI launched GPT-5 just days after releasing two frontier-pushing open-weight models.

Read more

📈 Data to start your week

2025-08-04 21:32:06

Hi all,

The much-loved Monday round-up of data is back. Short and sweet as always – the latest market indicators across AI and technology in less than 250 words.

If you like it, share it widely.


  1. InfrAI spend ↑ OpenAI plans to spend approximately $90 billion on server infrastructure between 2025 and 2027. Meanwhile, Alphabet, Microsoft, Amazon and Meta are set to spend nearly $400 billion on capital expenditure this year alone.

  2. Anthropic ↑ Their latest valuation is now $170 billion, 3x in just five months.

  3. AI summaries ↑ It is estimated that at least 13.5% of PubMed abstracts in 2024 were processed using LLMs.

  4. AI > Offices ↑ Data center construction spending in the US has more than doubled since ChatGPT’s launch.

Read more

🔮 Sunday edition #535: Generalist robots; AGI & debt; energy realism; AI talent wars, fertility math, attitudes++

2025-08-03 10:16:17

Hi all,

Welcome to our Sunday edition, where we explore the latest developments, ideas, and questions shaping the exponential economy.

Thanks for reading!

Azeem


We don’t need miracles

The debate over clean energy’s future is increasingly polarized over how far current solutions can take us. One camp argues that the transition is already stalling under the weight of political and economic constraints. The other believes in scaling newer, more ambitious technologies despite their high cost or unproven readiness. There is a pragmatic middle, however. My friend, as always, offers a grounded take. His model shows that as long as clean energy continues to outgrow overall energy demand by a few percentage points annually, fossil fuels will inevitably be squeezed out of the system.

The transition doesn’t depend on breakthroughs or miracles. It depends on compounding growth, year after year, where clean energy keeps expanding faster than demand. In 2000, much of the global energy system was either unelectrifiable or stuck in the “technical but uneconomic” zone. By 2025, a substantial share of final energy demand across buildings, industry and transport is economically electrifiable.

Source: Ember via X/ShanuMathew93
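A minimal sketch of that compounding squeeze, with round illustrative growth rates rather than the model's actual inputs:

```python
# Toy version of the "clean energy outgrows demand" squeeze.
# Growth rates and the starting share are illustrative, not the model's inputs.

total_demand = 100.0   # arbitrary units of final energy demand
clean = 20.0           # clean energy supplies 20% of it today (illustrative)

demand_growth = 0.02   # total demand grows 2% a year
clean_growth = 0.06    # clean energy grows a few points faster, at 6% a year

for year in range(0, 41, 5):
    fossil = total_demand - clean
    print(f"Year {year:>2}: clean share {clean / total_demand:5.1%}, fossil demand {fossil:6.1f}")
    for _ in range(5):
        total_demand *= 1 + demand_growth
        clean *= 1 + clean_growth
# Fossil demand keeps rising for a while, peaks, then gets squeezed out –
# no breakthrough required, just compounding.
```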

If the energy transition depends on compounding progress, China’s nuclear strategy is a lesson in what that compounding can deliver. The country has defied the global trend of rising nuclear costs. It now delivers reactors at about $2–$3 per watt – far less than recent US projects like Vogtle 3 and 4, which have hit up to $15 per watt. China scaled through standardized designs, local supply chains and a stable industrial policy. As a result, it could overtake the US in nuclear capacity by the early 2030s.


Token production = debt reduction?

Coatue Management argues that AGI could stabilize the US debt-to-GDP ratio1 around 100% by 2034. With artificial superintelligence, it might even fall to 80%. This is well below current projections of ~120–140%.

It’s an enticing vision. The full keynote by Philippe Laffont, the firm’s co-founder, is worth watching. But the forecast seems to assume a causal chain: more intelligence leads to more productivity, which lifts GDP and eases the debt burden2. Our economy, though, is anything but linear. If AGI’s productivity gains accrue primarily to capital, workers could see their incomes stagnate or decline even amid rapid economic growth. Research from the Philadelphia Fed shows that labor’s share of income has fallen since 2000 (see here and here). If that trend continues and corporate taxes remain porous, governments may struggle to raise sufficient revenue to fund public services or counter rising inequality. In that world, the cost of running an AI puts a cap on wages. The benefits of growth go to those who own the machines, not the workers they replace.
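To see the mechanics Coatue is leaning on, here is a toy version of the standard debt-dynamics accounting; every parameter is a placeholder of mine, since we don't have their underlying model:

```python
# Toy debt-to-GDP dynamics: d_next = d * (1 + r) / (1 + g) + primary_deficit.
# Every parameter below is a placeholder to show the mechanism; we don't have
# Coatue's underlying model.

def project_debt_ratio(d0: float, r: float, g: float, primary_deficit: float, years: int) -> float:
    d = d0
    for _ in range(years):
        d = d * (1 + r) / (1 + g) + primary_deficit
    return d

start = 1.00        # ~100% debt-to-GDP today
r = 0.04            # average nominal interest rate on the debt
deficit = 0.03      # primary deficit as a share of GDP

for label, g in [("baseline nominal growth", 0.04), ("AGI-boosted nominal growth", 0.07)]:
    d_2034 = project_debt_ratio(start, r, g, deficit, years=9)
    print(f"{label}: debt-to-GDP in 2034 ≈ {d_2034:.0%}")
# Faster growth only stabilises the ratio if the gains show up as taxable GDP –
# which is exactly the assumption questioned above.
```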

Still, Coatue’s provocation is a useful springboard for ideation. If AGI reconfigures productivity itself, how should we rethink the metrics and institutions around it? What replaces GDP when so much economic activity is generated by open-weight models or intelligence priced at zero? Could governments one day issue cognition-backed bonds – claims not on future labor, but on future machine-generated services? And if human work no longer anchors the tax base, do we begin taxing compute or auctioning AI time?

I discussed some of these ideas for the next twenty years with economist Tyler Cowen, if you’d like to dig deeper.

Exponential View is supported by its readers. Thanks to all the members who help keep our work ambitious and evolving.


A general-purpose robot

We may be further along the robotics maturity curve than is widely appreciated, argues the team at SemiAnalysis. They’ve mapped out a five-stage framework tracking how robots are progressing from rigid, pre-programmed machines (Level 0) to autonomous systems capable of fine, human-like manipulation (Level 4). We’re now in Levels 2 and 3: robots are navigating messy environments and performing some low-skill tasks like folding laundry, cooking and warehouse restocking.

This is as much about technology as it is about economics and labor markets. In parcel logistics, SemiAnalysis estimates that ten robots can match the output of 23 human sorters. They go on to show that per-pick costs for robots fall below human rates in just over a year.
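Here is a rough, entirely hypothetical break-even sketch of that per-pick comparison – the wage, robot price and throughput figures are my assumptions, not SemiAnalysis's data:

```python
# Rough, entirely hypothetical per-pick break-even; every figure is an assumption
# for illustration, not SemiAnalysis's data.

robot_capex = 100_000            # $ up-front per robot
robot_opex_per_year = 30_000     # $ maintenance, power, supervision
robot_picks_per_year = 1_500_000

human_cost_per_year = 45_000     # fully loaded cost of one sorter
human_picks_per_year = 650_000   # ~10 robots ≈ 23 humans in throughput terms

human_cost_per_pick = human_cost_per_year / human_picks_per_year

for year in range(1, 4):
    robot_total_cost = robot_capex + robot_opex_per_year * year
    robot_cost_per_pick = robot_total_cost / (robot_picks_per_year * year)
    verdict = "robot cheaper" if robot_cost_per_pick < human_cost_per_pick else "human cheaper"
    print(f"Year {year}: robot ${robot_cost_per_pick:.3f}/pick vs human ${human_cost_per_pick:.3f}/pick -> {verdict}")
# With these assumptions the robot's per-pick cost drops below the human rate
# somewhere between year one and two.
```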

As robots gain dexterity and tactile intelligence, we’ll step into Level 4 where the scope of automatable work could expand dramatically. Level 4 would cross into tasks we thought robots couldn’t touch, like skilled trades, fine-grain manufacturing, or even caregiving.

Elsewhere:

In tech + AI:

  • OpenAI’s rocky road to GPT-5 mirrors the broader industry’s move toward maturity, where sustaining innovation requires new methods and strategic adaptation.

  • The talent wars are in full swing. Apple has lost its fourth AI researcher in a month to Meta. Zuck is also targeting Mira Murati’s team, offering as much as $1 billion in multi-year contracts. All have refused (so far).

  • A stealth language model named ‘horizon-alpha’ – widely believed to be from OpenAI – has taken the top spot on EQ-Bench, a benchmark that evaluates emotional intelligence in language models. It also ranks high in longform and creative writing.

  • Chinese startup Manus is building a platform to orchestrate teams of AI agents for complex research and reasoning tasks. Manus is looking to outmaneuver bigger Western players by breaking down hard problems into auditable, multi-agent workflows. For a deep dive on the future of orchestrating AIs, see our essay about the billion-agent future.

  • Neuralink is launching its first clinical trial in the UK.

  • For the first time in humans, scientists have reprogrammed a patient’s own stem cells to continuously produce cancer-fighting T cells. This could pave the way for durable, self-renewing immunotherapy treatments.

In society + culture:

  • One demographer argues that the much-discussed decline in fertility is largely an artefact of how we measure it. We should track the number of children who survive to puberty, rather than merely counting births. In most cases, the “decline” in total fertility rates is the result of improved child survival, not changes in parental reproductive intent.

  • A study of over 300 music teachers in China reveals that positive attitudes toward technology matter more for adoption than technical competence alone. Teachers with strong tech skills won’t use technology unless they first believe it’s beneficial and easy to integrate.

Inside companies:

Thanks for reading! Today’s edition is open to everyone – if you found it valuable, share it widely.

1. Debt-to-GDP measures a country’s total public debt relative to its economic output. Lower ratios generally indicate more fiscal room and stability.

2. Note: We don’t have access to Coatue’s underlying model, a key input in any analysis.