RSS preview of Blog of MIT Technology Review

How cloud and AI transform and improve customer experiences

2025-05-09 23:17:09

As AI technologies become increasingly mainstream, there’s mounting competitive pressure to transform traditional infrastructures and technology stacks. Traditional brick-and-mortar companies are finding cloud and data to be the foundational keys to unlocking their paths to digital transformation, and to competing in modern, AI-forward industry landscapes. 

In this exclusive webcast, experts discuss the building blocks for digital transformation, approaches for upskilling employees and putting digital processes in place, and data management best practices. The discussion also looks at what the near future holds and emphasizes the urgency for companies to transform now to stay relevant. 

Learn from the experts

  • Digital transformation, from the ground up, starts by moving infrastructure and data to the cloud
  • AI implementation requires a talent transformation at scale, across the organization
  • AI is a company-wide initiative—everyone in the company will become either an AI creator or consumer

Featured speakers

Mohammed Rafee Tarafdar, Chief Technology Officer, Infosys

Rafee is Infosys’s Chief Technology Officer. He is responsible for the company’s technology vision and strategy, for sensing and scaling emerging technologies, for advising and partnering with clients to help them succeed in their AI transformation journeys, and for building high technology talent density. He is leading the AI First transformation journey for Infosys and has implemented population- and enterprise-scale platforms. He is the co-author of the book “The Live Enterprise” and was recognized as a top 50 global technology leader by Forbes in 2023 and a Top 25 Tech Wavemaker by Entrepreneur India magazine in 2024.

Sam Jaddi, Chief Information Officer, ADT

Sam Jaddi is the Chief Information Officer for ADT. With more than 26 years of experience in technology innovation, Sam has deep knowledge of the security and smart home industry. His team helps drive ADT’s business platforms and processes to improve both customer and employee experiences. Sam has helped set the technology strategy, vision, and direction for the company’s digital transformation. Before joining ADT, he served as Chief Technology Officer at Stanley, where he oversaw the company’s new security division and led global integration initiatives, IT strategy, transformation, and international operations.

This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff.

The Download: AI headphone translation, and the link between microbes and our behavior

2025-05-09 20:10:00

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

A new AI translation system for headphones clones multiple voices simultaneously

What’s new: Imagine going for dinner with a group of friends who switch in and out of different languages you don’t speak, but still being able to understand what they’re saying. This scenario is the inspiration for a new AI headphone system that translates the speech of multiple speakers simultaneously, in real time.

How it works: The system tracks the direction and vocal characteristics of each speaker, helping the person wearing the headphones to identify who is saying what in a group setting. Read the full story.

—Rhiannon Williams

Your gut microbes might encourage criminal behavior

A few years ago, a Belgian man in his 30s drove into a lamppost. Twice. Local authorities found that his blood alcohol level was four times the legal limit. Over the space of a few years, the man was apprehended for drunk driving three times. And on all three occasions, he insisted he hadn’t been drinking.

He was telling the truth. A doctor later diagnosed auto-brewery syndrome—a rare condition in which the body makes its own alcohol. Microbes living inside the man’s body were fermenting the carbohydrates in his diet to create ethanol. Last year, he was acquitted of drunk driving.

His case, along with several other scientific studies, raises a fascinating question for microbiology, neuroscience, and the law: How much of our behavior can we blame on our microbes? Read the full story.

—Jessica Hamzelou

This article first appeared in The Checkup, MIT Technology Review’s weekly biotech newsletter. To receive it in your inbox every Thursday, and read articles like this first, sign up here.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 How the Gates Foundation will end
Bill Gates will wind it down in 2045, after distributing most of his remaining fortune. (NYT $)
+ He estimates he’ll give away $200 billion in the next 20 years. (Semafor)
+ The foundation is shuttering several decades earlier than he expected. (BBC)

2 US Customs and Border Protection will no longer protect pregnant women
It’s rolled back policies designed to protect vulnerable people, including infants. (Wired $)
+ The US wants to use facial recognition to identify migrant children as they age. (MIT Technology Review)

3 DOGE is readying software to turbo-charge mass layoffs
After some 260,000 government workers have already been let go. (Reuters)
+ DOGE’s math doesn’t add up. (The Atlantic $)
+ One of its biggest inspirations is no fan of the program. (WP $)
+ Can AI help DOGE slash government budgets? It’s complex. (MIT Technology Review)

4 Scientists are using AI to predict cancer survival outcomes
In some cases, it’s outperforming clinicians’ forecasts. (FT $)
+ Why it’s so hard to use AI to diagnose cancer. (MIT Technology Review)

5 Apple is reportedly working on new chips for its smart glasses
But we’ll have to wait a few more years. (Bloomberg $)
+ What’s next for smart glasses. (MIT Technology Review)

6 Silicon Valley has a vision for the future of warfare
Military technologies are no longer solely the preserve of governments. (Bloomberg $)
+ Palmer Luckey on the Pentagon’s future of mixed reality. (MIT Technology Review)

7 AI companies don’t want regulation any more
Just a few short years after they claimed regulation was the best way of making AI safe. (WP $)

8 Forget SEO, GEO is where it’s at these days
Marketers are scrambling to adopt Generative Engine Optimization best practices now that AI is upending how we search the web. (WSJ $)
+ Your most important customer may be AI. (MIT Technology Review)

9 AI-generated recruiters are making job hunting even worse
Avatars can glitch out and stumble over their words. (404 Media)

10 A Soviet-era spacecraft is reentering Earth’s atmosphere
More than 50 years after it misfired on a journey to Venus. (Ars Technica)
+ The world’s next big environmental problem could come from space. (MIT Technology Review)

Quote of the day

“The picture of the world’s richest man killing the world’s poorest children is not a pretty one.”

—Bill Gates lashes out at Elon Musk’s cuts to USAID in an interview with the Financial Times.

One more thing

The great commercial takeover of low Earth orbit

NASA designed the International Space Station to fly for 20 years. It has lasted six years longer than that, though it is showing its age, and NASA is currently studying how to safely destroy the space laboratory by around 2030.

The ISS never really became what some had hoped: a launching point for an expanding human presence in the solar system. But it did enable fundamental research on materials and medicine, and it helped us start to understand how space affects the human body.

To build on that work, NASA has partnered with private companies to develop new, commercial space stations for research, manufacturing, and tourism. If they are successful, these companies will bring about a new era of space exploration: private rockets flying to private destinations. They’re already planning to do it around the moon. One day, Mars could follow. Read the full story.

—David W. Brown

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet ’em at me.)

+ It’s almost pasta salad time!
+ Who is the better fictional archaeologist: Indiana Jones or Lara Croft?
+ How a good night’s sleep could help to give you a long-lasting memory boost. 😴
+ How millennials became deeply uncool (allegedly)

A new AI translation system for headphones clones multiple voices simultaneously

2025-05-09 17:00:00

Imagine going for dinner with a group of friends who switch in and out of different languages you don’t speak, but still being able to understand what they’re saying. This scenario is the inspiration for a new AI headphone system that translates the speech of multiple speakers simultaneously, in real time.

The system, called Spatial Speech Translation, tracks the direction and vocal characteristics of each speaker, helping the person wearing the headphones to identify who is saying what in a group setting. 

“There are so many smart people across the world, and the language barrier prevents them from having the confidence to communicate,” says Shyam Gollakota, a professor at the University of Washington, who worked on the project. “My mom has such incredible ideas when she’s speaking in Telugu, but it’s so hard for her to communicate with people in the US when she visits from India. We think this kind of system could be transformative for people like her.”

While there are plenty of other live AI translation systems out there, such as the one running on Meta’s Ray-Ban smart glasses, they focus on a single speaker, not multiple people speaking at once, and deliver robotic-sounding automated translations. The new system is designed to work with existing, off-the-shelf noise-canceling headphones that have microphones, plugged into a laptop powered by Apple’s M2 silicon chip, which can support neural networks. The same chip is also present in the Apple Vision Pro headset. The research was presented at the ACM CHI Conference on Human Factors in Computing Systems in Yokohama, Japan, this month.

Over the past few years, large language models have driven big improvements in speech translation. As a result, translation between languages for which lots of training data is available (such as the four languages used in this study) is close to perfect on apps like Google Translate or in ChatGPT. But it’s still not seamless and instant across many languages. That’s a goal a lot of companies are working toward, says Alina Karakanta, an assistant professor at Leiden University in the Netherlands, who studies computational linguistics and was not involved in the project. “I feel that this is a useful application. It can help people,” she says. 

Spatial Speech Translation consists of two AI models, the first of which divides the space surrounding the person wearing the headphones into small regions and uses a neural network to search for potential speakers and pinpoint their direction. 

The second model then translates the speakers’ words from French, German, or Spanish into English text using publicly available data sets. The same model extracts the unique characteristics and emotional tone of each speaker’s voice, such as the pitch and the amplitude, and applies those properties to the text, essentially creating a “cloned” voice. This means that when the translated version of a speaker’s words is relayed to the headphone wearer a few seconds later, it sounds as if it’s coming from the speaker’s direction and the voice sounds a lot like the speaker’s own, not a robotic-sounding computer.
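
To make that two-stage design concrete, here is a minimal, hypothetical Python sketch of how such a pipeline could be wired together. The function and class names (locate_speakers, translate, clone_voice, spatialize) are illustrative stand-ins for the neural models the article describes, not the researchers’ actual code.

```python
# Hypothetical sketch of a two-stage spatial speech translation pipeline.
# Each helper is a placeholder for a neural model described in the article.
from dataclasses import dataclass

@dataclass
class SpeakerSegment:
    direction_deg: float  # estimated direction of arrival
    audio: list[float]    # speech samples attributed to this speaker
    pitch: float          # crude stand-ins for the speaker's vocal characteristics
    amplitude: float

def locate_speakers(mic_frames: list[float]) -> list[SpeakerSegment]:
    # Model 1: scan small regions around the wearer and return one segment
    # per detected speaker (dummy output for illustration).
    return [SpeakerSegment(direction_deg=45.0, audio=mic_frames, pitch=180.0, amplitude=0.6)]

def translate(audio: list[float]) -> str:
    # Model 2a: speech recognition plus translation into English (placeholder).
    return "translated text"

def clone_voice(text: str, pitch: float, amplitude: float) -> list[float]:
    # Model 2b: synthesize the translation with the speaker's pitch and loudness (placeholder).
    return [amplitude] * len(text)

def spatialize(audio: list[float], direction_deg: float) -> list[float]:
    # Render the audio so it seems to arrive from the speaker's direction (placeholder).
    return audio

def process_frame(mic_frames: list[float]) -> list[list[float]]:
    outputs = []
    for seg in locate_speakers(mic_frames):                  # who is speaking, and from where
        text = translate(seg.audio)                          # what they said, in English
        voice = clone_voice(text, seg.pitch, seg.amplitude)  # in a voice like theirs
        outputs.append(spatialize(voice, seg.direction_deg)) # played back from their direction
    return outputs

print(len(process_frame([0.0] * 480)))  # one translated, spatialized stream per speaker
```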

Given that separating out human voices is hard enough for AI systems, being able to incorporate that ability into a real-time translation system, map the distance between the wearer and the speaker, and achieve decent latency on a real device is impressive, says Samuele Cornell, a postdoc researcher at Carnegie Mellon University’s Language Technologies Institute, who did not work on the project.

“Real-time speech-to-speech translation is incredibly hard,” he says. “Their results are very good in the limited testing settings. But for a real product, one would need much more training data—possibly with noise and real-world recordings from the headset, rather than purely relying on synthetic data.”

Gollakota’s team is now focusing on reducing the amount of time it takes for the AI translation to kick in after a speaker says something, which will accommodate more natural-sounding conversations between people speaking different languages. “We want to really get down that latency significantly to less than a second, so that you can still have the conversational vibe,” Gollakota says.

This remains a major challenge, because the speed at which an AI system can translate one language into another depends on the languages’ structure. Of the three languages Spatial Speech Translation was trained on, the system was quickest to translate French into English, followed by Spanish and then German—reflecting how German, unlike the other languages, places a sentence’s verbs and much of its meaning at the end and not at the beginning, says Claudio Fantinuoli, a researcher at the Johannes Gutenberg University of Mainz in Germany, who did not work on the project. 

Reducing the latency could make the translations less accurate, he warns: “The longer you wait [before translating], the more context you have, and the better the translation will be. It’s a balancing act.”
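
That balancing act can be illustrated with a toy example. In the simultaneous-translation literature, a “wait-k” policy emits its t-th output word only after t + k - 1 source words have arrived; the sketch below (an illustration of the general idea, not the researchers’ method) shows how a larger k delays the first output but provides the sentence-final German verb earlier relative to the output.

```python
# Toy illustration of the latency/context tradeoff in simultaneous translation
# (not the researchers' actual method): a "wait-k" policy emits its t-th output
# word only after t + k - 1 source words have arrived, so more context costs delay.
def wait_k_schedule(num_source_words: int, k: int) -> list[int]:
    """For each output position t, return how many source words have been seen."""
    return [min(t + k - 1, num_source_words) for t in range(1, num_source_words + 1)]

source = "ich habe gestern den neuen Film gesehen".split()  # German: the verb arrives last
for k in (1, 3, len(source)):
    schedule = wait_k_schedule(len(source), k)
    verb_seen_at = next(t for t, n in enumerate(schedule, 1) if n == len(source))
    print(f"wait-{k}: first output after {schedule[0]} source word(s); "
          f"the sentence-final verb is available from output position {verb_seen_at}")
```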

Your gut microbes might encourage criminal behavior

2025-05-09 17:00:00

A few years ago, a Belgian man in his 30s drove into a lamppost. Twice. Local authorities found that his blood alcohol level was four times the legal limit. Over the space of a few years, the man was apprehended for drunk driving three times. And on all three occasions, he insisted he hadn’t been drinking.

He was telling the truth. A doctor later diagnosed auto-brewery syndrome—a rare condition in which the body makes its own alcohol. Microbes living inside the man’s body were fermenting the carbohydrates in his diet to create ethanol. Last year, he was acquitted of drunk driving.

His case, along with several other scientific studies, raises a fascinating question for microbiology, neuroscience, and the law: How much of our behavior can we blame on our microbes?

Each of us hosts vast communities of tiny bacteria, archaea (which are a bit like bacteria), fungi, and even viruses all over our bodies. The largest collection resides in our guts, which are home to trillions of them. You have more microbial cells than human cells in your body. In some ways, we’re more microbe than human.

Microbiologists are still getting to grips with what all these microbes do. Some seem to help us break down food. Others produce chemicals that are important for our health in some way. But the picture is extremely complicated, partly because of the myriad ways microbes can interact with each other.

But they also interact with the human nervous system. Microbes can produce compounds that affect the way neurons work. They also influence the functioning of the immune system, which can have knock-on effects on the brain. And they seem to be able to communicate with the brain via the vagus nerve.

If microbes can influence our brains, could they also explain some of our behavior, including the criminal sort? Some microbiologists think so, at least in theory. “Microbes control us more than we think they do,” says Emma Allen-Vercoe, a microbiologist at the University of Guelph in Canada.

Researchers have come up with a name for applications of microbiology to criminal law: the legalome. A better understanding of how microbes influence our behavior could not only affect legal proceedings but also shape crime prevention and rehabilitation efforts, argue Susan Prescott, a pediatrician and immunologist at the University of Western Australia, and her colleagues.

“For the person unaware that they have auto-brewery syndrome, we can argue that microbes are like a marionettist pulling the strings in what would otherwise be labeled as criminal behavior,” says Prescott.

Auto-brewery syndrome is a fairly straightforward example (it has been involved in the acquittal of at least two people so far), but other brain-microbe relationships are likely to be more complicated. We do know a little about one microbe that seems to influence behavior: Toxoplasma gondii, a parasite that reproduces in cats and spreads to other animals via cat feces.

The parasite is best known for changing the behavior of rodents in ways that make them easier prey—an infection seems to make mice permanently lose their fear of cats. Research in humans is nowhere near conclusive, but some studies have linked infections with the parasite to personality changes, increased aggression, and impulsivity.

“That’s an example of microbiology that we know affects the brain and could potentially affect the legal standpoint of someone who’s being tried for a crime,” says Allen-Vercoe. “They might say ‘My microbes made me do it,’ and I might believe them.”

There’s more evidence linking gut microbes to behavior in mice, which are some of the most well-studied creatures. One study involved fecal transplants—a procedure that involves inserting fecal matter from one animal into the intestines of another. Because feces contain so many gut bacteria, fecal transplants can go some way toward swapping out a gut microbiome. (Humans are doing this too—and it seems to be a remarkably effective way to treat persistent C. difficile infections in people.)

Back in 2013, scientists at McMaster University in Canada performed fecal transplants between two strains of mice, one that is known for being timid and another that tends to be rather gregarious. This swapping of gut microbes also seemed to swap their behavior—the timid mice became more gregarious, and vice versa.

Microbiologists have since held up this study as one of the clearest demonstrations of how changing gut microbes can change behavior—at least in mice. “But the question is: How much do they control you, and how much is the human part of you able to overcome that control?” says Allen-Vercoe. “And that’s a really tough question to answer.”

After all, our gut microbiomes, though relatively stable, can change. Your diet, exercise routine, environment, and even the people you live with can shape the communities of microbes that live on and in you. And the ways these communities shift and influence behavior might be slightly different for everyone. Pinning down precise links between certain microbes and criminal behaviors will be extremely difficult, if not impossible. 

“I don’t think you’re going to be able to take someone’s microbiome and say ‘Oh, look—you’ve got bug X, and that means you’re a serial killer,’” says Allen-Vercoe.

Either way, Prescott hopes that advances in microbiology and metabolomics might help us better understand the links between microbes, the chemicals they produce, and criminal behaviors—and potentially even treat those behaviors.

“We could get to a place where microbial interventions are a part of therapeutic programming,” she says.

This article first appeared in The Checkup, MIT Technology Review’s weekly biotech newsletter. To receive it in your inbox every Thursday, and read articles like this first, sign up here.

The Download: AI benchmarks, and Spain’s grid blackout

2025-05-08 20:10:00

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

How to build a better AI benchmark

It’s not easy being one of Silicon Valley’s favorite benchmarks. 

SWE-Bench (pronounced “swee bench”) launched in November 2024 as a way to evaluate an AI model’s coding skill. It has since quickly become one of the most popular tests in AI. A SWE-Bench score has become a mainstay of major model releases from OpenAI, Anthropic, and Google—and outside of foundation models, the fine-tuners at AI firms are in constant competition to see who can rise above the pack.

Despite all the fervor, this isn’t exactly a truthful assessment of which model is “better.” Entrants have begun to game the system—which is pushing many others to wonder whether there’s a better way to actually measure AI achievement. Read the full story.

—Russell Brandom

Did solar power cause Spain’s blackout?

At roughly midday on Monday, April 28, the lights went out in Spain. The grid blackout, which extended into parts of Portugal and France, affected tens of millions of people—flights were grounded, cell networks went down, and businesses closed for the day.

Over a week later, officials still aren’t entirely sure what happened, but some have suggested that renewables may have played a role, because just before the outage happened, wind and solar accounted for about 70% of electricity generation. Others, including Spanish government officials, insist that it’s too early to assign blame.

It’ll take weeks to get the full report, but we do know a few things about what happened. Here are a few takeaways that could help our future grid. 

—Casey Crownhart

This article is from The Spark, MIT Technology Review’s weekly climate newsletter. To receive it in your inbox every Wednesday, sign up here.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 The Trump administration will repeal some global chip curbs 
It’s drawing up new rules that prioritize direct negotiations with various nations. (Bloomberg $)
+ The curbs have always been leaky anyway. (Economist $)

2 India and Pakistan have accused each other of overnight drone attacks
The conflict between the two countries is rapidly escalating. (The Guardian)
+ Pakistan claims to have shot down 25 drones in its airspace. (Reuters)
+ Mass-market military drones have changed the way wars are fought. (MIT Technology Review)

3 The FDA is interested in using AI for drug evaluation
And has met with OpenAI to hear more about how to do it. (Wired $)
+ An AI-driven “factory of drugs” claims to have hit a big milestone. (MIT Technology Review)

4 The US is pushing nations facing its tariffs to adopt Starlink
Government officials in India and other countries have fast-tracked approvals. (WP $)
+ India recently announced new rules for satellite internet providers. (Rest of World)

5 Apple is overhauling its Safari browser to focus on AI search
Its search volume is down for the first time in 22 years. (The Verge)
+ Apple exec Eddy Cue thinks AI search will replace traditional search engines. (Bloomberg $)
+ AI means the end of internet search as we’ve known it. (MIT Technology Review)

6 Mark Zuckerberg is betting big on AI chatbots
He’s on a media charm offensive to convince us that AI friends are the future. (WSJ $)
+ The AI relationship revolution is already here. (MIT Technology Review)

7 Students can’t wean themselves off ChatGPT
And experts fear that they’ll emerge into the workforce essentially illiterate. (NY Mag $)
+ Some educators believe that AI highlights how the ways we teach need to change. (MIT Technology Review)

8 We don’t really know how memory works 🧠
But these researchers are doing their best to find out. (Quanta Magazine)

9 The vast majority of the sea depths are still unexplored
What lies beneath is a mystery. (New Scientist $)
+ Meet the divers trying to figure out how deep humans can go. (MIT Technology Review)

10 Pet psychics are taking over TikTok 🔮
But does your furry friend have anything to say? (NYT $)
+ Humans are still better than AI at futuregazing—for now. (Vox)
+ How DeepSeek became a fortune teller for China’s youth. (MIT Technology Review)

Quote of the day

“It’s like living in hell.”

—Elizabeth Martorana, a Virginia resident, describes what it’s like to live in a development zone for Amazon, Microsoft, and Google data centers, Semafor reports.

One more thing

How Antarctica’s history of isolation is ending—thanks to Starlink

“This is one of the least visited places on planet Earth and I got to open the door,” Matty Jordan, a construction specialist at New Zealand’s Scott Base in Antarctica, wrote in the caption to the video he posted to Instagram and TikTok in October 2023.

In the video, he guides viewers through the hut, pointing out where the men of Ernest Shackleton’s 1907 expedition lived and worked.

The video has racked up millions of views from all over the world. It’s also kind of a miracle: until very recently, those who lived and worked on Antarctic bases had no hope of communicating so readily with the outside world.

That’s starting to change, thanks to Starlink, the satellite constellation developed by Elon Musk’s company SpaceX to service the world with high-speed broadband internet. Read the full story.

—Allegra Rosenberg

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet ’em at me.)

+ Does Boston still drink? Not in the same way it used to.
+ Where in the US you should set up camp to stargaze right now.
+ Wow: this New Zealand snail lays eggs from its neck. 🐌
+ Jurassic World Rebirth is coming: and it looks suitably bonkers.

How to build a better AI benchmark

2025-05-08 17:00:00

It’s not easy being one of Silicon Valley’s favorite benchmarks. 

SWE-Bench (pronounced “swee bench”) launched in November 2024 to evaluate an AI model’s coding skill, using more than 2,000 real-world programming problems pulled from the public GitHub repositories of 12 different Python-based projects. 
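
Broadly speaking, each SWE-Bench task pairs a repository snapshot and an issue description with a set of tests, and a model is scored on whether its proposed fix makes those tests pass. The sketch below is a simplified, hypothetical harness in that spirit; the function names and the Task layout are illustrative, not the official evaluation code.

```python
# Simplified sketch of a SWE-Bench-style evaluation loop: the model proposes a
# patch for each task, the harness applies it and runs the task's tests, and the
# overall score is the fraction of tasks resolved. Illustrative names only.
import subprocess
from dataclasses import dataclass

@dataclass
class Task:
    repo_dir: str        # checkout of the project at the buggy commit
    issue_text: str      # the GitHub issue the model must resolve
    test_cmd: list[str]  # tests that should pass once the issue is fixed

def propose_patch(task: Task) -> str:
    """Ask the model under evaluation for a unified diff (placeholder)."""
    raise NotImplementedError

def resolved(task: Task, patch: str) -> bool:
    applied = subprocess.run(["git", "-C", task.repo_dir, "apply", "-"],
                             input=patch, text=True)
    if applied.returncode != 0:
        return False
    tests = subprocess.run(task.test_cmd, cwd=task.repo_dir)
    return tests.returncode == 0

def score(tasks: list[Task]) -> float:
    return sum(resolved(t, propose_patch(t)) for t in tasks) / len(tasks)
```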

In the months since then, it’s quickly become one of the most popular tests in AI. A SWE-Bench score has become a mainstay of major model releases from OpenAI, Anthropic, and Google—and outside of foundation models, the fine-tuners at AI firms are in constant competition to see who can rise above the pack. The top of the leaderboard is a pileup of three different fine-tunings of Anthropic’s Claude Sonnet model and Amazon’s Q developer agent. Auto Code Rover—one of the Claude modifications—nabbed the number two spot in November, and was acquired just three months later.

Despite all the fervor, this isn’t exactly a truthful assessment of which model is “better.” As the benchmark has gained prominence, “you start to see that people really want that top spot,” says John Yang, a researcher on the team that developed SWE-Bench at Princeton University. As a result, entrants have begun to game the system—which is pushing many others to wonder whether there’s a better way to actually measure AI achievement.

Developers of these coding agents aren’t necessarily doing anything as straightforward as cheating, but they’re crafting approaches that are too neatly tailored to the specifics of the benchmark. The initial SWE-Bench test set was limited to programs written in Python, which meant developers could gain an advantage by training their models exclusively on Python code. Soon, Yang noticed that high-scoring models would fail completely when tested on different programming languages—revealing an approach to the test that he describes as “gilded.”

“It looks nice and shiny at first glance, but then you try to run it on a different language and the whole thing just kind of falls apart,” Yang says. “At that point, you’re not designing a software engineering agent. You’re designing to make a SWE-Bench agent, which is much less interesting.”

The SWE-Bench issue is a symptom of a more sweeping—and complicated—problem in AI evaluation, and one that’s increasingly sparking heated debate: The benchmarks the industry uses to guide development are drifting further and further away from evaluating actual capabilities, calling their basic value into question. Making the situation worse, several benchmarks, most notably FrontierMath and Chatbot Arena, have recently come under fire for an alleged lack of transparency. Nevertheless, benchmarks still play a central role in model development, even if few experts are willing to take their results at face value. OpenAI cofounder Andrej Karpathy recently described the situation as “an evaluation crisis”: the industry has fewer trusted methods for measuring capabilities and no clear path to better ones.

“Historically, benchmarks were the way we evaluated AI systems,” says Vanessa Parli, director of research at Stanford University’s Institute for Human-Centered AI. “Is that the way we want to evaluate systems going forward? And if it’s not, what is the way?”

A growing group of academics and AI researchers are making the case that the answer is to go smaller, trading sweeping ambition for an approach inspired by the social sciences. Specifically, they want to focus more on testing validity, which for quantitative social scientists refers to how well a given questionnaire measures what it’s claiming to measure—and, more fundamentally, whether what it is measuring has a coherent definition. That could cause trouble for benchmarks assessing hazily defined concepts like “reasoning” or “scientific knowledge”—and for developers aiming to reach the much-hyped goal of artificial general intelligence—but it would put the industry on firmer ground as it looks to prove the worth of individual models.

“Taking validity seriously means asking folks in academia, industry, or wherever to show that their system does what they say it does,” says Abigail Jacobs, a University of Michigan professor who is a central figure in the new push for validity. “I think it points to a weakness in the AI world if they want to back off from showing that they can support their claim.”

The limits of traditional testing

If AI companies have been slow to respond to the growing failure of benchmarks, it’s partially because the test-scoring approach has been so effective for so long. 

One of the biggest early successes of contemporary AI was the ImageNet challenge, a kind of antecedent to contemporary benchmarks. Released in 2010 as an open challenge to researchers, the database held more than 3 million images for AI systems to categorize into 1,000 different classes.

Crucially, the test was completely agnostic to methods, and any successful algorithm quickly gained credibility regardless of how it worked. When an algorithm called AlexNet broke through in 2012, with a then-unconventional form of GPU training, it became one of the foundational results of modern AI. Few would have guessed in advance that AlexNet’s convolutional neural nets would be the secret to unlocking image recognition—but after it scored well, no one dared dispute it. (One of AlexNet’s developers, Ilya Sutskever, would go on to cofound OpenAI.)
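
Part of what “agnostic to methods” means in practice is that the scoring itself is trivially simple: the benchmark compares predicted labels to ground truth and nothing else. A minimal sketch:

```python
# Minimal sketch of method-agnostic scoring, ImageNet-style: only the predicted
# class labels matter, regardless of how the model produced them.
def top1_accuracy(predictions: list[int], labels: list[int]) -> float:
    assert len(predictions) == len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

print(top1_accuracy([3, 7, 7, 1], [3, 7, 2, 1]))  # -> 0.75
```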

A large part of what made this challenge so effective was that there was little practical difference between ImageNet’s object classification challenge and the actual process of asking a computer to recognize an image. Even if there were disputes about methods, no one doubted that the highest-scoring model would have an advantage when deployed in an actual image recognition system.

But in the 12 years since, AI researchers have applied that same method-agnostic approach to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-style benchmarks often stand in for reasoning ability. That broad scope makes it difficult to be rigorous about what a specific benchmark measures—which, in turn, makes it hard to use the findings responsibly. 

Where things break down

Anka Reuel, a PhD student who has been focusing on the benchmark problem as part of her research at Stanford, has become convinced the evaluation problem is the result of this push toward generality. “We’ve moved from task-specific models to general-purpose models,” Reuel says. “It’s not about a single task anymore but a whole bunch of tasks, so evaluation becomes harder.”

Like the University of Michigan’s Jacobs, Reuel thinks “the main issue with benchmarks is validity, even more than the practical implementation,” noting: “That’s where a lot of things break down.” For a task as complicated as coding, for instance, it’s nearly impossible to incorporate every possible scenario into your problem set. As a result, it’s hard to gauge whether a model is scoring better because it’s more skilled at coding or because it has more effectively manipulated the problem set. And with so much pressure on developers to achieve record scores, shortcuts are hard to resist.

For developers, the hope is that success on lots of specific benchmarks will add up to a generally capable model. But the techniques of agentic AI mean a single AI system can encompass a complex array of different models, making it hard to evaluate whether improvement on a specific task will lead to generalization. “There’s just many more knobs you can turn,” says Sayash Kapoor, a computer scientist at Princeton and a prominent critic of sloppy practices in the AI industry. “When it comes to agents, they have sort of given up on the best practices for evaluation.”

In a paper from last July, Kapoor called out specific issues in how AI models were approaching the WebArena benchmark, designed by Carnegie Mellon University researchers in 2024 as a test of an AI agent’s ability to traverse the web. The benchmark consists of more than 800 tasks to be performed on a set of cloned websites mimicking Reddit, Wikipedia, and others. Kapoor and his team identified an apparent hack in the winning model, called STeP. STeP included specific instructions about how Reddit structures URLs, allowing STeP models to jump directly to a given user’s profile page (a frequent element of WebArena tasks).

This shortcut wasn’t exactly cheating, but Kapoor sees it as “a serious misrepresentation of how well the agent would work had it seen the tasks in WebArena for the first time.” Because the technique was successful, though, a similar policy has since been adopted by OpenAI’s web agent Operator. (“Our evaluation setting is designed to assess how well an agent can solve tasks given some instruction about website structures and task execution,” an OpenAI representative said when reached for comment. “This approach is consistent with how others have used and reported results with WebArena.” STeP did not respond to a request for comment.)
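
The article doesn’t reproduce STeP’s prompt, but the kind of shortcut it describes (baking knowledge of a specific site’s URL scheme into the agent rather than having it navigate) can be sketched hypothetically. Everything below, including the FakeBrowser class, is schematic and is not STeP’s or Operator’s actual code.

```python
# Hypothetical illustration of the shortcut described above: an agent that
# "knows" how the cloned Reddit structures its URLs can jump straight to a
# profile page instead of navigating there. Schematic only.
class FakeBrowser:
    def __init__(self) -> None:
        self.history: list[str] = []

    def goto(self, url: str) -> None:
        self.history.append(url)

def site_agnostic_agent(browser: FakeBrowser, username: str) -> None:
    # What a general agent has to do: start at the front page and work its way there.
    browser.goto("/")
    browser.goto("/search?q=" + username)
    browser.goto(f"/user/{username}")  # reached only after several steps

def url_aware_agent(browser: FakeBrowser, username: str) -> None:
    # Benchmark-specific knowledge of the URL scheme: one step and a higher score,
    # but it says little about how the agent would fare on an unfamiliar site.
    browser.goto(f"/user/{username}")

b1, b2 = FakeBrowser(), FakeBrowser()
site_agnostic_agent(b1, "alice")
url_aware_agent(b2, "alice")
print(len(b1.history), "steps vs.", len(b2.history), "step")
```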

Further highlighting the problem with AI benchmarks, late last month Kapoor and a team of researchers wrote a paper that revealed significant problems in Chatbot Arena, the popular crowdsourced evaluation system. According to the paper, the leaderboard was being manipulated; many top foundation models were conducting undisclosed private testing and releasing their scores selectively.

Today, even ImageNet itself, the mother of all benchmarks, has started to fall victim to validity problems. A 2023 study from researchers at the University of Washington and Google Research found that when ImageNet-winning algorithms were pitted against six real-world data sets, the architecture improvement “resulted in little to no progress,” suggesting that the external validity of the test had reached its limit.

Going smaller

For those who believe the main problem is validity, the best fix is reconnecting benchmarks to specific tasks. As Reuel puts it, AI developers “have to resort to these high-level benchmarks that are almost meaningless for downstream consumers, because the benchmark developers can’t anticipate the downstream task anymore.” So what if there was a way to help the downstream consumers identify this gap?

In November 2024, Reuel launched a public ranking project called BetterBench, which rates benchmarks on dozens of different criteria, such as whether the code has been publicly documented. But validity is a central theme, with particular criteria challenging designers to spell out what capability their benchmark is testing and how it relates to the tasks that make up the benchmark.

“You need to have a structural breakdown of the capabilities,” Reuel says. “What are the actual skills you care about, and how do you operationalize them into something we can measure?”

The results are surprising. One of the highest-scoring benchmarks is also the oldest: the Arcade Learning Environment (ALE), established in 2013 as a way to test models’ ability to learn how to play a library of Atari 2600 games. One of the lowest-scoring is the Massive Multitask Language Understanding (MMLU) benchmark, a widely used test for general language skills; by the standards of BetterBench, the connection between the questions and the underlying skill was too poorly defined.

BetterBench hasn’t meant much for the reputations of specific benchmarks, at least not yet; MMLU is still widely used, and ALE is still marginal. But the project has succeeded in pushing validity into the broader conversation about how to fix benchmarks. In April, Reuel quietly joined a new research group hosted by Hugging Face, the University of Edinburgh, and EleutherAI, where she’ll develop her ideas on validity and AI model evaluation with other figures in the field. (An official announcement is expected later this month.) 

Irene Solaiman, Hugging Face’s head of global policy, says the group will focus on building valid benchmarks that go beyond measuring straightforward capabilities. “There’s just so much hunger for a good benchmark off the shelf that already works,” Solaiman says. “A lot of evaluations are trying to do too much.”

Increasingly, the rest of the industry seems to agree. In a paper in March, researchers from Google, Microsoft, Anthropic, and others laid out a new framework for improving evaluations—with validity as the first step. 

“AI evaluation science must,” the researchers argue, “move beyond coarse grained claims of ‘general intelligence’ towards more task-specific and real-world relevant measures of progress.” 

Measuring the “squishy” things

To help make this shift, some researchers are looking to the tools of social science. A February position paper argued that “evaluating GenAI systems is a social science measurement challenge,” specifically unpacking how the validity systems used in social measurements can be applied to AI benchmarking. 

The authors, largely employed by Microsoft’s research branch but joined by academics from Stanford and the University of Michigan, point to the standards that social scientists use to measure contested concepts like ideology, democracy, and media bias. Applied to AI benchmarks, those same procedures could offer a way to measure concepts like “reasoning” and “math proficiency” without slipping into hazy generalizations.

In the social science literature, it’s particularly important that metrics begin with a rigorous definition of the concept measured by the test. For instance, if the test is to measure how democratic a society is, it first needs to establish a definition for a “democratic society” and then establish questions that are relevant to that definition. 

To apply this to a benchmark like SWE-Bench, designers would need to set aside the classic machine learning approach, which is to collect programming problems from GitHub and create a scheme to validate answers as true or false. Instead, they’d first need to define what the benchmark aims to measure (“ability to resolve flagged issues in software,” for instance), break that into subskills (different types of problems or types of program that the AI model can successfully process), and then finally assemble questions that accurately cover the different subskills.
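
In code terms, the difference is between a flat pile of pass/fail problems and an evaluation whose items are explicitly mapped to the subskills the benchmark claims to cover. The sketch below is a hypothetical illustration of that validity-first layout; the subskill names and Item fields are invented for the example.

```python
# Hypothetical sketch of a validity-first benchmark layout: every item is tagged
# with the subskill it is meant to probe, so results can be reported per subskill
# instead of as a single aggregate number.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Item:
    prompt: str
    subskill: str  # e.g. "bug localization", "API misuse", "dependency upgrade"
    passed: bool   # outcome of running the model under test on this item

def report(items: list[Item]) -> dict[str, float]:
    by_skill: dict[str, list[bool]] = defaultdict(list)
    for item in items:
        by_skill[item.subskill].append(item.passed)
    return {skill: sum(results) / len(results) for skill, results in by_skill.items()}

items = [
    Item("fix off-by-one in parser", "bug localization", True),
    Item("handle deprecated API call", "API misuse", False),
    Item("bump library version and fix breakage", "dependency upgrade", True),
]
print(report(items))  # per-subskill scores rather than one opaque total
```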

It’s a profound change from how AI researchers typically approach benchmarking—but for researchers like Jacobs, a coauthor on the February paper, that’s the whole point. “There’s a mismatch between what’s happening in the tech industry and these tools from social science,” she says. “We have decades and decades of thinking about how we want to measure these squishy things about humans.”

Even though the idea has made a real impact in the research world, it’s been slow to influence the way AI companies are actually using benchmarks. 

The last two months have seen new model releases from OpenAI, Anthropic, Google, and Meta, and all of them lean heavily on multiple-choice knowledge benchmarks like MMLU—the exact approach that validity researchers are trying to move past. After all, model releases are, for the most part, still about showing increases in general intelligence, and broad benchmarks continue to be used to back up those claims. 

For some observers, that’s good enough. Benchmarks, Wharton professor Ethan Mollick says, are “bad measures of things, but also they’re what we’ve got.” He adds: “At the same time, the models are getting better. A lot of sins are forgiven by fast progress.”

For now, the industry’s long-standing focus on artificial general intelligence seems to be crowding out a more focused validity-based approach. As long as AI models can keep growing in general intelligence, then specific applications don’t seem as compelling—even if that leaves practitioners relying on tools they no longer fully trust. 

“This is the tightrope we’re walking,” says Hugging Face’s Solaiman. “It’s too easy to throw the system out, but evaluations are really helpful in understanding our models, even with these limitations.”

Russell Brandom is a freelance writer covering artificial intelligence. He lives in Brooklyn with his wife and two cats.

This story was supported by a grant from the Tarbell Center for AI Journalism.