Seeing real value from AI depends on being able to verify its outputs

Seb Murray

Jun 8, 2026 | 5 minute read

What you’ll learn:

AI can now produce work faster than humans can verify it.
Closing that gap is becoming the central economic challenge of the transition to artificial general intelligence.
Firms that can go beyond AI deployment to underwrite the risks of AI outputs will have a competitive advantage.

Artificial intelligence can now generate thousands of lines of code in minutes, and many companies are taking advantage of that capability. However, what AI cannot yet do reliably at scale is ensure that its code is safe, correct, or complete.

Humans try to fill that gap, reviewing outputs line by line, though that is becoming less workable as systems produce more code than any individual can realistically audit. Often, the code ships anyway.

The gap between fast AI output and slower human verification is at the heart of a new paper by MIT Sloan School of Management research scientist Christian Catalini and his co-authors, Xiang Hui of Washington University and Jane Wu, SM ’18, PhD ’22, of UCLA.

In “Some Simple Economics of AGI,” the researchers lay out an economic model of the transition toward artificial general intelligence — AI systems that can operate with broad autonomy across many tasks. The researchers focus on the problem of measurability: whether the outputs of those systems can be reliably checked.

As AI systems become more capable, it is getting harder to verify everything they produce. This will put a cap on how fully the benefits of AGI can be realized in the economy: AI makes it cheap to produce work, but not to judge whether that work is any good.

Verification will become a core part of seeing value from AI

AI has largely been treated as a substitute for labor, on the assumption that cheaper output translates directly into value. But what will distinguish firms is not so much their ability to deploy AI as their ability to stand behind what it produces, the researchers write.

The gap between what AI systems can produce and what can be properly checked is widening. On the SWE-bench AI performance benchmark, the accuracy of AI coding tools rose from 4.4% to 71.7% in a year, and the length of tasks that systems can complete is doubling over short periods, according to the researchers. But there is “scarce capacity” for human verification, as bandwidth continues to be constrained by time and experience, the researchers write.

Verifying AI outputs is no longer just a compliance function but a core part of how value is created from AI, the researchers write. That puts a premium on records of how systems behave — especially where they fail — and on taking responsibility when things go wrong.

Catalini describes this as a shift from software as a service to what he calls “liability as a service.” “The companies that understand the risks and can underwrite them will be the ones that profit,” he said.

The risks of using AI to verify AI and skipping verification altogether

So far, AI adoption has clustered in areas where outputs can be checked quickly: summarizing text, generating images, writing code. But as systems take on longer, higher-risk tasks — and as AI agents act autonomously — checking whether they were done correctly will become more difficult and often take more time, increasing the risk of misplaced trust.

One response from companies is to use AI to check AI, which the researchers call a “tempting shortcut.” But where both systems share the same assumptions, they can reinforce the same errors, creating what Catalini described as a false sense of confidence rather than a real solution.
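A back-of-the-envelope sketch can make the shared-assumptions point concrete. The short Python example below is illustrative only and is not taken from the paper: the function name, the simple probability model, and every number in it (a 10% error rate, a checker that catches 90% of the errors it is able to see, a 50% chance of a shared blind spot) are hypothetical assumptions chosen for the arithmetic.

```python
def undetected_error_rate(error_rate, catch_rate, shared_blind_spot):
    """Share of all outputs that are wrong AND still pass the checker.

    Hypothetical toy model, not the researchers' model:
    - error_rate: probability the generating system's output is wrong
    - catch_rate: probability the checker flags an error it is able to see
    - shared_blind_spot: probability the checker shares the generator's
      mistaken assumption and so cannot see the error at all
    """
    miss_given_error = shared_blind_spot + (1 - shared_blind_spot) * (1 - catch_rate)
    return error_rate * miss_given_error


# Illustrative numbers only.
independent = undetected_error_rate(0.10, 0.90, 0.0)  # checker shares no assumptions
correlated = undetected_error_rate(0.10, 0.90, 0.5)   # checker shares half the blind spots

print(f"independent checker: {independent:.3f} of outputs are undetected errors")
print(f"correlated checker:  {correlated:.3f} of outputs are undetected errors")
```

Under these assumed numbers, the checker with shared blind spots lets through several times as many errors as the independent one, even though both report the same nominal catch rate. That is one way to read the "false sense of confidence" Catalini describes.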

For some companies, competitive pressure and the gap between fast, cheap production and slower verification create the motivation to deploy systems before they are fully checked by humans, according to the researchers. That allows risks to build unnoticed until they are harder to contain.

But when systems are pushed into use before they are fully verified, the consequences can be severe. Catalini pointed to episodes such as the 2010 flash crash in financial markets, where complex automated systems failed in ways that were not fully understood at the time. “If we do not invest in verification, we’re accumulating hidden risk,” Catalini said. “It is technical debt accumulating behind the scenes, and, at some point, it’ll come due.”

In effect, the economy becomes “hollow,” the researchers write: Output surges, but the quality and utility of the output don’t keep pace. The researchers describe this as a “Trojan horse” problem, where unverified output leaks into the economy and is treated as if it were reliable.

Why “human-in-the-loop” strategies may not hold

As demand for verification grows, employees with the right skills are becoming scarcer, the researchers write.

Verification skills depend on experience. As AI is beginning to take over more entry-level work, it is starting to erode the training ground through which workers build that experience — a problem Catalini described as a “missing junior loop.”

“The ladder is breaking,” he said. Early signs are already visible. Employment among younger workers in AI-exposed roles has fallen by around 16%, according to research cited in the paper.

Meanwhile, more senior employees are generating the data used to train the automated systems that could replace them.

Implications of the verification/automation gap

These shifts point to different implications for companies, individuals and policymakers. For all stakeholders, it is important to design ways to verify AI agents’ output at a rate that keeps pace with deployment.

For firms, the challenge is no longer just to deploy AI systems but to manage how they are used in practice. Companies should aim to scale automation only as fast as it can be trusted. That means understanding the risks and limitations of AI systems and taking responsibility for outcomes at the organizational level.
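To see what scaling automation only as fast as it can be trusted might look like in numbers, consider the rough sketch below. It is a toy illustration under stated assumptions, not a model from the paper: the function name and the daily production and review figures are hypothetical.

```python
def unverified_backlog(produced_per_day, verified_per_day, days):
    """Track how many outputs are still unverified at the end of each day.

    Hypothetical toy model: production and verification run at fixed
    daily rates, and the backlog can never drop below zero.
    """
    backlog = 0
    history = []
    for _ in range(days):
        backlog = max(0, backlog + produced_per_day - verified_per_day)
        history.append(backlog)
    return history


# Illustrative numbers only: reviewers can check 200 outputs a day.
fast = unverified_backlog(produced_per_day=500, verified_per_day=200, days=5)
paced = unverified_backlog(produced_per_day=200, verified_per_day=200, days=5)

print("deploying faster than verification:", fast)
print("deploying at the pace of verification:", paced)
```

In the first case the pile of unchecked work grows by 300 items every day. In the second it stays at zero. The assumed rates are arbitrary, but the shape of the result is the point: whenever production outruns verification, unverified output accumulates for as long as the gap persists.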

For individuals, the recommendation is to move away from routine execution and toward directing work with AI, exercising judgment, and taking responsibility for the outcome. “Humans will need to think about how to move up the value chain,” Catalini said.

For policymakers, Catalini’s recommendation is less about slowing development than shaping incentives. He argued that calls to limit AI’s use to only augmenting human work are unlikely to hold in practice, given the competitive pressure to deploy the technology.

The more realistic response is to ensure that verification and safety are built into how these systems are used — for example, by investing in tools to monitor outputs, and bringing humans back into the loop when needed.

Read the research: “Some Simple Economics of AGI”

Christian Catalini is a research scientist at the MIT Sloan School of Management and the founder of the MIT Cryptoeconomics Lab. Previously, he co-founded Lightspark and was a co-creator of Diem (formerly Libra), served as chief economist of the Diem Association, and was head economist at Meta FinTech. His research focuses on blockchain technology and cryptocurrencies; previously, he worked on the economics of equity crowdfunding and startup growth, and the economics of scientific collaboration. He teaches the MIT Sloan Executive Education course AI Adoption: Driving Business Value and Impact.

Xiang Hui is an assistant professor of marketing at Washington University. He studies credible exchange in markets, with a focus on the economics of AI and digital platforms. He examines how trust mechanisms and algorithmic systems shape what firms, consumers, workers, and intermediaries can credibly claim, observe, learn, and verify and how those mechanisms affect platform governance, market performance, and welfare.

Jane Wu, SM ’18, PhD ’22, is an assistant professor of strategy at UCLA. She conducts research at the intersection of innovation, entrepreneurship, and strategy. Her current work focuses on the role of metrics in shaping innovation in firms, and she also examines the strategic choices of entrepreneurs.