Two AI Metrics Diverged: Will it Make All the Difference?

Key Takeaways

  • The paper asks whether the gap between powerful, expensive frontier models and cheaper, accessible models will keep widening or eventually shrink.
  • As exponential compute scaling continues, will the capabilities of frontier AI models outstrip what is accessible to developers on a small fixed budget?
  • Or will capabilities converge, with "meek models inheriting the earth"?
  • Building on Gundlach et al. (2025b), the authors show that the answer depends on how we value and measure AI capabilities.
Paper Abstract

As exponential compute scaling continues, will the capabilities of frontier AI models outstrip what is accessible to developers on a small fixed budget? Or will capabilities converge, with "meek models inheriting the earth"? Building on Gundlach et al. (2025b), we show that the answer depends on how we value and measure AI capabilities. We discuss conventional performance measures and show that, while validation loss shows a shrinking gap, on other metrics frontier models grow their lead forever. Classifying performance metrics by their functional forms in relation to training (and inference) compute, we provide tight mathematical conditions for determining which metrics favor meek models, and show that bounded performance metrics always do. But careful interpretation of performance metrics is essential: we show that many common bounded metrics have closely-related counterpart metrics that are unbounded (and vice versa). Determining the apt metric in a domain is a prerequisite for policy, since bounded and unbounded metrics may suggest opposing policy responses. If a particular capability -- like software engineering, synthetic biology, or rhetorical persuasiveness -- is unbounded when measured in the terms we care about, frontier-level capability will likely be concentrated in the hands of a few wealthy actors. Conversely, if that capability is instead bounded, frontier-level capabilities proliferate through meek models into the hands of the many.

Two AI Metrics Diverged: Will it Make All the Difference? explores a fundamental question in AI governance: will the gap between powerful, expensive frontier models and cheaper, accessible models continue to widen, or will it eventually shrink? The authors argue that the answer is not a fixed technological certainty, but rather a result of how we choose to measure AI performance. By classifying metrics based on their mathematical behavior, the paper demonstrates that our choice of measurement can dictate whether we expect a future of concentrated power or widespread, democratized capability.

The "Meek" vs. "Mighty" Divide

The authors categorize performance metrics into two types: "meek" and "mighty." A metric is "meek" if the performance gap between a frontier model (trained with massive, exponentially increasing compute) and a "meek" model (trained on a small fixed budget) eventually closes. A "mighty" metric, by contrast, is one where the frontier model maintains or grows its lead indefinitely.
The paper states a clear mathematical rule: any bounded metric (such as a percentage-based benchmark capped at 100%) is meek. Because such a metric has a ceiling, the frontier model runs into diminishing returns as it nears the top while the meek model keeps catching up, so the performance gap vanishes over time.
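
The bounded-metric rule can be illustrated with a toy calculation. The sketch below is not the paper's model. It assumes a hypothetical power-law error curve, assumes that a fixed budget buys twice as much effective compute each year, and assumes that frontier compute quadruples each year from a 100x head start. All of these numbers are invented for illustration.

```python
# Toy illustration only: the power-law error curve and every growth rate
# below are assumptions of this sketch, not values taken from the paper.

def error_rate(compute: float, alpha: float = 0.5) -> float:
    """Hypothetical error curve: error falls as compute ** -alpha."""
    return compute ** -alpha


def accuracy_gap(year: int) -> float:
    """Frontier accuracy minus meek accuracy after `year` years."""
    meek_compute = 10.0 * 2.0 ** year        # fixed budget, cheaper compute each year (assumed)
    frontier_compute = 1000.0 * 4.0 ** year  # growing budget on top of cheaper compute (assumed)
    meek_accuracy = 1.0 - error_rate(meek_compute)
    frontier_accuracy = 1.0 - error_rate(frontier_compute)
    return frontier_accuracy - meek_accuracy


years = list(range(0, 21, 5))
gaps = [accuracy_gap(year) for year in years]
for year, gap in zip(years, gaps):
    print(f"year {year:2d}: accuracy gap = {gap:.6f}")
```

Under these assumptions the frontier model is always ahead, yet the accuracy gap shrinks every period because both accuracies are pressed against the same ceiling of 1.0.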

Why Measurement Choice Matters

The core challenge is that many common metrics are "meek" in their standard form but have "mighty" counterparts that reflect different real-world priorities. For example, a benchmark measuring a coding agent’s accuracy on a fixed test is bounded and therefore meek. However, if a user cares about the "number of nines of reliability" (99%, 99.9%, 99.99%, and so on, where each added nine is a tenfold reduction in error rate), the metric becomes unbounded and mighty.
This creates a significant policy dilemma. If policymakers use a meek metric to evaluate AI progress, they might conclude that frontier-level capabilities will naturally proliferate to everyone, suggesting that regulation is unnecessary. If they use a mighty metric, they might conclude that frontier models will always remain far ahead of the pack, suggesting that power will remain concentrated in the hands of a few wealthy actors.
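
To make the accuracy versus nines-of-reliability contrast concrete, here is a minimal sketch. It defines nines as the negative base-10 logarithm of the error rate, which is one common way to formalize the idea and is an assumption here rather than the paper's stated definition. The meek and frontier accuracies are hypothetical.

```python
import math


def nines(accuracy: float) -> float:
    """Number of nines of reliability: -log10 of the error rate (assumed definition)."""
    return -math.log10(1.0 - accuracy)


# Hypothetical (meek accuracy, frontier accuracy) pairs at three points in time.
snapshots = [(0.99, 0.999), (0.999, 0.99999), (0.9999, 0.9999999)]

accuracy_gaps = [frontier - meek for meek, frontier in snapshots]
nines_gaps = [nines(frontier) - nines(meek) for meek, frontier in snapshots]

for (meek, frontier), acc_gap, nine_gap in zip(snapshots, accuracy_gaps, nines_gaps):
    print(f"meek={meek}, frontier={frontier}: "
          f"accuracy gap={acc_gap:.7f}, nines gap={nine_gap:.2f}")
```

The same pair of models looks nearly tied on accuracy and increasingly far apart on nines, which is the sense in which a bounded metric and its unbounded counterpart can point in opposite directions.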

The Role of Utility Functions

The authors emphasize that the "correct" metric depends on the user's utility function—how much a specific gain in capability actually matters for a real-world goal. A software engineer who only needs a model to pass a specific test within a set timeframe will see the world through a meek lens, where the performance gap between models feels small. Another engineer, tasked with maximizing the complexity of a project within a strict error tolerance, will see the world through a mighty lens, where the frontier model’s superior performance is always distinct and valuable.
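
The two engineers can be sketched as two utility functions applied to the same models. This is an illustration, not the paper's formalism. It assumes each project step succeeds independently with the same per-step reliability, and the function names, the 0.99 pass bar, and the 5% error tolerance are all hypothetical.

```python
import math


def passes_fixed_test(step_reliability: float, required: float = 0.99) -> bool:
    """Bounded, pass/fail utility: every model at or above the bar is equally useful."""
    return step_reliability >= required


def max_project_steps(step_reliability: float, error_tolerance: float = 0.05) -> int:
    """Largest n such that step_reliability ** n >= 1 - error_tolerance.

    Assumes independent steps, so whole-project success is step_reliability ** n.
    """
    return math.floor(math.log(1.0 - error_tolerance) / math.log(step_reliability))


for reliability in (0.99, 0.999, 0.9999):
    print(f"reliability={reliability}: "
          f"passes test={passes_fixed_test(reliability)}, "
          f"max project steps={max_project_steps(reliability)}")
```

For the first engineer, all three models are interchangeable once they clear the bar. For the second, each extra nine of reliability allows a project roughly ten times larger, so the better model never stops being worth more.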

Implications for Governance

The paper concludes that there is no single "correct" way to measure AI progress. Instead, analysts and policymakers must be aware that their choice of metric implicitly assumes a specific future. If a capability—such as synthetic biology or rhetorical persuasion—is measured in terms that are unbounded, we should expect frontier-level capabilities to remain concentrated. If the capability is bounded, we should expect those powers to diffuse widely. Understanding this distinction is a prerequisite for any effective policy response, as bounded and unbounded metrics can lead to diametrically opposed conclusions about the future of AI.
