"Two AI Metrics Diverged: Will it Make All the Difference?" explores a fundamental question in AI governance: will the gap between powerful, expensive frontier models and cheaper, accessible models continue to widen, or will it eventually shrink? The authors argue that the answer is not a fixed technological certainty but depends on how we choose to measure AI performance. By classifying metrics according to their mathematical behavior, the paper shows that the choice of measurement can determine whether we expect a future of concentrated power or one of widespread, democratized capability.
The "Meek" vs. "Mighty" Divide
The authors categorize performance metrics into two types: "meek" and "mighty." A metric is considered "meek" if the performance gap between a frontier model (trained with massive, exponentially increasing compute) and a "meek" model (trained with constant, smaller compute) eventually closes. A "mighty" metric, by contrast, is one where the frontier model maintains or grows its lead indefinitely.
The paper provides a clear mathematical rule for this: any metric that is bounded (such as a percentage-based benchmark capped at 100%) is inherently meek. Because such a metric has a ceiling, both the frontier model and the less powerful model eventually reach diminishing returns near it, and the performance gap between them vanishes over time.
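One way to make this rule concrete is the squeeze below. The notation and the assumptions are added here for illustration and are not taken from the paper: write $M_f(t)$ and $M_m(t)$ for the frontier and meek scores at time $t$, let $M_{\max}$ be the metric's ceiling, and assume the frontier model never trails the meek one and that the meek score approaches the ceiling over time.

```latex
0 \;\le\; M_f(t) - M_m(t) \;\le\; M_{\max} - M_m(t) \;\longrightarrow\; 0
\quad \text{as } t \to \infty
```

Under those assumptions the frontier's lead is trapped between zero and the meek model's remaining distance to the ceiling, so it has to vanish. An unbounded metric has no $M_{\max}$, so no such squeeze applies.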
Why Measurement Choice Matters
The core challenge is that many common metrics are "meek" in their standard form but have "mighty" counterparts that reflect different real-world priorities. For example, a benchmark measuring a coding agent’s accuracy on a fixed test is bounded and therefore meek. However, if a user cares about the "number of nines of reliability" (99%, 99.9%, 99.99%, and so on, where each additional nine cuts the error rate by a factor of ten), the metric becomes unbounded and mighty.
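The contrast between accuracy and nines can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper `nines` and the accuracy pairs are hypothetical and do not come from the paper. It assumes "number of nines" means the negative base-10 logarithm of the error rate, and it compares a frontier model with a cheaper one at three made-up points in time.

```python
import math

def nines(accuracy: float) -> float:
    """Number of nines of reliability: -log10 of the error rate (assumed definition)."""
    error_rate = 1.0 - accuracy
    if error_rate <= 0.0:
        raise ValueError("accuracy must be below 1.0 for a finite number of nines")
    return -math.log10(error_rate)

# Hypothetical (frontier, cheaper) accuracy pairs at three points in time.
# These numbers are invented for illustration; they are not results from the paper.
snapshots = [
    (0.99, 0.9),
    (0.9999, 0.999),
    (0.999999, 0.99999),
]

# Gap on the bounded metric (accuracy) and on the unbounded one (nines).
accuracy_gaps = [frontier - cheap for frontier, cheap in snapshots]
nines_gaps = [nines(frontier) - nines(cheap) for frontier, cheap in snapshots]

for (frontier, cheap), acc_gap, n_gap in zip(snapshots, accuracy_gaps, nines_gaps):
    print(f"frontier={frontier} cheaper={cheap} "
          f"accuracy gap={acc_gap:.6f} nines gap={n_gap:.2f}")
```

In this toy data the accuracy gap shrinks toward zero while the gap in nines stays at about one, which is the sense in which the same pair of models can look meek on one metric and mighty on the other.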
This creates a significant policy dilemma. If policymakers use a meek metric to evaluate AI progress, they might conclude that frontier-level capabilities will naturally proliferate to everyone, suggesting that regulation is unnecessary. If they use a mighty metric, they might conclude that frontier models will always remain far ahead of the pack, suggesting that power will remain concentrated in the hands of a few wealthy actors.
The Role of Utility Functions
The authors emphasize that the "correct" metric depends on the user's utility function—how much a specific gain in capability actually matters for a real-world goal. A software engineer who only needs a model to pass a specific test within a set timeframe will see the world through a meek lens, where the performance gap between models feels small. Another engineer, tasked with maximizing the complexity of a project within a strict error tolerance, will see the world through a mighty lens, where the frontier model’s superior performance is always distinct and valuable.
Implications for Governance
The paper concludes that there is no single "correct" way to measure AI progress. Instead, analysts and policymakers must be aware that their choice of metric implicitly assumes a specific future. If a capability (such as synthetic biology or rhetorical persuasion) is measured in unbounded terms, we should expect frontier-level capabilities to remain concentrated. If it is measured in bounded terms, we should expect those capabilities to diffuse widely. Understanding this distinction is a prerequisite for any effective policy response, because bounded and unbounded metrics can lead to diametrically opposed conclusions about the future of AI.