Key Takeaways

  • We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology.
  • On these tasks, more capable models produce worse distributional forecasts.
  • A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put.
  • A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect.
Paper Abstract

We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control, and replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation. A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put. A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect. Domain knowledge does not reliably rescue calibration. This inverse scaling does not appear on the single-threshold metrics common in LLM forecasting benchmarks: single-threshold scoring at conventional cutoffs misses the upper-tail cost, while tail-inclusive scoring reverses the sign of the capability-accuracy relationship on the same outputs. We recommend that LLM forecasting evaluations use continuous (and unbounded) measures of accuracy alongside bounded binary threshold metrics.

Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
This research investigates a counterintuitive phenomenon: as language models become more capable, they perform worse at certain types of forecasting. Larger and more advanced models typically outperform smaller ones, but this study identifies a specific "inverse scaling" effect on time series that combine superlinear growth with a tail risk of regime change, as in financial bubbles, hyperinflation, or disease outbreaks. The authors show that more capable models extrapolate the growth trend more aggressively, which leads to large errors when the trend breaks.

The Problem with Over-Extrapolation

The core issue identified by the researchers is how models handle the "upper tail" of their predictive distributions. When a model observes a period of rapid, superlinear growth, it tends to project that growth forward aggressively. This is accurate if the trend continues, but it becomes a liability when a regime change occurs, such as a market crash or a public health intervention. More capable models leave their lower-tail estimates roughly where they were and push their upper-tail estimates higher to match the growth trajectory. When the crash happens, these elevated upper-tail predictions sit far above the actual outcome, and the forecast distribution scores poorly as a whole.
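A per-quantile decomposition of this kind can be sketched with the pinball (quantile) loss, a standard scoring rule for quantile forecasts. This is a minimal illustration under stated assumptions, not the paper's procedure: the two forecasters, their quantile values, and the outcome are invented numbers, and the summary does not say which loss the authors decompose.

```python
def pinball_loss(quantile_level, predicted, actual):
    """Pinball (quantile) loss for a single predicted quantile."""
    diff = actual - predicted
    return max(quantile_level * diff, (quantile_level - 1.0) * diff)


def per_quantile_losses(forecast, actual):
    """Score each quantile of a forecast separately."""
    return {q: pinball_loss(q, value, actual) for q, value in forecast.items()}


# Hypothetical quantile forecasts (illustrative numbers, not from the paper).
# Both share the same lower tail; the second pushes its upper tail upward
# to follow an extrapolated growth trend.
cautious = {0.1: 80.0, 0.5: 120.0, 0.9: 200.0}
extrapolating = {0.1: 80.0, 0.5: 150.0, 0.9: 600.0}

# Assumed outcome after a regime change: growth has stopped.
actual = 90.0

losses_cautious = per_quantile_losses(cautious, actual)
losses_extrapolating = per_quantile_losses(extrapolating, actual)

for q in sorted(cautious):
    print(q, round(losses_cautious[q], 3), round(losses_extrapolating[q], 3))
```

In this toy case the two forecasters tie at the 0.1 quantile, and the gap between them is largest at the 0.9 quantile. Scoring each quantile separately is what makes it possible to see where in the distribution a loss comes from.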

Testing Through Simulation and Real-World Data

To test this, the authors built a contamination-free benchmark called ForecastBench-Sim (FBSim), which uses data from strategy-game simulations to test forecasting under uncertainty. The pattern also appears when forecasting synthetic SIR epidemics paired with a matched linear control, and it replicates on real-world historical data, including COVID-19 incidence, measles, U.S. housing prices during the 2003–2006 bubble, and various hyperinflationary episodes. Across these settings the pattern was consistent: more capable models performed worse at long-range forecasting because they were more prone to "over-committing" to a growth trend that later broke.
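The abstract mentions synthetic SIR epidemics with a matched linear control. The sketch below shows, under stated assumptions, why an SIR curve has the structure the paper cares about: early growth is superlinear, then the curve peaks and declines. The parameter values, the discrete-time update, and the meaning given to "matched" (a straight line through the first and last points of the observed window) are all assumptions made for illustration, not details from the paper.

```python
def simulate_sir(beta, gamma, population, initial_infected, steps):
    """Discrete-time SIR model; returns the infected count at each step."""
    susceptible = population - initial_infected
    infected = float(initial_infected)
    history = [infected]
    for _ in range(steps):
        new_infections = beta * susceptible * infected / population
        new_recoveries = gamma * infected
        susceptible -= new_infections
        infected += new_infections - new_recoveries
        history.append(infected)
    return history


def linear_control(series, window):
    """ASSUMED control: a line through series[0] and series[window - 1]."""
    slope = (series[window - 1] - series[0]) / (window - 1)
    return [series[0] + slope * t for t in range(len(series))]


# Hypothetical parameters, chosen only so that the epidemic peaks mid-run.
population = 10_000
infected = simulate_sir(beta=0.4, gamma=0.1, population=population,
                        initial_infected=10, steps=80)

peak_step = max(range(len(infected)), key=infected.__getitem__)
window = peak_step // 2  # the forecaster only sees the early growth phase
control = linear_control(infected, window)

# Naive extrapolation: carry the last observed growth ratio to the final step.
ratio = infected[window - 1] / infected[window - 2]
naive_final = infected[window - 1] * ratio ** (len(infected) - window)

print("peak step:", peak_step)
print("actual final infected:", round(infected[-1], 1))
print("naive extrapolation exceeds population:", naive_final > population)
```

The naive extrapolation stands in for an aggressive upper-tail forecast: it continues the observed growth rate past the point where the epidemic turns over, so it ends up above the entire population while the actual series has already declined.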

Why Standard Metrics Mask the Failure

A critical finding of the paper is that this failure is often invisible to standard evaluation methods. Most existing benchmarks use "single-threshold" metrics, which essentially ask the model a binary question (e.g., "Will the value be above X?"). Under these metrics, more capable models often appear to be improving. However, when the researchers used continuous, "tail-inclusive" scoring rules, which evaluate the entire probability distribution of a forecast, the relationship between capability and accuracy reversed on the same outputs. This suggests that current benchmarks may fail to detect how models behave during high-stakes, volatile events.
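The sign reversal can be reproduced in a toy setting. The sketch below scores the same two sample-based forecasts twice: once with a Brier score on a single threshold question, and once with the continuous ranked probability score (CRPS), which is unbounded and covers the whole distribution. Both scoring rules are standard choices swapped in for illustration, since the summary does not name the paper's exact metrics. The forecasts, the threshold, and the outcome are invented numbers.

```python
def prob_above(samples, threshold):
    """Fraction of forecast samples strictly above the threshold."""
    return sum(1 for x in samples if x > threshold) / len(samples)


def brier_score(prob_exceeds, exceeded):
    """Squared error of a probability against a binary outcome."""
    return (prob_exceeds - (1.0 if exceeded else 0.0)) ** 2


def crps_from_samples(samples, actual):
    """CRPS of the empirical distribution: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    accuracy = sum(abs(x - actual) for x in samples) / n
    spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return accuracy - 0.5 * spread


# Hypothetical forecast samples (illustrative, not from the paper).
# model_b is tight around the outcome but carries one extreme upper-tail draw.
model_a = [50.0, 100.0, 150.0, 200.0, 250.0]
model_b = [70.0, 85.0, 95.0, 98.0, 3000.0]
actual = 95.0
threshold = 100.0  # assumed cutoff for the binary question

brier_a = brier_score(prob_above(model_a, threshold), actual > threshold)
brier_b = brier_score(prob_above(model_b, threshold), actual > threshold)
crps_a = crps_from_samples(model_a, actual)
crps_b = crps_from_samples(model_b, actual)

print("Brier (lower is better):", round(brier_a, 4), round(brier_b, 4))
print("CRPS  (lower is better):", round(crps_a, 2), round(crps_b, 2))
```

On the threshold question, model_b scores better, because the single extreme draw only counts as one sample above the cutoff. Under CRPS the distance of that draw from the outcome is counted in full, so the ranking flips.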

The Limits of Domain Knowledge

The researchers also explored whether simply telling a model what it is forecasting (e.g., identifying the data as a disease outbreak or an economic crisis) would fix the problem. They found that domain knowledge has an inconsistent effect. It helped in some cases but failed to stop the over-extrapolation in others, such as hyperinflation. Even when models correctly identified that a crisis was occurring, they often "weighed and discarded" that information and prioritized the aggressive growth trend in their final forecast. The authors conclude that scale and post-training alone are unlikely to solve this, and they recommend that future evaluations include continuous, unbounded measures of accuracy alongside bounded binary threshold metrics.
