Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
This research investigates a counterintuitive phenomenon: on certain forecasting tasks, more capable language models perform worse than less capable ones. Larger and more advanced models typically outperform smaller ones, but this study identifies a specific "inverse scaling" effect on time series that exhibit superlinear growth followed by a sudden, disruptive change, such as financial bubbles, hyperinflation, or disease outbreaks. The authors show that although these models are excellent at identifying and extrapolating trends, they become overconfident that the trend will continue, which leads to large errors when it breaks.
The Problem with Over-Extrapolation
The core issue identified by the researchers is how models handle the "upper tail" of their predictive distributions. When a model observes a period of rapid, superlinear growth, it tends to project that growth forward aggressively. This is accurate if the trend continues, but it becomes a liability when a regime change occurs, such as a market crash or a public health intervention. The models keep their lower-tail estimates anchored but push their upper-tail estimates higher and higher to match the growth trajectory. When the crash happens, these elevated upper-tail predictions sit far above the actual outcome, so the forecast distribution as a whole scores poorly.
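The over-extrapolation mechanism can be illustrated with a toy example. The sketch below is not the authors' setup: the series shape, the log-linear fit, and every name and parameter (`make_series`, `trend_following_quantiles`, `rate`, `crash_factor`) are assumptions chosen for illustration. It builds an exponentially growing series that collapses after the observation window, then forms a forecast whose lower bound is anchored at the last observed value and whose upper bound follows the fitted trend.

```python
import math

def make_series(n_growth=20, n_after=5, rate=0.25, crash_factor=0.3):
    """Hypothetical series: exponential growth, then a sudden drop to a plateau."""
    growth = [math.exp(rate * t) for t in range(n_growth)]
    peak = growth[-1]
    after = [peak * crash_factor for _ in range(n_after)]
    return growth, after

def fit_log_linear(values):
    """Least-squares fit of log(value) = a + b * t; returns (a, b)."""
    n = len(values)
    ts = list(range(n))
    ys = [math.log(v) for v in values]
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    den = sum((t - t_mean) ** 2 for t in ts)
    b = num / den
    a = y_mean - b * t_mean
    return a, b

def trend_following_quantiles(observed, horizon):
    """Assumed forecaster: lower bound anchored at the last observation,
    upper bound obtained by extending the fitted growth trend."""
    a, b = fit_log_linear(observed)
    t_future = len(observed) - 1 + horizon
    lower = observed[-1]
    upper = math.exp(a + b * t_future)
    return lower, upper

growth, after = make_series()
lower, upper = trend_following_quantiles(growth, horizon=len(after))
actual = after[-1]          # value once the regime change has happened
overshoot = upper / actual  # how far the upper bound sits above reality

print(f"lower bound: {lower:.1f}")
print(f"upper bound: {upper:.1f}")
print(f"actual:      {actual:.1f}")
print(f"upper bound / actual: {overshoot:.1f}x")
```

In this toy setting the anchored lower bound misses the post-crash value by a moderate margin, while the trend-following upper bound misses it by roughly an order of magnitude, and the gap widens as `horizon` grows.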
Testing Through Simulation and Real-World Data
To test this, the authors developed a contamination-free benchmark called ForecastBench-Sim (FBSim), which uses data from strategy-game simulations to test forecasting under uncertainty. They also checked their findings against real-world historical data, including COVID-19 incidence, U.S. housing prices during the 2003–2006 bubble, and various hyperinflationary episodes. In every case, the pattern remained consistent: more capable models performed worse at long-range forecasting because they were more prone to "over-committing" to a growth trend that subsequently broke.
Why Standard Metrics Mask the Failure
A critical finding of the paper is that this failure is often invisible to standard evaluation methods. Most existing benchmarks use "single-threshold" metrics, which essentially ask the model a binary question (e.g., "Will the value be above X?"). Under these metrics, more capable models often appear to be improving. However, when the researchers used continuous, "tail-inclusive" scoring rules, which evaluate the entire probability distribution of a forecast, the relationship between capability and accuracy reversed. This suggests that current industry-standard benchmarks may fail to detect how models behave during high-stakes, volatile events.
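A small numerical sketch shows how a single-threshold metric can hide an inflated upper tail. This summary does not name the scoring rules the paper uses, so the choices here are assumptions: the Brier score stands in for a single-threshold metric, and the continuous ranked probability score (CRPS), estimated from forecast samples, stands in for a tail-inclusive rule. The two sample sets, the threshold, and the outcome are made-up illustrative values, not results from the paper.

```python
def brier_single_threshold(samples, threshold, outcome):
    """Brier score for the binary question 'will the value exceed threshold?'.
    The forecast probability is the fraction of samples above the threshold."""
    p_above = sum(1 for s in samples if s > threshold) / len(samples)
    happened = 1.0 if outcome > threshold else 0.0
    return (p_above - happened) ** 2

def crps_from_samples(samples, outcome):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|. Lower is better."""
    n = len(samples)
    mean_abs_error = sum(abs(s - outcome) for s in samples) / n
    mean_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return mean_abs_error - 0.5 * mean_spread

# Two hypothetical forecast distributions with the same lower tail.
cautious = [80.0, 100.0, 120.0, 140.0, 160.0]
aggressive = [80.0, 100.0, 200.0, 400.0, 800.0]  # upper tail chases the trend

threshold = 90.0  # hypothetical binary question: "above 90?"
outcome = 35.0    # hypothetical realized value after the trend breaks

brier_cautious = brier_single_threshold(cautious, threshold, outcome)
brier_aggressive = brier_single_threshold(aggressive, threshold, outcome)
crps_cautious = crps_from_samples(cautious, outcome)
crps_aggressive = crps_from_samples(aggressive, outcome)

print(f"Brier: cautious={brier_cautious:.2f}, aggressive={brier_aggressive:.2f}")
print(f"CRPS:  cautious={crps_cautious:.1f}, aggressive={crps_aggressive:.1f}")
```

Both forecasts place the same share of samples above the threshold, so the binary score cannot tell them apart, while the sample-based CRPS penalizes the aggressive forecast for the mass it places far above the outcome.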
The Limits of Domain Knowledge
The researchers also explored whether simply telling a model what it is forecasting (e.g., identifying the data as a disease outbreak or an economic crisis) would fix the problem. They found that domain knowledge has an inconsistent effect: it helped in some cases but failed to stop the over-extrapolation in others, such as hyperinflation. Even when models correctly identified that a crisis was occurring, they often "weighed and discarded" that information and prioritized the aggressive growth trend in their final forecast. The authors conclude that scale and post-training alone are unlikely to solve this, and they recommend that future evaluations include continuous, unbounded measures of accuracy to check that models are calibrated for real-world use.