When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models
This paper investigates when combining multiple large language models (LLMs) through routing, voting, or mixture-of-agents actually helps. Although these methods are widely used to improve performance, the author shows that the achievable gain is strictly bounded by a "co-failure ceiling": the rate at which every model in the pool fails on the same query. This quantity is frequently overlooked in favor of less informative diagnostics.
The Problem with Current Diagnostics
Practitioners typically use the average pairwise error correlation ($\rho$) to judge whether a group of models will work well together. The paper proves that this metric is insufficient because it does not determine the rate at which all models fail simultaneously ($\beta$). Two error patterns can share the same pairwise correlation while having very different rates of total system failure. Consequently, relying on $\rho$ alone can misstate how much performance gain is actually possible when combining models.
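A minimal sketch of this non-identifiability, using two hand-built toy error patterns (not data from the paper) for three models over eight queries. The function names are illustrative. Both patterns give every model a 25% error rate and every model pair the same error correlation, yet only the second contains a query that all three models miss.

```python
from itertools import combinations


def mean_pairwise_error_correlation(errors):
    """Average Pearson correlation between the error columns of each model pair."""
    n_models = len(errors[0])
    cols = [[row[j] for row in errors] for j in range(n_models)]

    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
        vx = sum((a - mx) ** 2 for a in x) / n
        vy = sum((b - my) ** 2 for b in y) / n
        return cov / (vx * vy) ** 0.5

    pairs = list(combinations(range(n_models), 2))
    return sum(pearson(cols[i], cols[j]) for i, j in pairs) / len(pairs)


def co_failure_rate(errors):
    """Fraction of queries on which every model is wrong (beta)."""
    return sum(all(row) for row in errors) / len(errors)


# Rows are queries, columns are models; 1 means the model answered incorrectly.
# Pattern A: errors always hit exactly two models, never all three.
pattern_a = [(1, 1, 0), (1, 0, 1), (0, 1, 1)] + [(0, 0, 0)] * 5
# Pattern B: errors hit one model at a time, plus one query that defeats all three.
pattern_b = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)] + [(0, 0, 0)] * 4

rho_a = mean_pairwise_error_correlation(pattern_a)
rho_b = mean_pairwise_error_correlation(pattern_b)
beta_a = co_failure_rate(pattern_a)
beta_b = co_failure_rate(pattern_b)

print(f"pattern A: rho={rho_a:.3f}, beta={beta_a:.3f}")  # rho=0.333, beta=0.000
print(f"pattern B: rho={rho_b:.3f}, beta={beta_b:.3f}")  # rho=0.333, beta=0.125
```

In this toy case a selection-based system could reach 100% accuracy on pattern A but at most 87.5% on pattern B, even though $\rho$ is identical.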
The Co-Failure Ceiling
The research establishes that for any system whose output is chosen from among the member models' outputs, accuracy cannot exceed $1 - \beta$: on a query where every member is wrong, there is no correct answer to select. The "oracle gain" (the potential improvement over the single best model) therefore cannot exceed $1 - \beta$ minus the accuracy of the best single model. The author provides a "Clopper-Pearson" certificate, based on an exact binomial confidence bound, that lets developers use a small set of graded queries to bound the maximum gain they could achieve before investing time and resources in building a complex routing system.
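One way such a certificate could be computed is sketched below. The paper's exact construction is not reproduced here, so the details are assumptions: a one-sided Clopper-Pearson lower bound on $\beta$ is obtained by bisection on the binomial tail, and the ceiling is bounded above by one minus that value. The function names, the toy error matrix, and the plug-in treatment of the best single model's accuracy are all illustrative.

```python
from math import comb


def binom_tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p). Intended for small graded sets."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def clopper_pearson_lower(k, n, alpha=0.05):
    """One-sided exact (Clopper-Pearson) lower confidence bound on a proportion."""
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(100):  # bisection: the tail probability increases with p
        mid = (lo + hi) / 2
        if binom_tail_ge(k, n, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return lo


def ceiling_certificate(errors, alpha=0.05):
    """Bound the selection ceiling from a graded error matrix.

    errors: rows are queries, columns are models, 1 = wrong answer.
    """
    n = len(errors)
    k = sum(all(row) for row in errors)  # queries where every model failed
    beta_lower = clopper_pearson_lower(k, n, alpha)
    best_single = max(1 - sum(col) / n for col in zip(*errors))
    ceiling_upper = 1 - beta_lower
    return {
        "beta_hat": k / n,
        "beta_lower": beta_lower,
        "ceiling_upper": ceiling_upper,
        "best_single_acc": best_single,
        # Assumption: plug-in difference; sampling error in best_single is ignored.
        "gain_upper_plugin": ceiling_upper - best_single,
    }


# Toy graded set (hypothetical): 20 queries, 3 models.
errors = (
    [(1, 1, 1)] * 4 + [(1, 0, 0)] * 3 + [(0, 1, 0)] * 2
    + [(0, 0, 1)] * 2 + [(0, 0, 0)] * 9
)
cert = ceiling_certificate(errors)
for name, value in cert.items():
    print(f"{name}: {value:.3f}")
```

Because the bound holds with probability at least $1 - \alpha$ over the sampled queries, a small `gain_upper_plugin` is a cheap signal that orchestration has little headroom on that task.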
Findings Across 67 Models
By analyzing 67 frontier models from 21 different providers, the study found that the "all-models-wrong" rate is substantially higher than standard statistical dependence models (such as the Gaussian copula) predict. Even with sophisticated calibration, such dependence models underprice the tail of collective failures by a factor of roughly 2.5 on open-ended mathematics. The research also highlights that adding more models to a pool does not necessarily lead to better results; instead, performance gains are driven by models that fail on different types of questions.
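To make the copula comparison concrete, here is a sketch of how a Gaussian copula prediction for the all-models-wrong rate can be computed. It assumes the simplest variant, a one-factor copula with a single shared latent correlation, which may differ from the calibration used in the paper. The function name, the error rates, the latent correlation, and the observed rate are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def copula_all_wrong_rate(error_rates, latent_corr, grid=4001, z_max=8.0):
    """P(all models wrong) under a one-factor Gaussian copula.

    Model i errs when sqrt(r) * Z + sqrt(1 - r) * eps_i falls below the
    threshold inv_cdf(p_i), where Z and eps_i are independent standard normals.
    The integral over the shared factor Z uses the trapezoid rule.
    """
    if not 0.0 <= latent_corr < 1.0:
        raise ValueError("latent_corr must be in [0, 1)")
    thresholds = [STD_NORMAL.inv_cdf(p) for p in error_rates]
    a, b = sqrt(latent_corr), sqrt(1.0 - latent_corr)
    step = 2 * z_max / (grid - 1)
    total = 0.0
    for i in range(grid):
        z = -z_max + i * step
        joint = STD_NORMAL.pdf(z)
        for t in thresholds:
            # Conditional on Z = z, the models fail independently.
            joint *= STD_NORMAL.cdf((t - a * z) / b)
        weight = 0.5 if i in (0, grid - 1) else 1.0
        total += weight * joint * step
    return total


# Hypothetical inputs, not values from the paper.
rates = [0.30, 0.35, 0.40]
predicted = copula_all_wrong_rate(rates, latent_corr=0.5)
observed_beta = 0.20  # assumed empirical all-models-wrong rate
print(f"copula prediction: {predicted:.4f}")
print(f"observed / predicted: {observed_beta / predicted:.2f}")
```

A ratio well above 1 would indicate that the copula underprices collective failure for that pool, which is the kind of gap the study reports.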
Practical Takeaways
The study concludes that on many checkable tasks, combining models rarely outperforms the single best model unless there is a very strong, query-level routing signal. The "realizable" gain from routing is often near zero, not because the routing technology is weak, but because the prompt itself often lacks the information needed to predict which model will succeed when the frontier models disagree. The paper therefore suggests that developers measure the co-failure rate of their model pool to decide whether orchestration is worth the effort, rather than assuming that diversity in the pool will automatically translate into higher accuracy.
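The distinction between oracle gain and realizable gain can be sketched with a small accounting helper. Everything here is illustrative: the function name, the toy error matrix, and the two hypothetical routers (one with no query-level signal, one with perfect hindsight) are assumptions rather than the paper's evaluation setup.

```python
def routing_report(errors, routed_model):
    """Compare a router's realized accuracy with the best single model and the oracle.

    errors: rows are queries, columns are models, 1 = wrong answer.
    routed_model: for each query, the index of the model the router picked.
    """
    n = len(errors)
    n_models = len(errors[0])
    single_acc = [1 - sum(row[j] for row in errors) / n for j in range(n_models)]
    best_single = max(single_acc)
    oracle = 1 - sum(all(row) for row in errors) / n  # equals 1 - beta
    realized = sum(1 - row[m] for row, m in zip(errors, routed_model)) / n
    return {
        "best_single": best_single,
        "oracle": oracle,
        "realized": realized,
        "oracle_gain": oracle - best_single,
        "realized_gain": realized - best_single,
    }


# Toy data (hypothetical): 10 queries, 2 models.
errors = [(0, 0)] * 5 + [(1, 0)] * 2 + [(0, 1)] * 2 + [(1, 1)] * 1

# A router with no query-level signal: it always sends the query to model 0.
blind = routing_report(errors, [0] * len(errors))
# A hindsight router: it picks a correct model whenever one exists.
hindsight = routing_report(
    errors, [row.index(0) if 0 in row else 0 for row in errors]
)

print(f"oracle gain:            {blind['oracle_gain']:.2f}")
print(f"blind router gain:      {blind['realized_gain']:.2f}")
print(f"hindsight router gain:  {hindsight['realized_gain']:.2f}")
```

A real router sits somewhere between these two extremes, and how close it gets to the hindsight case depends on how much the prompt reveals about which model will succeed.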