Key Takeaways

  • Multi-model LLM systems such as routing, voting, cascades, fusion, and mixture-of-agents are used to beat single-model accuracy.
  • We show that their gain is capped by a quantity the field rarely reports.
  • For any policy whose output is one member model's answer, accuracy cannot exceed one minus beta, where beta is the rate at which every model is wrong on the same query.
Paper Abstract

Multi-model LLM systems such as routing, voting, cascades, fusion, and mixture-of-agents are used to beat single-model accuracy. We show that their gain is capped by a quantity the field rarely reports. For any policy whose output is one member model's answer, accuracy cannot exceed one minus beta, where beta is the rate at which every model is wrong on the same query. In contrast, the usual diagnostic, average pairwise error correlation rho, cannot identify beta: error laws with identical marginals and pairwise correlations can have different all-wrong rates. A Clopper-Pearson bound on beta gives a finite-sample certificate on the largest gain any router, vote, or cascade could deliver before training a router. Across 67 models from 21 providers, a tetrachoric-calibrated single-factor model still underprices the all-wrong tail: on open-ended mathematics, observed beta is 0.052 versus 0.023 under the full 67-model Gaussian copula, about 2.5 times underpricing, with 90 percent CI 1.7 to 3.4 and k equals 17. The effect recurs on execution-graded code, where beta is 0.079. Re-asking the same GPQA-Diamond questions in free-response rather than multiple-choice form reopens the tail, with beta 0.127 and a five-judge panel with kappa 0.73 to 0.92, locating co-failure in answer format rather than subject. At matched quality, low-rho heterogeneous ensembles beat high-rho Self-MoA, but on checkable tasks in our pool, combining models rarely beats the single best model without a strong query-level routing signal. Gains come from models failing on different questions, not from adding more models.

When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models
This paper investigates the effectiveness of combining multiple Large Language Models (LLMs) through techniques like routing, voting, and mixture-of-agents. While these methods are commonly used to improve performance, the author demonstrates that their potential for success is strictly limited by a "co-failure ceiling." This ceiling is determined by how often all models in a system fail on the same query, a metric that is frequently overlooked in favor of less informative diagnostics.

The Problem with Current Diagnostics

Practitioners typically use the average pairwise error correlation ($\rho$) to judge whether a group of models will work well together. The paper shows that this metric is insufficient because it cannot identify the rate at which all models fail simultaneously ($\beta$): error laws with identical marginal error rates and identical pairwise correlations can still have different all-wrong rates. Relying on $\rho$ alone can therefore misstate how much gain is actually possible from combining models.
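The non-identifiability claim can be checked on a toy case. The sketch below is an illustrative construction, not taken from the paper: it builds two joint error laws over three hypothetical models by shifting probability mass between even-parity and odd-parity error patterns, which leaves every single-model error rate and every two-model joint error rate unchanged while moving the all-wrong rate.

```python
from fractions import Fraction
from itertools import product

def parity_law(shift):
    """Joint law over 3 binary error indicators (1 = model is wrong).

    Starts from the uniform law and adds `shift` to each even-parity
    pattern and subtracts it from each odd-parity pattern. Single-model
    and two-model marginals are unaffected by this shift.
    """
    return {e: Fraction(1, 8) + shift * (-1) ** sum(e)
            for e in product((0, 1), repeat=3)}

def error_rate(law, i):
    """P(model i is wrong)."""
    return sum(p for e, p in law.items() if e[i] == 1)

def both_wrong(law, i, j):
    """P(models i and j are both wrong)."""
    return sum(p for e, p in law.items() if e[i] == 1 and e[j] == 1)

def all_wrong(law):
    """The co-failure rate beta for this toy law."""
    return law[(1, 1, 1)]

law_a = parity_law(Fraction(1, 8))    # mass only on even-parity patterns
law_b = parity_law(Fraction(-1, 8))   # mass only on odd-parity patterns

pairs = [(0, 1), (0, 2), (1, 2)]
same_marginals = all(error_rate(law_a, i) == error_rate(law_b, i)
                     for i in range(3))
same_pairwise = all(both_wrong(law_a, i, j) == both_wrong(law_b, i, j)
                    for i, j in pairs)

print(same_marginals, same_pairwise)        # True True
print(all_wrong(law_a), all_wrong(law_b))   # 0 1/4
```

Because the marginals and pairwise joints match, the pairwise error correlations match as well (they are zero in both laws), yet one pool never fails collectively and the other fails collectively on a quarter of queries.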

The Co-Failure Ceiling

The research establishes that for any system whose output is the answer of one member model, accuracy cannot exceed $1 - \beta$. The "oracle gain" (the potential improvement over the single best model) is therefore capped by how often every model fails together. The author also provides a Clopper-Pearson certificate: a finite-sample bound on $\beta$, computed from a small set of graded queries, that limits the largest gain any router, vote, or cascade could deliver before time and resources are invested in building a complex routing system.
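A minimal sketch of the ceiling and its finite-sample interval follows. The grading matrix `correct` is made-up data, the helper names are hypothetical, and the 90 percent level is only an illustrative default. This summary does not say which side of the interval the paper's certificate uses, so both limits are reported. The Clopper-Pearson limits are found by bisection on the exact binomial tail, using only the standard library.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def bisect_decreasing(f, iters=100):
    """Root on [0, 1] of a function that decreases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.10):
    """Exact two-sided (1 - alpha) interval for a binomial proportion."""
    tail = alpha / 2
    # Lower limit solves P(X >= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else bisect_decreasing(
        lambda p: tail - (1 - binom_cdf(k - 1, n, p)))
    # Upper limit solves P(X <= k | p) = alpha / 2.
    upper = 1.0 if k == n else bisect_decreasing(
        lambda p: binom_cdf(k, n, p) - tail)
    return lower, upper

# Hypothetical grades: correct[q][m] is True if model m got query q right.
correct = [
    [True, True, False], [True, False, False], [False, False, False],
    [True, True, True], [False, True, True], [True, True, True],
    [False, False, False], [True, False, True], [True, True, True],
    [True, True, False],
]

n = len(correct)
k = sum(1 for row in correct if not any(row))   # queries every model missed
beta_hat = k / n
ceiling = 1 - beta_hat                          # best any selection policy can do
best_single = max(sum(col) / n for col in zip(*correct))
max_gain = ceiling - best_single                # in-sample oracle gain
beta_lo, beta_hi = clopper_pearson(k, n)

print(f"beta_hat={beta_hat:.2f} ceiling={ceiling:.2f} max_gain={max_gain:.2f}")
print(f"90% Clopper-Pearson interval for beta: [{beta_lo:.3f}, {beta_hi:.3f}]")
```

With only ten graded queries the interval is wide, which illustrates why the size of the graded set matters before concluding that a router is or is not worth building.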

Findings Across 67 Models

By analyzing 67 frontier models from 21 providers, the study found that the all-models-wrong rate is higher than standard statistical models of correlated errors predict. Even a tetrachoric-calibrated single-factor Gaussian copula underprices the tail of collective failures: on open-ended mathematics, observed $\beta$ is 0.052 versus 0.023 under the copula, which the paper reports as about 2.5 times underpricing. The research also highlights that adding more models to a pool does not by itself improve results; gains come from models failing on different questions.
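To make the copula baseline concrete, here is a simplified sketch of how a single-factor Gaussian copula assigns an all-wrong rate. It assumes one shared correlation `rho` and hypothetical per-model error rates; the paper's tetrachoric calibration over 67 models is not reproduced here. Each model is treated as wrong when a latent score $\sqrt{\rho}\,Z + \sqrt{1-\rho}\,\varepsilon_i$ falls below a threshold set by its error rate, so that conditional on the shared factor $Z$ the errors are independent and the all-wrong probability is a one-dimensional integral.

```python
from math import prod, sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal

def copula_all_wrong(error_rates, rho, grid=4001, span=8.0):
    """All-wrong probability under a one-factor Gaussian copula.

    error_rates: marginal error rate of each model, each in (0, 1).
    rho: shared latent correlation, in [0, 1).
    Integrates over the common factor z with the trapezoid rule.
    """
    thresholds = [STD.inv_cdf(e) for e in error_rates]
    a, b = sqrt(rho), sqrt(1 - rho)
    step = 2 * span / (grid - 1)
    total = 0.0
    for i in range(grid):
        z = -span + i * step
        weight = 0.5 if i in (0, grid - 1) else 1.0
        # Given z, models err independently with these probabilities.
        p_all = prod(STD.cdf((t - a * z) / b) for t in thresholds)
        total += weight * p_all * STD.pdf(z) * step
    return total

rates = [0.3, 0.3, 0.3]  # hypothetical error rates for three models
independent = copula_all_wrong(rates, rho=0.0)
correlated = copula_all_wrong(rates, rho=0.5)
print(f"rho=0.0: {independent:.4f}  rho=0.5: {correlated:.4f}")
```

Comparing a model-implied value of this kind against the empirically observed all-wrong rate is the sort of check behind the underpricing result: if the observed rate is well above the copula's value, the copula is missing part of the joint tail.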

Practical Takeaways

The study concludes that on many checkable tasks, combining models rarely outperforms the single best model unless there is a very strong, query-level routing signal. The "realizable" gain from routing is often near zero, not because the routing technology is weak, but because the prompt itself often lacks the information needed to predict which model will succeed when the frontier models disagree. Ultimately, the paper suggests that developers should focus on identifying the co-failure rate of their model pool to determine if orchestration is worth the effort, rather than assuming that diversity in a model pool will automatically translate into higher accuracy.
