Multiagent Protocols with Aggregated Confidence Signals

Key Takeaways

  • Prior work uses confidence within multiagent debate (MAD) to weight messages, trigger debate, or calibrate individual agents, but it never aggregates these into a single confidence for the system itself.
  • Analyzing two estimators, sequence probability and self-report, alongside parametric and non-parametric calibrators, we find that calibration improves F1 for both estimators while AUARC is less reliant on it.
  • We evaluate six homogeneous and heterogeneous debating pairs per benchmark, across five benchmarks and four task types, spanning a range of model capabilities and sizes.
  • This research addresses a critical gap in how multiagent systems, where multiple AI models collaborate to solve a problem, evaluate their own reliability.
Paper Abstract

Confidence is used for reliability, oversight, and a range of downstream decision tasks in Natural Language Processing (NLP), yet no existing method produces or evaluates a confidence for the output of a multiagent system. Prior work uses confidence within multiagent debate (MAD) to weight messages, trigger debate, or calibrate individual agents, but it never aggregates these into a single confidence for the system itself. We introduce three protocols that produce a final answer along with a single aggregated confidence by first transforming raw confidence signals to make them comparable across models, then combining them via soft voting or a probability fusion we call Bayesian fusion. This aggregated confidence is substantially more discriminative (AUARC) than that of the best single agent or the standard debate baselines, while correctness (F1-score) stays stable and recovers the losses MAD incurs on more ambiguous tasks. Analyzing two estimators, sequence probability and self-report, alongside parametric and non-parametric calibrators, we find that calibration improves F1 for both estimators while AUARC is less reliant on it. We evaluate six homogeneous and heterogeneous debating pairs per benchmark, across five benchmarks and four task types, spanning a range of model capabilities and sizes.

Multiagent Protocols with Aggregated Confidence Signals
This research addresses a critical gap in how multiagent systems—where multiple AI models collaborate to solve a problem—evaluate their own reliability. While individual AI models can estimate how confident they are in an answer, there has been no standard way to combine these individual confidence signals into a single, trustworthy score for the entire system. This paper introduces three new protocols that transform individual model confidence into a unified system-level confidence, allowing multiagent systems to provide more reliable and discriminative outputs.

Bridging the Confidence Gap

Current multiagent debate (MAD) methods often struggle because they lack oversight; models may change correct answers to incorrect ones during the debate process. Furthermore, confidence signals from different models are not naturally comparable, as each model has its own internal scale. The authors propose protocols that first normalize these raw confidence signals using learned transformations or thresholds. By making these signals comparable, the system can effectively weigh different perspectives or decide when to trust one agent over another.
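To make the comparability problem concrete, here is a minimal sketch of one of the two raw estimators the paper analyzes, sequence probability. It assumes per-token log-probabilities are available from the model; the function name and the optional length normalization are illustrative assumptions, not details taken from the paper.

```python
import math


def sequence_probability(token_logprobs, length_normalize=False):
    """Raw confidence from per-token log-probabilities.

    Assumption: the model exposes a natural-log probability for each
    generated token. With length_normalize=True the result is the
    geometric mean of the token probabilities, a common variant that
    is not confirmed by the source.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    total = sum(token_logprobs)
    if length_normalize:
        total /= len(token_logprobs)
    return math.exp(total)
```

Because the unnormalized value shrinks as answers get longer and depends on each model's vocabulary and decoding behaviour, two agents can report very different raw numbers for equally reliable answers. That is why a transformation step is needed before any aggregation.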

Three New Protocols

The researchers developed three distinct ways to aggregate confidence:

  • Weighted Stream Voting (WSV): This method treats every response from every agent as a piece of evidence. It assigns a learned weight to each source, calculating a final score by summing the weighted confidence of all agents that agree on a specific answer.

  • Confidence Gating with Aggregation (CGA): This two-step process first uses a "gate" to decide whether an agent should keep its initial answer or switch to its debated one, based on whether its confidence improves. It then combines the surviving answers using a probability fusion the authors call Bayesian fusion.

  • Human-Inspired Debate (HID): This deterministic approach mimics human collaboration. It routes answers based on agreement and confidence levels, using high-confidence agents to "convince" low-confidence ones and triggering debates only when necessary.
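The aggregation ideas behind these protocols can be sketched in a few lines. The code below is an illustrative approximation, not the authors' implementation: the function names are hypothetical, the weights are assumed to have been learned elsewhere, and the fusion rule assumes conditionally independent agents, a uniform prior over labels, and each agent's leftover probability mass spread evenly over the other labels.

```python
from collections import defaultdict


def weighted_soft_vote(votes, weights):
    """Sum weight * confidence per answer and return the top answer.

    votes:   non-empty list of (source, answer, confidence in [0, 1])
    weights: dict source -> non-negative weight (assumed learned elsewhere)
    """
    scores = defaultdict(float)
    for source, answer, confidence in votes:
        scores[answer] += weights.get(source, 1.0) * confidence
    total = sum(scores.values())
    best = max(scores, key=scores.get)
    # Normalise so the scores over candidate answers sum to 1.
    return best, (scores[best] / total if total > 0 else 0.0)


def confidence_gate(initial, debated):
    """Keep the debated (answer, confidence) only if confidence improved."""
    return debated if debated[1] > initial[1] else initial


def bayesian_fusion(votes, labels, eps=1e-6):
    """Naive-Bayes style fusion of (answer, calibrated confidence) pairs.

    Assumptions (not from the source): independent agents, uniform prior,
    and (1 - confidence) split evenly across the other labels.
    """
    k = len(labels)
    if k < 2:
        raise ValueError("need at least two candidate labels")
    posterior = {}
    for label in labels:
        likelihood = 1.0
        for answer, confidence in votes:
            c = min(max(confidence, eps), 1.0 - eps)  # avoid exact 0 or 1
            likelihood *= c if answer == label else (1.0 - c) / (k - 1)
        posterior[label] = likelihood
    norm = sum(posterior.values())
    best = max(posterior, key=posterior.get)
    return best, posterior[best] / norm
```

In a CGA-like flow, each agent's initial and debated answers would pass through `confidence_gate` first, and the surviving pairs would then go to `bayesian_fusion`. Under these assumptions, two agreeing agents yield a fused confidence higher than either individual confidence, while disagreement pulls it down.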

Improving Reliability and Discrimination

The study evaluated these protocols across five benchmarks and four task types, including scientific QA, stance detection, and hate speech detection, using six homogeneous and heterogeneous debating pairs per benchmark. The results show that the aggregated confidence is substantially more discriminative, as measured by AUARC, than that of the best single agent or the standard debate baselines; in other words, it is better at separating correct answers from incorrect ones. Where standard debate methods lose correctness on more ambiguous tasks, the new protocols keep the F1-score stable and recover those losses.
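Discrimination here is reported as AUARC, the area under the accuracy-rejection curve. The sketch below shows one common discrete approximation: sort predictions from most to least confident, compute accuracy over the top k for every k, and average. The exact variant the paper uses is not stated in this summary, so treat the function as an assumption.

```python
def auarc(confidences, correct):
    """Discrete area under the accuracy-rejection curve.

    confidences: list of confidence scores, one per prediction
    correct:     list of 0/1 flags, 1 if the prediction was right
    Higher is better: confident predictions should be the correct ones.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    # Most confident first; rejecting from the tail raises coverage cost.
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    hits = 0
    area = 0.0
    for k, i in enumerate(order, start=1):
        hits += correct[i]
        area += hits / k  # accuracy when only the top-k are answered
    return area / len(order)
```

A confidence signal that ranks all correct answers above all incorrect ones scores higher than one that ranks them in the opposite order, even though overall accuracy is identical in both cases. This is why AUARC can improve while F1 stays flat.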

The Role of Calibration

A key finding of the research concerns post-hoc calibration. The authors tested both parametric (such as Beta calibration) and non-parametric (such as isotonic regression) methods to adjust confidence scores. They found that calibration improves correctness (F1-score) for both confidence estimators, sequence probability and self-report, while the system's ability to rank its answers by confidence (discrimination, measured by AUARC) is less dependent on this step. Overall, the research indicates that by carefully managing confidence signals, multiagent systems can become more reliable tools for complex reasoning tasks.
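As an illustration of the non-parametric option, the following is a minimal pool-adjacent-violators sketch of isotonic regression in plain Python. It fits a non-decreasing step function from raw scores to observed correctness on held-out data. It is a simplified stand-in, not the paper's calibrator: production libraries typically interpolate between steps, and the function names here are hypothetical.

```python
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function mapping score -> P(correct).

    scores: raw confidence values from one agent on held-out examples
    labels: 0/1 correctness flags for the same examples
    Returns (uppers, values): the upper score bound and the calibrated
    value of each step, both in ascending order.
    """
    # Pool tied scores first so each distinct score appears once.
    grouped = {}
    for s, y in zip(scores, labels):
        total, count = grouped.get(s, (0.0, 0))
        grouped[s] = (total + y, count + 1)

    blocks = []  # each block: [sum of labels, count, upper score bound]
    for s in sorted(grouped):
        total, count = grouped[s]
        blocks.append([total, count, s])
        # Merge backwards while monotonicity is violated.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            t, c, upper = blocks.pop()
            blocks[-1][0] += t
            blocks[-1][1] += c
            blocks[-1][2] = upper
    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return uppers, values


def calibrate(score, uppers, values):
    """Look up the calibrated value for a new raw score."""
    i = bisect_left(uppers, score)
    return values[min(i, len(values) - 1)]
```

Fitting one such map per agent puts every agent's output on the same scale (an estimated probability of being correct), which is the precondition for combining confidences across models. Because the map is monotone, it merges some scores into ties but never reverses the order of a single agent's predictions, which is consistent with the finding that discrimination depends less on calibration than F1 does.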
