Multiagent Protocols with Aggregated Confidence Signals
This research addresses a gap in how multiagent systems, in which multiple AI models collaborate to solve a problem, evaluate their own reliability. Individual AI models can estimate how confident they are in an answer, but there has been no standard way to combine these individual confidence signals into a single, trustworthy score for the entire system. This paper introduces three new protocols that transform individual model confidence into a unified system-level confidence, allowing multiagent systems to provide more reliable and discriminative outputs.
Bridging the Confidence Gap
Current multiagent debate (MAD) methods often struggle because nothing oversees how answers change during the debate: models may switch from a correct answer to an incorrect one. Furthermore, confidence signals from different models are not naturally comparable, as each model has its own internal scale. The authors propose protocols that first normalize these raw confidence signals using learned transformations or thresholds. Once the signals are comparable, the system can weigh different perspectives against each other or decide when to trust one agent over another.
Three New Protocols
The researchers developed three distinct ways to aggregate confidence:
Weighted Stream Voting (WSV): This method treats every response from every agent as a piece of evidence. It assigns a learned weight to each source, calculating a final score by summing the weighted confidence of all agents that agree on a specific answer.
Confidence Gating with Aggregation (CGA): This two-step process first uses a "gate" to decide if an agent should keep its initial answer or switch to a debated one based on whether its confidence improves. It then uses a mathematical technique called Bayesian fusion to combine the surviving answers.
Human-Inspired Debate (HID): This deterministic approach mimics human collaboration. It routes answers based on agreement and confidence levels, using high-confidence agents to "convince" low-confidence ones and triggering debates only when necessary.
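The first two protocols can be illustrated with a minimal sketch. It assumes confidences have already been normalized to [0, 1] and that per-source weights were learned elsewhere. The function names, the source labels, and the choice to normalize the winning score by the total are illustrative assumptions, not the authors' implementation. The Bayesian fusion step of CGA is not shown.

```python
from collections import defaultdict


def weighted_stream_vote(responses, weights):
    """WSV-style aggregation (illustrative).

    responses: list of (source, answer, confidence), confidence in [0, 1].
    weights:   dict mapping source -> non-negative weight (assumed learned).
    Returns (winning answer, share of the total weighted confidence).
    """
    scores = defaultdict(float)
    for source, answer, confidence in responses:
        # Every response counts as evidence for the answer it supports.
        scores[answer] += weights.get(source, 0.0) * confidence
    total = sum(scores.values())
    if total == 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    # Normalizing by the total is an assumption made for this sketch.
    return best, scores[best] / total


def confidence_gate(initial, debated):
    """CGA-style gate (illustrative): switch to the debated
    (answer, confidence) pair only if confidence improved."""
    return debated if debated[1] > initial[1] else initial


# Hypothetical sources: one initial and one debated stream.
responses = [
    ("agent_a/initial", "yes", 0.9),
    ("agent_b/initial", "no", 0.6),
    ("agent_b/debated", "yes", 0.7),
]
weights = {"agent_a/initial": 1.0, "agent_b/initial": 0.5, "agent_b/debated": 1.0}

answer, confidence = weighted_stream_vote(responses, weights)
gated = confidence_gate(("no", 0.6), ("yes", 0.7))
```

In this toy input, "yes" collects weighted confidence from two sources and "no" from one, so "yes" wins. The gate keeps agent B's debated answer because its confidence rose from 0.6 to 0.7.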
Improving Reliability and Discrimination
The study evaluated these protocols across five benchmarks and four task types, including scientific QA, stance detection, and hate speech detection. The results show that these protocols produce an aggregated confidence score that is significantly more discriminative, meaning it is better at distinguishing between correct and incorrect answers, than standard debate baselines or single-agent setups. While standard debate methods often suffer from performance drops on ambiguous tasks, these new protocols help the system maintain stable accuracy and even recover performance losses.
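Discrimination in this sense can be made concrete with a small sketch. One common measure is AUROC: the probability that a randomly chosen correct answer receives higher confidence than a randomly chosen incorrect one. This summary does not state which metric the paper reports, so the choice of AUROC, the function name, and the toy data below are assumptions.

```python
def auroc(confidences, correct):
    """Pairwise AUROC: the fraction of (correct, incorrect) answer pairs
    in which the correct answer has the higher confidence (ties count 0.5)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: aggregated confidence per question and whether the answer was right.
confidences = [0.9, 0.8, 0.7, 0.4, 0.3]
correct = [True, True, False, True, False]
score = auroc(confidences, correct)
```

A score of 0.5 means the confidence carries no ranking information, and 1.0 means every correct answer outranks every incorrect one.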
The Role of Calibration
A key finding of the research is that post-hoc calibration is essential for these systems to function correctly. The authors tested both parametric methods (such as Beta calibration) and non-parametric methods (such as isotonic regression) to adjust confidence scores. They found that calibration is vital for improving the system's task performance (F1-score), whereas the system's ability to rank its answers by confidence (discrimination) depends less on this step. Ultimately, the research demonstrates that by carefully managing confidence signals, multiagent systems can become more reliable tools for complex reasoning tasks.
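As a rough illustration of the non-parametric option, the sketch below fits isotonic regression with the pool-adjacent-violators algorithm and applies it as a step function. It is a simplified version under stated assumptions: ties in raw confidence are not pooled, no interpolation is done between steps, and the names `fit_isotonic` and `calibrate` are hypothetical rather than taken from the paper.

```python
def fit_isotonic(confidences, correct):
    """Fit a non-decreasing step function from raw confidence to
    empirical accuracy using pool-adjacent-violators.
    Returns a list of (upper confidence bound, calibrated value)."""
    blocks = []  # each block: [sum of labels, count, max raw confidence]
    for conf, ok in sorted(zip(confidences, correct)):
        blocks.append([float(ok), 1, conf])
        # Merge backwards while the previous block's mean exceeds the
        # current block's mean, which would violate monotonicity.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n, hi = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][2] = hi
    return [(hi, s / n) for s, n, hi in blocks]


def calibrate(model, confidence):
    """Map a raw confidence to the value of the first step covering it."""
    for hi, value in model:
        if confidence <= hi:
            return value
    return model[-1][1]  # above the largest seen confidence


# Toy held-out data: raw confidence and whether the answer was correct.
raw = [0.2, 0.4, 0.6, 0.7, 0.9, 0.95]
was_correct = [False, True, False, True, True, True]

model = fit_isotonic(raw, was_correct)
calibrated = calibrate(model, 0.5)
```

Because the fitted mapping never decreases, it cannot reverse the order of two answers, although it can tie them. That is consistent with the observation that discrimination depends less on calibration than F1-score does.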