Multi-agent debate is a powerful technique where multiple AI models critique and refine each other’s reasoning to improve accuracy and reduce hallucinations. However, this process is computationally expensive because it requires generating long, multi-turn transcripts before an answer can be reached. This paper introduces Internalized Multi-Agent Debate (IMAD), a framework that distills this complex, multi-agent process into a single, efficient language model. By training a model to "internalize" the debate, the researchers achieve the reasoning benefits of multi-agent collaboration while using up to 93% fewer tokens.
How the Internalization Process Works
The IMAD framework uses a two-stage fine-tuning pipeline to teach a single model how to simulate a debate internally. First, the model undergoes supervised fine-tuning on a dataset of structured debate logs, learning the format of multi-agent interaction, such as proposing arguments and reaching a consensus.
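The first stage can be pictured as ordinary supervised fine-tuning on serialized transcripts. The sketch below shows one hypothetical way to turn a debate log into a prompt/target pair; the record fields and role tags are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical sketch: converting a multi-agent debate log into a
# supervised fine-tuning example. Tag names and fields are assumptions.

def debate_to_sft_example(question, turns, consensus):
    """Serialize a debate log into a single prompt/target pair.

    `turns` is a list of (agent_name, argument) tuples; the target
    reproduces the full debate so the model learns the
    propose-critique-consensus format during SFT.
    """
    transcript = "\n".join(
        f"<{agent}> {argument} </{agent}>" for agent, argument in turns
    )
    target = f"{transcript}\n<consensus> {consensus} </consensus>"
    return {"prompt": f"Question: {question}\n", "target": target}


example = debate_to_sft_example(
    "What is 7 * 8?",
    [("agent_a", "I propose 56."), ("agent_b", "I agree: 7 * 8 = 56.")],
    "56",
)
```

Training on pairs like this teaches the single model to emit the whole multi-agent interaction itself, which is the prerequisite for later compressing it away.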
Second, the model is optimized using reinforcement learning with two specific mechanisms: a "formatting reward" that encourages the model to follow the debate structure, and a "correctness with length-clipping reward." The latter forces the model to produce the correct answer within a shrinking token limit. As training progresses, the model is incentivized to move the entire debate process into its "latent space"—performing the multi-perspective analysis internally rather than writing it out—until it can generate the final answer directly and efficiently.
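The two reward signals described above can be combined into a single scalar per rollout. This is a minimal sketch under assumed conventions (a `<consensus>` tag for format checking, whitespace tokenization, equal reward weights); the paper's actual weights, tags, and clipping schedule may differ.

```python
# Illustrative sketch of the two RL reward mechanisms: a formatting
# reward plus a correctness reward gated by a shrinking token limit.
import re

def debate_reward(output, gold_answer, token_limit):
    """Return formatting reward + length-clipped correctness reward."""
    tokens = output.split()  # crude whitespace tokenization for illustration

    # Formatting reward: the output must end with a consensus block.
    match = re.search(r"<consensus>\s*(.*?)\s*</consensus>\s*$", output)
    format_reward = 1.0 if match else 0.0

    # Correctness with length clipping: the right answer only earns
    # reward if the whole output fits within the token budget.
    answer = match.group(1) if match else ""
    within_budget = len(tokens) <= token_limit
    correct = answer == gold_answer and within_budget
    correctness_reward = 1.0 if correct else 0.0

    return format_reward + correctness_reward
```

During training, `token_limit` would be decayed over time (e.g. from a full-transcript budget down toward answer-only length), so verbose written-out debate stops earning the correctness reward and the model is pushed to carry that reasoning internally.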
Discovering Agent Subspaces
Beyond efficiency, the researchers investigated whether the model actually maintains distinct "agent" perspectives after the debate is internalized. By using a technique called activation steering, they identified "agent-specific subspaces"—interpretable directions within the model's internal activation space that correspond to different reasoning styles.
When the researchers applied steering vectors to these subspaces, they found that the model could be nudged to adopt the specific behaviors of the agents it was trained to simulate. This confirms that the collaborative structure of the debate is not lost or collapsed during distillation; instead, it is preserved as a structured, recoverable part of the model’s internal representation.
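Mechanically, activation steering amounts to adding a scaled direction vector to a hidden state at inference time. The numpy sketch below is a toy illustration of that operation, not the paper's implementation; in practice the direction would be estimated from model activations (e.g. as a mean difference between runs where a given agent is active and runs where it is not) and injected via a forward hook.

```python
# Toy sketch of activation steering: nudge a hidden state along an
# assumed "agent-specific" direction in activation space.
import numpy as np

def steer(hidden_state, agent_direction, alpha):
    """Add alpha times the unit agent direction to a hidden state."""
    unit = agent_direction / np.linalg.norm(agent_direction)
    return hidden_state + alpha * unit

hidden = np.zeros(4)
agent_dir = np.array([2.0, 0.0, 0.0, 0.0])
steered = steer(hidden, agent_dir, alpha=1.5)
```

Larger `alpha` pushes the activation further into the agent's subspace, which is how the model can be nudged toward one simulated agent's reasoning style.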
Controlling Malicious Behaviors
The researchers also demonstrated a practical safety application for these agent subspaces. By training an IMAD model that included a "malicious" agent—one instructed to hallucinate or act harmfully—they showed that the resulting harmful behavior was localized within a specific, identifiable subspace.
Because this behavior was confined to a distinct direction in the model's activation space, it could be suppressed using "negative steering." This approach proved more effective at controlling harmful traits than steering a standard base model, and it achieved this control with a smaller reduction in the model's overall task performance. This suggests that internalizing reasoning behaviors provides a new, more precise way to monitor and regulate AI safety.
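Negative steering is the same operation with the sign flipped: subtract the harmful direction rather than add it. The sketch below also shows a stronger variant that projects the component along that direction out entirely; both are illustrative toy implementations, with the "malicious" direction assumed to be known.

```python
# Toy sketch of suppressing a localized harmful direction.
import numpy as np

def negative_steer(hidden_state, malicious_direction, alpha):
    """Steer against the harmful direction by subtracting alpha times it."""
    unit = malicious_direction / np.linalg.norm(malicious_direction)
    return hidden_state - alpha * unit

def project_out(hidden_state, malicious_direction):
    """Remove the component of the activation along the harmful direction."""
    unit = malicious_direction / np.linalg.norm(malicious_direction)
    return hidden_state - np.dot(hidden_state, unit) * unit
```

Because the harmful trait occupies one identifiable direction, these edits leave the rest of the activation untouched, which is consistent with the paper's observation that suppression costs little overall task performance.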