Latent Agents: A Post-Training Procedure for Intern...

Key Takeaways

  • Multi-agent debate, in which multiple AI models critique and refine each other’s reasoning, improves accuracy and reduces hallucinations in large language models (LLMs).
  • However, it is compute-intensive, requiring generation of long transcripts before answering questions.
  • Across multiple models and benchmarks, our internalized models match or exceed explicit multi-agent debate performance using up to 93% fewer tokens.
  • Our findings offer a new perspective for understanding multi-agent capabilities in distilled models and provide practical guidelines for controlling internalized reasoning behaviors.
Paper Abstract

Multi-agent debate has been shown to improve reasoning in large language models (LLMs). However, it is compute-intensive, requiring generation of long transcripts before answering questions. To address this inefficiency, we develop a framework that distills multi-agent debate into a single LLM through a two-stage fine-tuning pipeline combining debate structure learning with internalization via dynamic reward scheduling and length clipping. Across multiple models and benchmarks, our internalized models match or exceed explicit multi-agent debate performance using up to 93% fewer tokens. We then investigate the mechanistic basis of this capability through activation steering, finding that internalization creates agent-specific subspaces: interpretable directions in activation space corresponding to different agent perspectives. We further demonstrate a practical application: by instilling malicious agents into the LLM through internalized debate, then applying negative steering to suppress them, we show that distillation makes harmful behaviors easier to localize and control with smaller reductions in general performance compared to steering base models. Our findings offer a new perspective for understanding multi-agent capabilities in distilled models and provide practical guidelines for controlling internalized reasoning behaviors. Code available at this https URL

Multi-agent debate is a powerful technique where multiple AI models critique and refine each other’s reasoning to improve accuracy and reduce hallucinations. However, this process is computationally expensive because it requires generating long, multi-turn transcripts before an answer can be reached. This paper introduces Internalized Multi-Agent Debate (IMAD), a framework that distills this complex, multi-agent process into a single, efficient language model. By training a model to "internalize" the debate, the researchers achieve the reasoning benefits of multi-agent collaboration while using up to 93% fewer tokens.

How the Internalization Process Works

The IMAD framework uses a two-stage fine-tuning pipeline to teach a single model how to simulate a debate internally. First, the model undergoes supervised fine-tuning on a dataset of structured debate logs, learning the format of multi-agent interaction, such as proposing arguments and reaching a consensus.
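To make the first stage concrete, here is a minimal sketch of what one structured debate log might look like as a supervised training example. The field names, agent roles, and two-agent setup are illustrative assumptions, not the paper's actual data schema.

```python
# A hypothetical SFT training example for debate-structure learning.
# Field names ("question", "turns", "agent", "answer") are illustrative,
# not the paper's actual format.
debate_example = {
    "question": "A train travels 120 km in 1.5 hours. What is its speed?",
    "turns": [
        {"agent": "Agent A",
         "text": "Speed = distance / time = 120 / 1.5 = 80 km/h."},
        {"agent": "Agent B",
         "text": "Agent A's setup is right; 120 / 1.5 is indeed 80, so 80 km/h."},
        {"agent": "Consensus",
         "text": "Both agents agree on 80 km/h."},
    ],
    "answer": "80 km/h",
}
```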
Second, the model is optimized using reinforcement learning with two specific mechanisms: a "formatting reward" that encourages the model to follow the debate structure, and a "correctness with length-clipping reward." The latter forces the model to produce the correct answer within a shrinking token limit. As training progresses, the model is incentivized to move the entire debate process into its "latent space"—performing the multi-perspective analysis internally rather than writing it out—until it can generate the final answer directly and efficiently.
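A rough sketch of how such a length-clipped reward schedule could be implemented is shown below. The linear budget decay and the specific reward weights are assumptions for illustration; the paper's dynamic schedule may differ.

```python
def debate_reward(answer_correct: bool, follows_format: bool,
                  n_tokens: int, step: int, total_steps: int,
                  max_budget: int = 2048, min_budget: int = 64) -> float:
    """Formatting reward plus length-clipped correctness reward (sketch)."""
    # The token budget shrinks as training progresses, pushing the debate
    # out of the visible transcript and into the model's latent space.
    frac = step / total_steps
    budget = int(max_budget - frac * (max_budget - min_budget))

    reward = 0.0
    if follows_format:
        reward += 0.5  # formatting reward; the weight here is illustrative
    if answer_correct and n_tokens <= budget:
        reward += 1.0  # correctness only counts within the shrinking clip
    return reward
```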

Discovering Agent Subspaces

Beyond efficiency, the researchers investigated whether the model actually maintains distinct "agent" perspectives after the debate is internalized. By using a technique called activation steering, they identified "agent-specific subspaces"—interpretable directions within the model's internal activation space that correspond to different reasoning styles.
When the researchers applied steering vectors to these subspaces, they found that the model could be nudged to adopt the specific behaviors of the agents it was trained to simulate. This confirms that the collaborative structure of the debate is not lost or collapsed during distillation; instead, it is preserved as a structured, recoverable part of the model’s internal representation.
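The sketch below shows one common way to extract and apply such a steering vector: taking the difference in mean activations between agent-flavored and neutral prompts, then adding that direction back in with a forward hook. It assumes a Hugging Face-style PyTorch model; the layer index, hook point, and the variables `model`, `tok`, `agent_prompts`, and `neutral_prompts` are illustrative assumptions rather than the paper's exact setup.

```python
import torch

@torch.no_grad()
def mean_activation(model, tok, prompts, layer: int) -> torch.Tensor:
    """Average last-token hidden state at `layer` over a set of prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0, -1])
    return torch.stack(acts).mean(dim=0)

# Direction separating one internalized agent's persona from a neutral
# baseline (difference-in-means, a standard steering-vector recipe).
agent_vec = (mean_activation(model, tok, agent_prompts, layer=16)
             - mean_activation(model, tok, neutral_prompts, layer=16))

def make_steering_hook(vec: torch.Tensor, alpha: float):
    """Add alpha times the unit vector to the layer's hidden states."""
    direction = vec / vec.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden += alpha * direction  # in-place nudge along the agent subspace
        return output
    return hook

# A positive alpha nudges generations toward that agent's behavior.
handle = model.model.layers[16].register_forward_hook(
    make_steering_hook(agent_vec, alpha=4.0))
```

Calling `handle.remove()` restores the unsteered model.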

Controlling Malicious Behaviors

The researchers also demonstrated a practical safety application for these agent subspaces. By training an IMAD model that included a "malicious" agent—one instructed to hallucinate or act harmfully—they showed that the resulting harmful behavior was localized within a specific, identifiable subspace.
Because this behavior was confined to a distinct direction in the model's activation space, it could be suppressed using "negative steering." This approach proved more effective at controlling harmful traits than steering a standard base model, and it did so with a smaller reduction in the model's overall task performance. This suggests that internalizing reasoning behaviors offers a more precise handle for monitoring and controlling model behavior from a safety perspective.
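In terms of the steering sketch above, suppression amounts to applying the same hook with a negative coefficient along the malicious agent's direction. Here `malicious_vec` is assumed to be extracted the same way as `agent_vec` earlier, using prompts that elicit the instilled malicious persona.

```python
# Negative steering: subtract the malicious-agent direction at each forward
# pass to suppress that persona while leaving other behavior mostly intact.
handle = model.model.layers[16].register_forward_hook(
    make_steering_hook(malicious_vec, alpha=-4.0))
```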
