Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems
Multi-agent systems, in which several AI agents collaborate on a complex problem, have become a standard way to improve reasoning in large language models. Typically, these agents communicate by writing text to one another, and that text acts as a fixed interface. This paper introduces DiffMAS, a framework that replaces text-based communication with a "latent" (internal) channel. By letting agents share their Key-Value (KV) caches, the attention keys and values a model stores for the tokens it has already processed, DiffMAS allows the entire multi-agent system to be trained as a single differentiable model.
Moving Beyond Textual Communication
In most multi-agent systems, agents must translate their internal reasoning into human-readable text to pass information to the next agent. This creates a bottleneck: the message is forced into discrete tokens, so gradients cannot flow through it and the system cannot directly optimize how information is shared. DiffMAS removes this barrier by using the model's internal KV cache, a continuous representation of what the model has computed so far, as the communication medium. Because this channel is continuous and differentiable, the system can use gradient-based learning to optimize how agents encode and interpret information across the entire chain of interaction.
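The difference between a discrete and a continuous channel can be illustrated with a toy example. The sketch below is not DiffMAS itself: `text_message` and `latent_message` are hypothetical stand-ins, and a softmax vector plays the role of the continuous state that a real system would carry in its KV cache. It only shows that a small change in the sender's internal state leaves a discrete token unchanged while it does move a continuous message.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def text_message(logits):
    """Discrete channel (hypothetical): only the id of the most likely token is sent."""
    return max(range(len(logits)), key=lambda i: logits[i])

def latent_message(logits):
    """Continuous channel (hypothetical): the full real-valued vector is sent."""
    return softmax(logits)

logits = [2.0, 1.0, 0.5]
nudged = [2.0, 1.01, 0.5]  # a small change in the sender's internal state

# The token id is identical before and after the nudge, so the receiver
# sees no difference and no learning signal can pass through this channel.
text_changed = text_message(logits) != text_message(nudged)

# The continuous message shifts slightly, so a gradient can relate the
# receiver's loss back to the sender's state.
latent_shift = max(
    abs(a - b) for a, b in zip(latent_message(logits), latent_message(nudged))
)

print(text_changed, latent_shift > 0)
```

In a real model the continuous message would be the per-layer keys and values rather than a probability vector, but the same contrast between discrete and continuous messages applies.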
How DiffMAS Works
The DiffMAS framework operates in two distinct stages. In the first stage, a series of agents work sequentially, each building upon a shared, growing "latent trace" of KV states. Instead of overwriting previous information, each agent appends its own contribution to this trace, creating a cumulative record of the reasoning process. In the second stage, the final agent uses this accumulated trace to perform the final reasoning and generate an answer. By applying supervised fine-tuning to this process, the model learns to refine its communication strategy, effectively teaching the agents how to better "talk" to each other through their internal memory.
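The two stages described above can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the paper's implementation: each agent is reduced to a hypothetical `agent_step` function with a single `scale` parameter, the trace holds one (key, value) pair per agent rather than per-layer, per-token KV states, and there is no training loop. It shows only the control flow: append in stage one, read the full trace in stage two.

```python
import math

def attend(query, trace):
    """Softmax attention of one query vector over every (key, value) pair in the trace."""
    scores = [sum(q * k for q, k in zip(query, key)) for key, _ in trace]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(trace[0][1])
    return [
        sum(w / total * value[d] for w, (_, value) in zip(weights, trace))
        for d in range(dim)
    ]

def agent_step(scale, trace):
    """Hypothetical agent: read the whole trace, then emit one new (key, value) entry."""
    query = trace[-1][0]              # use the latest key as this agent's query
    context = attend(query, trace)    # the agent sees everything written so far
    key = [scale * c for c in context]
    value = [scale * c + 1.0 for c in context]
    return key, value

def run_pipeline(prompt_kv, agent_scales, final_query):
    # Stage 1: agents run in sequence and append to the shared trace.
    trace = list(prompt_kv)
    for scale in agent_scales:
        trace.append(agent_step(scale, trace))   # append, never overwrite
    # Stage 2: the final agent reads the full accumulated trace.
    answer = attend(final_query, trace)
    return trace, answer

prompt_kv = [([1.0, 0.0], [0.5, 0.5]), ([0.0, 1.0], [1.0, -1.0])]
trace, answer = run_pipeline(
    prompt_kv, agent_scales=[0.9, 1.1, 0.8], final_query=[1.0, 1.0]
)
print(len(trace), answer)
```

Because every step is built from continuous operations, a supervised loss on `answer` could in principle be differentiated with respect to each agent's parameters, which is the property the fine-tuning stage relies on.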
Performance and Results
The researchers tested DiffMAS on a range of challenging tasks, including advanced mathematics (AIME24/25), scientific reasoning (GPQA-Diamond), code generation (HumanEval+), and commonsense benchmarks. The results show that DiffMAS consistently outperforms both traditional text-based multi-agent systems and other latent communication methods. With the Qwen3-8B model, for example, the framework achieved a 26.7% improvement on the AIME24 math benchmark and a 20.2% boost on GPQA-Diamond. Gains were observed across several model sizes, suggesting that the approach is effective for both mid-scale and larger language models.
Key Takeaways
The primary advantage of DiffMAS is its ability to avoid the "gradient attenuation" found in systems that rely on fixed, overwriting interfaces. In traditional setups, the signal used to train the system often fades as it passes through multiple agents. Because DiffMAS uses a concatenative approach—where all intermediate reasoning remains accessible—the training signal remains strong throughout the entire pipeline. While this method requires more memory as the latent trace grows, it provides a more robust way to optimize multi-agent collaboration, moving the field closer to truly end-to-end, learnable reasoning systems.
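A toy calculation illustrates why an overwriting interface attenuates gradients while a concatenative one does not. The model below is an assumption made for illustration, not the paper's analysis: messages are scalars, every agent multiplies the previous message by a fixed factor `w`, and the final reader either sees only the last message or sums the whole trace. Gradients are estimated by central finite differences.

```python
def overwrite_output(m0, w, steps):
    """Each agent replaces the message; only the last one reaches the reader."""
    m = m0
    for _ in range(steps):
        m = w * m
    return m

def concat_output(m0, w, steps):
    """Each agent appends its message; the reader sums over the whole trace."""
    trace = [m0]
    for _ in range(steps):
        trace.append(w * trace[-1])
    return sum(trace)

def grad_wrt_first_message(fn, m0, w, steps, eps=1e-6):
    """Central finite difference of the output with respect to the first message."""
    return (fn(m0 + eps, w, steps) - fn(m0 - eps, w, steps)) / (2 * eps)

w, steps = 0.5, 6
g_overwrite = grad_wrt_first_message(overwrite_output, 1.0, w, steps)
g_concat = grad_wrt_first_message(concat_output, 1.0, w, steps)

# Overwriting: the gradient is w**steps, which shrinks geometrically.
# Concatenation: the gradient is 1 + w + ... + w**steps, which includes a
# direct path from the first message to the reader.
print(g_overwrite, g_concat)
```

In this toy setting the overwriting gradient equals `w**steps` and fades as agents are added, whereas the concatenative gradient never drops below 1 for non-negative `w` because the first message stays directly visible to the reader. The trade-off mentioned above also appears here: the trace grows by one entry per agent.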