Towards Direct Latent-Space Synthesis for Parallel...

Large language models (LLMs) are increasingly used as "engines" for complex agentic systems that break tasks into parallel branches, such as exploring different solutions or gathering evidence from multiple sources. Today, such systems merge the branches by converting each one to plain text and concatenating the results into a single long prompt. This is inefficient because the model must re-process information it has already generated, and flattening the branches into one sequence obscures the original structure of the task. This paper introduces Parallel-Synthesis, a framework that lets a "synthesizer" agent directly consume the internal latent states (KV caches) of parallel worker agents, avoiding the redundant text processing.

A New Interface for Agent Workflows

Parallel-Synthesis replaces the standard text-based communication between agents with a direct latent-space interface. In a typical workflow, multiple worker agents generate outputs independently. Instead of handing these outputs over as text, the system captures each worker's "KV cache": the attention key and value tensors the model stores for every token it has processed. Because the synthesizer ingests these caches directly, the framework preserves the independent structure of the branches and eliminates the "prefill" computation that would otherwise be spent re-reading the concatenated text.

How the Framework Works

Because the worker caches are generated independently, they cannot simply be stacked together; they must first be calibrated so the synthesizer can interpret them. Parallel-Synthesis uses three technical components to make this possible:

Positional Re-encoding: It adjusts the internal position markers of each worker's cache so that they all appear to the synthesizer as if they originated from the same branching point in the workflow.
Cache Mapping: A learnable "mapper" uses a multilayer perceptron (MLP) to predict specific adjustments for each cache, calibrating the keys and values so the synthesizer can interpret them as a unified context.
Synthesizer LoRA: The system uses a fine-tuned low-rank adapter (LoRA) that teaches the synthesizer how to reason over and aggregate information from these non-sequential, parallel cache inputs.
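
To make the positional re-encoding step more concrete, here is a minimal sketch in plain Python. It assumes the model uses rotary position embeddings (RoPE), which this summary does not confirm, and the function names `rope_rotate` and `reencode_keys` are hypothetical. With RoPE, a cached key can be moved to a new position by rotating it by the position offset, with no need to recompute it from the original tokens. The cache mapper and the LoRA adapter are not shown.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate one attention-head vector (even length) by `position` RoPE steps."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Each pair of dimensions rotates at its own frequency.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def reencode_keys(cached_keys, old_start, new_start):
    """Shift already-rotated keys from positions old_start.. to new_start..

    Rotations compose, so applying the offset to a cached key is equivalent
    to encoding the raw key at its new position.
    """
    delta = new_start - old_start
    return [rope_rotate(k, delta) for k in cached_keys]

# Hypothetical worker cache: two keys that were encoded starting at position 40.
raw_keys = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.3, -0.1, 0.4]]
worker_cache = [rope_rotate(k, 40 + t) for t, k in enumerate(raw_keys)]

# Re-encode so the branch appears to start at a shared branching point (here 10).
moved = reencode_keys(worker_cache, old_start=40, new_start=10)
direct = [rope_rotate(k, 10 + t) for t, k in enumerate(raw_keys)]
```

In this sketch, `moved` agrees with `direct` up to floating-point error, which is why every worker's cache can be made to look as if it started from the same point. Only keys carry the rotation in RoPE; values are left untouched.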

Performance and Efficiency

The researchers tested Parallel-Synthesis across nine diverse datasets, including math, science, code generation, and multi-agent database diagnosis. The results show that the framework stays competitive on quality while starting to generate much sooner:

Quality: It matches or outperforms traditional text-based synthesis on seven out of nine datasets, with only minor performance gaps on the remaining two.
Speed: By avoiding the need to re-process (re-prefill) the worker outputs as text, the system achieves a 2.5x to 11x reduction in time-to-first-token (TTFT).
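
The source of the speedup can be illustrated with a rough token-count proxy. This is a sketch under stated assumptions, not a measurement: the token counts below are made up for illustration, the function names are hypothetical, and real TTFT also depends on hardware, attention cost, and the overhead of the cache calibration steps.

```python
def prefill_tokens_text(instruction_tokens, worker_output_tokens):
    """Text-based synthesis: the synthesizer re-reads every worker output."""
    return instruction_tokens + sum(worker_output_tokens)

def prefill_tokens_latent(instruction_tokens):
    """Cache-based synthesis (assumed): only the synthesizer's own
    instruction tokens are prefilled; worker outputs arrive as KV caches."""
    return instruction_tokens

# Illustrative numbers only: 3 workers, 400 output tokens each,
# plus a 300-token synthesizer instruction.
workers = [400, 400, 400]
instruction = 300

text_cost = prefill_tokens_text(instruction, workers)
latent_cost = prefill_tokens_latent(instruction)
ratio = text_cost / latent_cost
print(text_cost, latent_cost, ratio)
```

Under this proxy, the saving grows with the number of workers and the length of their outputs, which is consistent with the reported reduction being a range rather than a single figure.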

Key Considerations

The authors note that this approach is distinct from existing RAG-style cache reuse. While RAG methods often struggle with unresolved dependencies between document chunks, Parallel-Synthesis is designed specifically for agentic workflows where worker outputs are more coherent, such as complete candidate solutions or subtask results. The framework is designed as a "plug-and-play" solution, meaning the synthesizer-side adapter is only activated when parallel synthesis is required, leaving the underlying worker-side execution unchanged.
