Towards Direct Latent-Space Synthesis for Parallel...

Key Takeaways

  • Large language models increasingly serve as execution engines for agentic systems, yet they still consume context through a sequential text interface.
  • This creates a mismatch with modern structured agent workflows, in which independent branches explore subtasks, retrieve evidence, or generate candidate solutions before a final synthesis step.
  • Existing systems typically merge these branches by concatenating their textual outputs, which discards the parallel structure and incurs redundant prefill computation.
  • In this work, we introduce Parallel-Synthesis, a plug-and-play framework that enables a synthesizer to directly consume the KV caches produced by parallel worker agents.
Paper Abstract

Large language models increasingly serve as execution engines for agentic systems, yet they still consume context through a sequential text interface. This creates a mismatch with modern structured agent workflows, in which independent branches explore subtasks, retrieve evidence, or generate candidate solutions before a final synthesis step. Existing systems typically merge these branches by concatenating their textual outputs, which discards the parallel structure and incurs redundant prefill computation. In this work, we introduce Parallel-Synthesis, a plug-and-play framework that enables a synthesizer to directly consume the KV caches produced by parallel worker agents. Parallel-Synthesis combines a cache mapper that calibrates independently generated branch caches with a fine-tuned synthesizer adapter that enables generation from this non-sequential cache interface. We train Parallel-Synthesis using data that exposes the synthesizer to parallel cache contexts, teaches aggregation across cached branches, and distills reasoning behavior from standard text-concatenation-based synthesis. Across nine downstream datasets spanning math, science QA, code generation, GAIA, and multi-agent database diagnosis, Parallel-Synthesis matches or outperforms text-based synthesis on seven datasets and remains close on the other two. It also reduces time-to-first-token by 2.5x-11x, suggesting that direct cache-based synthesis is a promising interface for more native and efficient synthesis over parallel agent branches.

Large language models (LLMs) are increasingly used as "engines" for complex agentic systems that break tasks into parallel branches, such as exploring different solutions or gathering evidence from multiple sources. Today, such systems merge the branches by converting them into plain text and concatenating them into a single, long prompt. This is inefficient because the synthesizer must re-process tokens whose internal states the workers have already computed, and it obscures the original structure of the task. This paper introduces Parallel-Synthesis, a framework that lets a "synthesizer" agent directly consume the internal latent states (KV caches) of parallel worker agents, avoiding the redundant text processing.

A New Interface for Agent Workflows

Parallel-Synthesis replaces the standard text-based communication between agents with a direct latent-space interface. In a typical workflow, multiple worker agents generate outputs independently. Instead of passing these outputs along as text, the system captures each worker's "KV cache": the attention keys and values the model computed for that worker's tokens. Because the synthesizer ingests these caches directly, the framework preserves the independent structure of the branches and eliminates the "prefill" computation otherwise spent re-reading the concatenated text.
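To make the prefill saving concrete, here is a minimal token-counting sketch. It is not the paper's cost model: the branch names and token counts are hypothetical, and it assumes the shared prompt and every branch arrive as reusable cache entries, so only a short synthesis instruction is newly prefilled. It also ignores any overhead from calibrating the caches, so it should not be read as a prediction of the reported speedups.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    """One worker branch and the number of tokens it generated (hypothetical)."""
    name: str
    generated_tokens: int


def prefill_tokens_text(shared_prompt, branches, instruction):
    """Text concatenation: the synthesizer re-reads the prompt, every
    branch output, and the synthesis instruction as fresh input tokens."""
    return shared_prompt + sum(b.generated_tokens for b in branches) + instruction


def prefill_tokens_cache(instruction):
    """Cache interface (assumed): branch tokens arrive as KV entries,
    so only the new synthesis instruction has to be prefilled."""
    return instruction


# Hypothetical workflow with three parallel workers.
branches = [
    Branch("solver_a", 1200),
    Branch("solver_b", 900),
    Branch("retriever", 1500),
]

text_cost = prefill_tokens_text(shared_prompt=300, branches=branches, instruction=50)
cache_cost = prefill_tokens_cache(instruction=50)
print(text_cost, cache_cost)  # 3950 50
```

In this toy setting the text interface prefills 3950 tokens and the cache interface 50. The gap grows with the number and length of branches, which is the regime the framework targets.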

How the Framework Works

Because each worker cache is generated independently, with its own positions and without attending to the other branches, the caches cannot simply be stacked together; they must be calibrated before the synthesizer can use them. Parallel-Synthesis uses three technical components to make this possible:

  • Positional Re-encoding: It adjusts the internal position markers of each worker's cache so that they all appear to the synthesizer as if they originated from the same branching point in the workflow.

  • Cache Mapping: A learnable "mapper" uses an MLP to predict specific adjustments for each cache, calibrating the keys and values so the synthesizer can interpret them as a unified context.

  • Synthesizer LoRA: The system uses a fine-tuned adapter (LoRA) that teaches the synthesizer how to reason over and aggregate information from these non-sequential, parallel cache inputs.
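The first two components can be illustrated with a small, self-contained sketch. It rests on assumptions the summary does not confirm: that positions are encoded with rotary position embeddings (RoPE), where a cached key can be moved to a new position by rotating it by the position difference, and that the mapper applies an additive (residual) MLP correction. All function names, weights, and positions below are hypothetical.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` (even length) by RoPE angles."""
    dim = len(vec)
    out = list(vec)
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


def reposition_key(cached_key, old_pos, new_pos):
    """Rotations compose additively, so shifting a cached key only
    needs one extra rotation by the position difference."""
    return rope_rotate(cached_key, new_pos - old_pos)


def mlp_delta(vec, w1, w2):
    """Tiny two-layer MLP with ReLU that predicts a correction for `vec`."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def calibrate(vec, w1, w2):
    """Assumed residual form of the mapper: input plus predicted correction."""
    return [x + d for x, d in zip(vec, mlp_delta(vec, w1, w2))]


# Hypothetical positions: the worker generated this token at 37 + 4,
# but the synthesizer should see it at the shared branch point 10 + 4.
raw_key = [0.5, -1.0, 0.25, 2.0]
worker_start, branch_point, offset = 37, 10, 4

cached = rope_rotate(raw_key, worker_start + offset)
moved = reposition_key(cached, worker_start + offset, branch_point + offset)
direct = rope_rotate(raw_key, branch_point + offset)
max_err = max(abs(a - b) for a, b in zip(moved, direct))
print(max_err < 1e-9)

# With a zero output layer the mapper starts as the identity map.
w1 = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.5, -0.1, 0.2], [0.4, 0.0, 0.0, -0.3]]
w2_zero = [[0.0, 0.0, 0.0] for _ in range(4)]
print(calibrate(moved, w1, w2_zero) == moved)
```

The check on `max_err` shows that re-rotating a cached key gives the same result as encoding the raw key at the target position, which is what lets every branch appear to start from a common point without recomputing it. The zero-initialized output layer is one common way to start a learned correction as a no-op; whether the paper's mapper is initialized this way is not stated in the summary.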

Performance and Efficiency

The researchers evaluated Parallel-Synthesis on nine downstream datasets spanning math, science QA, code generation, GAIA, and multi-agent database diagnosis, comparing it against standard text-concatenation-based synthesis. The headline results:

  • Quality: It matches or outperforms traditional text-based synthesis on seven out of nine datasets, with only minor performance gaps on the remaining two.

  • Speed: By avoiding the need to re-process (re-prefill) the worker outputs as text, the system achieves a 2.5x to 11x reduction in time-to-first-token (TTFT).

Key Considerations

The authors note that this approach is distinct from existing cache reuse in retrieval-augmented generation (RAG). While RAG methods often struggle with unresolved dependencies between document chunks, Parallel-Synthesis is designed specifically for agentic workflows where worker outputs are more coherent, such as complete candidate solutions or subtask results. The framework is "plug-and-play": the synthesizer-side adapter is activated only when parallel synthesis is required, leaving the underlying worker-side execution unchanged.
