Can I Buy Your KV Cache?
AI agents currently waste large amounts of compute by repeating the same "prefill" step, in which the model reads a document and builds a key-value (KV) cache, every time they interact with the same content. This paper proposes a simple alternative: compute the KV cache once, store it as a reusable artifact, and let other agents "buy" the right to load that cache instead of recomputing it from scratch. By treating the KV cache as a shareable, first-class asset, this approach creates a "prefill CDN" that could save millions of dollars in redundant computation.
The Logic of Reuse
The core idea is to shift from a model where every agent performs its own prefill to one where a publisher provides a precomputed KV cache. Because the KV cache is a deterministic result of the input text (for a given model), loading a precomputed cache and continuing the conversation is "token-exact": the output is identical to what the agent would have generated had it performed the prefill itself. The cost savings therefore come with no loss in accuracy.
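The token-exact property can be illustrated with a toy sketch. Everything here is hypothetical and stands in for a real model: `embed`, `prefill`, `attend`, and `decode` are made-up helpers, the "model" is a single attention head over sine-based vectors, and JSON is used as a stand-in storage format. The only point is that a deterministic prefill, stored and loaded without loss, yields the same continuation as a fresh prefill.

```python
import json
import math

DIM = 4      # toy head dimension (hypothetical)
VOCAB = 16   # toy vocabulary size (hypothetical)

def embed(token, pos):
    """Deterministic toy vector standing in for a real model's projections."""
    return [math.sin((token + 1) * (i + 1) + 0.1 * pos) for i in range(DIM)]

def prefill(tokens):
    """Build a toy KV cache: one [key, value] pair per input position."""
    return [[embed(t, p), embed(t + 7, p)] for p, t in enumerate(tokens)]

def attend(query, cache):
    """Single-head softmax attention of one query over the cached entries."""
    scores = [sum(q * k for q, k in zip(query, key)) for key, _ in cache]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * value[i] for w, (_, value) in zip(weights, cache)) / total
            for i in range(DIM)]

def decode(cache, last_token, steps):
    """Greedy decoding that extends a private copy of the cache."""
    cache = list(cache)
    token, output = last_token, []
    for _ in range(steps):
        pos = len(cache)
        context = attend(embed(token, pos), cache)
        token = max(range(VOCAB),
                    key=lambda c: sum(a * b for a, b in zip(context, embed(c, 0))))
        cache.append([embed(token, pos), embed(token + 7, pos)])
        output.append(token)
    return output

document = [3, 1, 4, 1, 5, 9, 2, 6]

# Publisher: prefill once and serialize the cache as a reusable artifact.
artifact = json.dumps(prefill(document))

# Buyer A recomputes the prefill; buyer B loads the published artifact.
fresh = decode(prefill(document), document[-1], steps=5)
reused = decode(json.loads(artifact), document[-1], steps=5)

print(fresh == reused)  # True: both paths produce the same tokens
```

The equality holds only because the stored cache is bit-identical to the recomputed one. A real system would additionally have to pin the model weights, tokenizer, and numerical settings for the same guarantee to apply.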
Why It Is Economically Efficient
The paper shows that reusing a cache costs significantly less than re-prefilling. Using the Qwen3-4B model, the researchers found that reusing a cache is between 9 and 50 times cheaper than performing a full prefill. Crucially, this advantage grows with document length, because the compute cost of prefill scales quadratically with length, while the cost of loading a cache grows far more slowly. The break-even point is almost immediate: the investment in precomputing the cache pays for itself by the second read.
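The break-even claim follows from simple arithmetic, sketched below under a deliberately simplified cost model that is an assumption rather than the paper's own accounting: one prefill costs `prefill_cost`, each cache load costs `load_cost`, and storage costs are ignored. The function names are hypothetical, and the two ratios are the 9x and 50x endpoints quoted above.

```python
def total_cost(reads, prefill_cost, load_cost, reuse):
    """Total cost of serving `reads` requests for the same document."""
    if reuse:
        # One prefill by the publisher, then every read loads the cache.
        return prefill_cost + reads * load_cost
    # Without reuse, every read pays for a full prefill.
    return reads * prefill_cost

def break_even_reads(prefill_cost, load_cost):
    """Smallest number of reads at which reuse is strictly cheaper."""
    if load_cost >= prefill_cost:
        raise ValueError("reuse never pays off unless loading is cheaper than prefill")
    reads = 1
    while (total_cost(reads, prefill_cost, load_cost, reuse=True)
           >= total_cost(reads, prefill_cost, load_cost, reuse=False)):
        reads += 1
    return reads

# Cost ratios at the two ends of the reported 9x-50x range (load cost = 1 unit).
print(break_even_reads(prefill_cost=9, load_cost=1))   # 2
print(break_even_reads(prefill_cost=50, load_cost=1))  # 2
```

Under this model reuse wins once the number of reads exceeds r / (r - 1), where r is the prefill-to-load cost ratio. For any r above 2 that threshold lies below 2, so the second read already comes out ahead.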
The Importance of Hosting
A major challenge is where to store these caches. Shipping large KV caches to individual users is inefficient because they are nearly incompressible, and the cost of transferring the data often exceeds the cost of simply re-prefilling. The paper argues that the solution is to host these caches provider-side, similar to existing production prompt-caching. By keeping the cache within the provider’s infrastructure, the system eliminates expensive data egress costs, allowing providers to offer significant discounts to users while still maintaining a healthy profit margin.
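To get a feel for why shipping caches is expensive, the sketch below estimates the raw size of a KV cache and how long it would take to move over a network link. The shape values in `CONFIG` (36 layers, 8 KV heads, head dimension 128, 2 bytes per value) are illustrative assumptions for a model of roughly this scale, not figures taken from the paper, and should be checked against the actual model configuration. The 1 Gbit/s link is likewise a made-up example.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """Uncompressed KV cache size for a given context length."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

def transfer_seconds(size_bytes, link_bits_per_second):
    """Idealized transfer time, ignoring protocol overhead and latency."""
    return size_bytes * 8 / link_bits_per_second

# Assumed model shape (hypothetical values, see the note above).
CONFIG = dict(layers=36, kv_heads=8, head_dim=128, bytes_per_value=2)

for tokens in (1_024, 8_192, 32_768):
    size = kv_cache_bytes(tokens, **CONFIG)
    seconds = transfer_seconds(size, 1e9)
    print(f"{tokens:>6} tokens: {size / 2**30:.2f} GiB, {seconds:.1f} s at 1 Gbit/s")
```

With these assumed values each token contributes 147,456 bytes (144 KiB), so the cache grows linearly with context length and reaches gigabytes for long documents. Whether moving it costs more than re-prefilling depends on the bandwidth and compute prices involved, which this sketch does not model.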
Limitations and Future Work
While the current method is highly effective for shared prefixes, it does not yet support "fusing" caches from different, independently prefilled document fragments without some loss of accuracy. Additionally, while compressing the cache (such as through quantization) could reduce storage needs, the paper notes that uniform quantization can break the bit-exactness of the results. Future developments in lossless KV compression and the creation of a cross-party payment layer are identified as the necessary next steps to fully realize this agent-native prefill CDN.
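The effect of quantization on bit-exactness can be seen in a minimal sketch. This uses a generic symmetric uniform quantizer with one scale per tensor, which is an assumption for illustration and not necessarily the scheme the paper evaluated. The function names and sample values are hypothetical.

```python
def quantize_uniform(values, bits=8):
    """Symmetric uniform quantization with a single shared scale."""
    levels = 2 ** (bits - 1) - 1              # 127 for 8 bits
    peak = max(abs(v) for v in values)
    scale = peak / levels if peak else 1.0    # avoid dividing by zero
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]

# Made-up cache entries standing in for real key/value activations.
original = [0.1234, -0.5678, 0.9012, 0.3456]

codes, scale = quantize_uniform(original)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(original, restored))

print(restored == original)    # False: the round trip is lossy
print(max_error <= scale / 2 + 1e-12)  # True: error stays within half a step
```

Each value moves by at most half a quantization step, which is small but nonzero. Since later tokens are computed from the cached values, even such small perturbations mean the continuation is no longer guaranteed to match a fresh prefill bit for bit.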