Can I Buy Your KV Cache?

Paper Abstract

Right now, across the world, AI agents are repeating the same absurd act: to read one document, they each recompute it from scratch. Every agent re-runs prefill, the most compute-intensive step a large model takes, over identical text, only to rebuild a key-value (KV) cache identical to the one the agent before it just built. The same answer, computed a million times. We make a proposal that is almost offensively simple: compute it once. Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill. It works, and it is token-exact: loading a precomputed KV and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill's attention scales with L^2), so a single reuse already pays it back. Then the part that matters: where the KV lives. Shipping it fails, because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. Hosting it provider-side, exactly as production prompt-caching works, removes egress entirely. The size of the prize is set by our measured compute saving: serving one hot 3774-token document to 80M agents costs ~$1.5M to re-prefill but only ~$0.03M of reuse compute (49.7x less). The 0.1x cache-read tariff APIs charge passes a 10x discount to users while sitting inside this measured envelope, so the 10x is a floor that the measured ~50x compute saving clears, and the gap to the physical ~50x is provider margin: millions of dollars per popular document. We frame the resulting agent-native prefill CDN and leave lossless KV compression and a cross-party payment layer as the open problems.

Can I Buy Your KV Cache?
AI agents currently waste large amounts of compute by repeating the same "prefill" step (reading a document and building its key-value (KV) cache) every time they encounter the same content. This paper proposes a simple alternative: compute the KV cache once, store it as a reusable artifact, and let other agents "buy" the right to load it instead of recomputing it from scratch. Treating the KV cache as a shareable, first-class asset yields a "prefill CDN". In the paper's example, serving one 3774-token document to 80M agents costs about $1.5M to re-prefill but only about $0.03M of reuse compute.

The Logic of Reuse

The core idea is to shift from every agent performing its own prefill to a publisher providing a precomputed KV cache. Because the KV cache is a deterministic function of the model and the input text, loading a precomputed cache and continuing generation is "token-exact": the output is identical to what the agent would have produced had it run the prefill itself. The paper reports a 24/24 match on greedy-decoded tokens, with agreement at the logits level as well, so the cost savings come with no loss in accuracy.
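The token-exact property can be illustrated with a toy model. The sketch below is not the paper's code: it assumes a hypothetical two-layer, single-head causal attention stack with made-up weights and embeddings, written in pure Python. Because attention is causal, the keys and values of the document tokens do not depend on what comes after them, so a cache built by the publisher can be copied and continued by any agent.

```python
import copy
import math
import random

DIM = 4
rng = random.Random(0)  # fixed seed: toy "model weights", purely illustrative

def rand_matrix():
    return [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]

# Two toy layers, each with query/key/value projections (hypothetical model).
LAYERS = [{"q": rand_matrix(), "k": rand_matrix(), "v": rand_matrix()} for _ in range(2)]

def matvec(matrix, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]

def embed(token_id):
    # Hypothetical embedding: a fixed vector per token id.
    return [math.sin(token_id * (i + 1)) for i in range(DIM)]

def step(token_id, cache):
    """Process one token, appending its keys and values to the per-layer cache."""
    x = embed(token_id)
    for layer, kv in zip(LAYERS, cache):
        q = matvec(layer["q"], x)
        kv["k"].append(matvec(layer["k"], x))
        kv["v"].append(matvec(layer["v"], x))
        # Causal attention: the new token attends to every cached position, itself included.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(DIM) for k in kv["k"]]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        attended = [sum(w * v[i] for w, v in zip(weights, kv["v"])) / total
                    for i in range(DIM)]
        x = [a + b for a, b in zip(x, attended)]  # residual connection
    return x

def prefill(tokens):
    """Build a KV cache for `tokens` and return it with the last hidden state."""
    cache = [{"k": [], "v": []} for _ in LAYERS]
    out = None
    for t in tokens:
        out = step(t, cache)
    return cache, out

document = [3, 1, 4, 1, 5, 9, 2, 6]  # shared document tokens
question = [5, 3, 5]                 # agent-specific continuation

# Path A: the agent prefills document + question from scratch.
_, scratch_out = prefill(document + question)

# Path B: the publisher prefills the document once; the agent loads a copy and continues.
published_cache, _ = prefill(document)
agent_cache = copy.deepcopy(published_cache)
reuse_out = None
for t in question:
    reuse_out = step(t, agent_cache)

print(scratch_out == reuse_out)  # True: both paths run the same arithmetic in the same order
```

In this toy, both paths execute identical floating-point operations, so the match is exact by construction. Real inference stacks use batched kernels whose numerics can differ between code paths, which is why the paper's measured token-level and logits-level agreement is the relevant evidence.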

Why It Is Economically Efficient

The paper shows that reusing a cache costs far less than re-prefilling. On Qwen3-4B, the researchers measured reuse to be between 9 and 50 times cheaper in compute than a full prefill. This advantage grows with document length, because the attention component of prefill scales quadratically with length, while the cost of loading a cache grows much more slowly. The break-even point is almost immediate: a single reuse already repays the cost of precomputing the cache.
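The break-even claim follows from simple arithmetic, sketched below under a stated assumption: a publisher pays one full prefill up front, and every reader then pays the prefill cost divided by the saving ratio. The function names are illustrative, and only the 9x and 50x ratios come from the paper.

```python
def total_cost(readers, prefill_cost, saving_ratio, cached):
    """Total compute cost of serving `readers` reads of one document."""
    if not cached:
        return readers * prefill_cost          # every reader re-prefills
    reuse_cost = prefill_cost / saving_ratio   # each reader only loads the cache
    return prefill_cost + readers * reuse_cost # one publisher prefill, then reuse

def break_even_readers(saving_ratio):
    """Smallest number of readers for which caching is strictly cheaper."""
    readers = 1
    while (total_cost(readers, 1.0, saving_ratio, cached=True)
           >= total_cost(readers, 1.0, saving_ratio, cached=False)):
        readers += 1
    return readers

for ratio in (9, 50):  # the low and high ends of the measured range
    print(ratio, break_even_readers(ratio))
```

Under this cost model, caching wins from the second reader onward at both ends of the measured range, which matches the claim that the precompute pays for itself almost immediately.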

The Importance of Hosting

A major question is where these caches should live. Shipping KV caches to individual users fails because they are nearly incompressible, so the per-load egress cost exceeds the cost of the prefill it saves. The paper argues for hosting caches provider-side, exactly as production prompt caching already works. Keeping the cache inside the provider's infrastructure removes egress entirely. APIs currently charge a cache-read tariff of 0.1x, which passes a 10x discount to users; the measured compute saving of roughly 50x leaves the remaining gap as provider margin.
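A back-of-the-envelope estimate shows why shipping the cache loses to re-prefilling. The document length, agent count, and re-prefill total below are taken from the abstract. Everything else is an assumption for illustration and is not taken from the paper: the layer count, KV head count, and head dimension are assumed values for a Qwen3-4B-class model, the cache is assumed to be stored uncompressed at 2 bytes per value, and the egress price is a hypothetical per-GB rate.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """Uncompressed KV cache size; the factor of 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Figures quoted in the abstract.
DOC_TOKENS = 3774
AGENTS = 80_000_000
REPREFILL_TOTAL_USD = 1_500_000

# ASSUMPTIONS (not from the paper): model shape, storage precision, egress price.
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128
BYTES_PER_VALUE = 2          # e.g. 16-bit floats
EGRESS_USD_PER_GB = 0.09     # hypothetical cloud egress rate

cache_bytes = kv_cache_bytes(DOC_TOKENS, LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)
egress_per_load = cache_bytes / 1e9 * EGRESS_USD_PER_GB
prefill_per_load = REPREFILL_TOTAL_USD / AGENTS

print(f"cache size:       {cache_bytes / 1e6:.0f} MB")
print(f"egress per load:  ${egress_per_load:.4f}")
print(f"prefill per load: ${prefill_per_load:.4f}")
print("shipping is cheaper than prefill:", egress_per_load < prefill_per_load)
```

Under these assumed parameters the cache runs to hundreds of megabytes for a document of a few thousand tokens, and moving it once costs more than the prefill it would replace. The exact ratio depends entirely on the assumed values, but the direction is consistent with the paper's argument for provider-side hosting.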

Limitations and Future Work

While the current method is highly effective for shared prefixes, it does not yet support "fusing" caches from different, independently prefilled document fragments without some loss of accuracy. Additionally, while compressing the cache (such as through quantization) could reduce storage needs, the paper notes that uniform quantization can break the bit-exactness of the results. Future developments in lossless KV compression and the creation of a cross-party payment layer are identified as the necessary next steps to fully realize this agent-native prefill CDN.
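The tension between compression and bit-exactness can be seen in a minimal sketch. The code below assumes a simple symmetric uniform quantizer applied to one made-up row of cache values; it is not the scheme evaluated in the paper, only an illustration of why a quantize and dequantize round trip does not return the original numbers.

```python
def quantize_uniform(values, bits=8):
    """Symmetric uniform quantization: map floats to signed integer codes."""
    levels = 2 ** (bits - 1) - 1              # 127 for 8 bits
    scale = max(abs(v) for v in values) / levels
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

# Hypothetical slice of a KV cache row (values made up for illustration).
kv_row = [0.1234567, -0.7654321, 0.3333333, 0.9999999]

codes, scale = quantize_uniform(kv_row)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(kv_row, restored))

print(restored == kv_row)        # False: the round trip is lossy
print(max_error <= scale / 2)    # error is bounded by half a quantization step
```

The per-value error is small, but any nonzero error means downstream logits are no longer guaranteed to match a fresh prefill, which is why the paper lists lossless KV compression as an open problem.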