RedKnot: Efficient Long-Context LLM Serving with He...

Key Takeaways

  • RedKnot is a new system designed to make Large Language Model (LLM) serving faster and more efficient by changing how the KV cache, a memory-intensive component of LLMs, is managed.
  • As the input length of large language model (LLM) serving continues to grow, the KV cache has become a dominant bottleneck in AI infrastructure.
  • It limits GPU memory capacity, serving concurrency, cache reuse, and distributed scalability.
  • We observe that KV cache utility is highly structured across KV heads: different heads exhibit different functional roles, attention distances, and runtime importance.
  • Therefore, a full KV cache is not always necessary for every head, token range, or serving scenario.

Paper Abstract

As the input length of large language model (LLM) serving continues to grow, the KV cache has become a dominant bottleneck in AI infrastructure. It limits GPU memory capacity, serving concurrency, cache reuse, and distributed scalability. Several important problems, including position-independent KV cache, prefix KV cache compression, hot/cold KV cache separation, and distributed KV cache management, all depend on how the KV cache is represented and managed. However, existing serving systems largely rely on a monolithic KV cache abstraction, where the KV cache is treated as a homogeneous sequence of token-level memory blocks and managed with similar policies across attention heads and serving scenarios. We observe that KV cache utility is highly structured across KV heads: different heads exhibit different functional roles, attention distances, and runtime importance. Therefore, a full KV cache is not always necessary for every head, token range, or serving scenario. We present RedKnot, a head-aware KV cache management system for LLM serving. RedKnot breaks the conventional monolithic KV cache abstraction by decomposing the KV cache along KV heads, whose importance and effective attention ranges vary significantly across serving scenarios. This head-level decomposition turns the KV cache from a monolithic tensor abstraction into a structured memory object, enabling RedKnot to uniformly support position-independent KV reuse, prefix KV compression, hot/cold KV separation, and distributed KV placement while preserving output fidelity and improving resource efficiency, without requiring model retraining or fine-tuning. RedKnot establishes a new foundation for AI infrastructure by transforming the KV cache from a monolithic, passive runtime artifact into a dynamic, model-aware runtime substrate for scalable LLM serving.

RedKnot is a system designed to make Large Language Model (LLM) serving faster and more efficient by changing how the KV cache, a memory-intensive component of LLM inference, is managed. Current systems treat the KV cache as a single, monolithic block, which wastes memory and causes redundant computation. RedKnot instead decomposes the cache by KV head, allowing the system to apply different strategies to different heads. According to the reported results, this reduces the time to first token (TTFT) and increases the number of concurrent sessions a single GPU can support.

Moving Beyond Monolithic Caching

Traditional LLM serving systems treat the KV cache as a uniform sequence of token-level blocks and apply the same management policies to every head. However, the authors observe that different KV heads within an LLM have different functional roles, attention distances, and runtime importance. Some heads need access to the entire context, while others attend mostly to recent tokens. By decomposing the cache along the head dimension, RedKnot avoids a one-size-fits-all policy and can reuse cached data selectively while preserving the fidelity of the model's output.
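
To make the memory argument concrete, the following is a minimal back-of-the-envelope sketch rather than anything taken from the paper. Every number in it (context length, head count, head dimension, layer count, window size, and the split of one full-context head versus seven recent-only heads) is a hypothetical assumption chosen for illustration, as is the helper name `kv_cache_bytes`.

```python
def kv_cache_bytes(tokens_per_head, head_dim, num_layers, bytes_per_elem=2):
    """Bytes needed for K and V, given how many tokens each KV head keeps in one layer.

    Assumes (for simplicity) that every layer uses the same per-head token counts.
    """
    per_layer = sum(tokens_per_head) * head_dim * 2 * bytes_per_elem  # 2 = K and V
    return per_layer * num_layers


# Hypothetical model and workload, not taken from the paper.
context_len = 32_768
num_kv_heads = 8
head_dim = 128
num_layers = 32
window = 1_024  # assumed recent-token window for heads that only look nearby

# Monolithic cache: every head keeps every token.
dense = kv_cache_bytes([context_len] * num_kv_heads, head_dim, num_layers)

# Head-aware cache: one head keeps the full context, the others keep only the window.
head_aware = kv_cache_bytes(
    [context_len] + [window] * (num_kv_heads - 1), head_dim, num_layers
)

print(f"dense:      {dense / 2**30:.2f} GiB")
print(f"head-aware: {head_aware / 2**30:.2f} GiB")
print(f"ratio:      {dense / head_aware:.2f}x")
```

The ratio depends entirely on the assumed split and window size. The point is only that per-head retention changes the memory footprint in proportion to the tokens each head actually keeps.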

Head-Aware Optimization

RedKnot uses a strategy called "head-class sparsification" to classify each head in each layer as either "global" or "local." Global heads (roughly 12–15% of the total) are recomputed when necessary to preserve fidelity, while local heads (roughly 85–88%) are reused verbatim from previous sessions. To store this data, the system employs "SegPagedAttention," a custom memory management design that keeps only the necessary tokens for each head. This lets the system stay on high-performance computation paths (like FlashAttention) without the overhead of complex masking, which often slows down traditional systems.
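
The idea of keeping only the necessary tokens per head, and why that avoids masking, can be illustrated with a small NumPy sketch. This is not the SegPagedAttention implementation. The classification rule (a threshold on mean attention distance), the fixed recent-token window, and the helper names `classify_heads` and `compact_kv` are all assumptions made for illustration. The sketch only shows that running ordinary dense attention over a head's compacted K/V gives the same result as running masked attention over the full K/V.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attend(q, k, v):
    """Single-query dense attention: q (d,), k (n, d), v (n, d) -> (d,)."""
    scores = k @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def classify_heads(mean_attn_distance, threshold):
    """Hypothetical rule: a head is 'global' if it attends far back on average."""
    return ["global" if d > threshold else "local" for d in mean_attn_distance]


def compact_kv(k, v, label, window):
    """Keep all tokens for a global head, only the last `window` for a local head."""
    if label == "global":
        return k, v
    return k[-window:], v[-window:]


rng = np.random.default_rng(0)
n, d, window = 64, 16, 8
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))
q = rng.standard_normal(d)

labels = classify_heads([40.0, 3.0], threshold=16.0)  # made-up distances for two heads

# Local head: dense attention over the compacted cache, no mask needed.
k_loc, v_loc = compact_kv(k, v, labels[1], window)
out_compact = attend(q, k_loc, v_loc)

# Reference: masked attention over the full cache for the same head.
scores = k @ q / np.sqrt(d)
scores[:-window] = -np.inf
out_masked = softmax(scores) @ v

print(labels, np.allclose(out_compact, out_masked))
```

Because the compacted tensors are contiguous and unmasked, a standard dense attention kernel can consume them directly, which is the property the paragraph above attributes to the storage design.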

Addressing the FFN Bottleneck

A major insight of this work is that for many common workloads, such as agentic workflows or shorter prompts, the bottleneck is not just the attention mechanism; it is the Feed-Forward Network (FFN). RedKnot addresses this with token-selective FFN recovery: it identifies the most important tokens and evaluates the FFN only for those, skipping the rest. Because this optimization is independent of the KV cache, it works in tandem with the head-aware attention improvements to provide speedups across a wide range of context lengths.
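
A minimal sketch of token-selective FFN evaluation follows. The summary does not say how tokens are scored or what happens to the tokens that are skipped, so two assumptions are made here and should be read as such: importance is taken to be how much a token's FFN input changed relative to a cached run, and skipped tokens reuse their cached FFN output. The function names and the `keep_ratio` parameter are hypothetical.

```python
import numpy as np


def ffn(x, w1, w2):
    """Plain two-layer FFN with ReLU: x (n, d) -> (n, d)."""
    return np.maximum(x @ w1, 0.0) @ w2


def selective_ffn(x, cached_out, importance, keep_ratio, w1, w2):
    """Recompute the FFN only for the highest-importance tokens.

    Tokens that are not selected keep their cached output (an assumption
    of this sketch, not a statement about RedKnot's behavior).
    """
    n = x.shape[0]
    k = max(1, int(round(n * keep_ratio)))
    selected = np.argsort(importance)[-k:]  # indices of the k largest scores
    out = cached_out.copy()
    out[selected] = ffn(x[selected], w1, w2)
    return out, selected


rng = np.random.default_rng(0)
n, d, h = 10, 8, 32
w1 = rng.standard_normal((d, h))
w2 = rng.standard_normal((h, d))

x_old = rng.standard_normal((n, d))
cached_out = ffn(x_old, w1, w2)  # output from an earlier run

x_new = x_old.copy()
x_new[[2, 7]] += 1.0  # only two tokens differ in the new request

# Assumed importance score: how far each token's input moved.
importance = np.linalg.norm(x_new - x_old, axis=1)

out, selected = selective_ffn(x_new, cached_out, importance, 0.2, w1, w2)
print(sorted(selected.tolist()), np.allclose(out, ffn(x_new, w1, w2)))
```

In this toy setup only 2 of 10 tokens go through the FFN, yet the result matches a full recomputation because the skipped tokens really were unchanged. In a real model the skipped tokens would differ slightly, so the selection rule determines how much fidelity is traded for the saved compute.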

Performance and Results

By aligning the recovery, compute, and storage granularities, RedKnot achieves significant improvements over standard dense attention methods. In evaluations using models like Llama-3.3-70B and Qwen3-32B, the system delivered between 1.6x and 3.5x faster time-to-first-token and supported 4.7x to 7.8x more concurrent sessions per GPU. Furthermore, it reduced the total number of floating-point operations (FLOPs) required for prefill by 67% to 79.5%, all while maintaining output quality comparable to the dense baseline. These gains are achieved without requiring any model retraining or fine-tuning.
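
As a rough sanity check on how the FLOP figures relate to latency, the short calculation below converts a fractional FLOP reduction into the speedup it would imply if prefill time scaled linearly with the remaining FLOPs. That linear-scaling assumption is an idealization introduced here, not a claim from the paper, and the summary does not state which FLOP reduction corresponds to which TTFT measurement.

```python
def ideal_speedup_from_flop_reduction(reduction):
    """Speedup implied by removing a fraction of FLOPs, assuming time is proportional to FLOPs."""
    return 1.0 / (1.0 - reduction)


for r in (0.67, 0.795):
    s = ideal_speedup_from_flop_reduction(r)
    print(f"{r:.1%} fewer prefill FLOPs -> about {s:.2f}x under the linear assumption")
```

Under that assumption, the reported 67% to 79.5% reductions correspond to roughly 3.0x to 4.9x, the same order of magnitude as the reported 1.6x to 3.5x TTFT improvements. Measured latency also depends on memory traffic, kernel efficiency, and scheduling, so the two ranges are not expected to match exactly.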
