RedKnot is a new system designed to make Large Language Model (LLM) serving faster and more efficient by changing how the KV cache, a memory-intensive component of LLM inference, is managed. Current systems treat the KV cache as a single, monolithic block, which leads to wasted memory and redundant computation. RedKnot instead breaks the cache down by individual attention head, allowing the system to apply different strategies to different parts of the model. This approach significantly reduces time to first token (TTFT) and increases the number of concurrent users a single GPU can support.
Moving Beyond Monolithic Caching
Traditional LLM serving systems treat the KV cache as a uniform sequence of data, applying the same management policies to every part of the model. However, research shows that different "heads" within an LLM have different roles and importance levels. Some heads require full access to the entire context, while others only need to focus on recent information. By decomposing the cache along these head dimensions, RedKnot avoids the "one-size-fits-all" trap, allowing the system to reuse cache data more intelligently without sacrificing the accuracy of the model's output.
Head-Aware Optimization
RedKnot uses a strategy called "head-class sparsification" to classify every head in every layer as either "global" or "local." Global heads (roughly 12–15% of the total) are recomputed when necessary to ensure high fidelity, while local heads (roughly 85–88%) are reused verbatim from previous sessions. To store this data, the system employs "SegPagedAttention," a custom memory management design that keeps only the necessary tokens for each head. This allows the system to stay on high-performance computation paths (like FlashAttention) without the overhead of complex masking, which often slows down traditional systems.
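The head-class idea can be illustrated with a small sketch. It is not RedKnot's implementation: the names `HeadPlan`, `plan_head_cache`, and `cache_fraction` are hypothetical, and the rule that a local head keeps only a fixed window of the most recent tokens is an assumption made for illustration, since the summary does not say how "necessary tokens" are chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeadPlan:
    layer: int
    head: int
    is_global: bool
    kept_tokens: range  # token positions retained in this head's cache


def plan_head_cache(global_heads, num_layers, num_heads, context_len, local_window):
    """Build a per-head retention plan.

    global_heads: set of (layer, head) pairs classified as global.
    Assumption: local heads keep only the last `local_window` tokens.
    """
    plans = []
    for layer in range(num_layers):
        for head in range(num_heads):
            is_global = (layer, head) in global_heads
            if is_global:
                kept = range(0, context_len)  # full context
            else:
                kept = range(max(0, context_len - local_window), context_len)
            plans.append(HeadPlan(layer, head, is_global, kept))
    return plans


def cache_fraction(plans, context_len):
    """Fraction of a dense (every head keeps every token) cache that is retained."""
    kept = sum(len(p.kept_tokens) for p in plans)
    return kept / (len(plans) * context_len)


# Toy configuration: 8 heads in total, 1 of them global (12.5%).
plans = plan_head_cache({(0, 2)}, num_layers=2, num_heads=4,
                        context_len=1000, local_window=100)
print(cache_fraction(plans, 1000))
```

In this toy setup one head keeps all 1000 tokens and seven heads keep 100 each, so the retained cache is 1700 of 8000 token slots. The numbers are illustrative only and are not taken from the evaluation.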
Addressing the FFN Bottleneck
A major insight in this research is that for many common tasks, such as agentic workflows or shorter prompts, the bottleneck is not only the attention mechanism but also the Feed-Forward Network (FFN). RedKnot addresses this with token-selective FFN recovery: it identifies the most important tokens and evaluates the FFN only for them, skipping the rest. Because this optimization is independent of the KV cache, it works in tandem with the head-aware attention improvements to provide speedups across a wide variety of context lengths.
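A minimal sketch of token-selective FFN evaluation follows. It rests on assumptions the summary does not confirm: each token has a precomputed importance score, a fixed budget of tokens is selected, and the remaining tokens reuse a previously cached FFN output. The function names `select_tokens` and `selective_ffn` are hypothetical, and scalars stand in for hidden-state vectors.

```python
def select_tokens(scores, budget):
    """Return the indices of the `budget` highest-scoring tokens, in position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def selective_ffn(hidden, cached_out, scores, budget, ffn):
    """Run `ffn` only on the selected tokens.

    Assumption: every other position keeps its cached output unchanged.
    """
    out = list(cached_out)  # start from the reused outputs
    for i in select_tokens(scores, budget):
        out[i] = ffn(hidden[i])  # recompute only the important tokens
    return out


# Toy example: a stand-in "FFN" that doubles its input.
hidden = [1.0, 2.0, 3.0, 4.0]
cached = [0.0, 0.0, 0.0, 0.0]
scores = [0.1, 0.9, 0.2, 0.8]
print(selective_ffn(hidden, cached, scores, budget=2, ffn=lambda x: 2 * x))
```

With a budget of 2, only positions 1 and 3 are recomputed, so FFN work scales with the budget rather than with the full sequence length.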
Performance and Results
By aligning the recovery, compute, and storage granularities, RedKnot achieves significant improvements over standard dense attention methods. In evaluations using models like Llama-3.3-70B and Qwen3-32B, the system delivered between 1.6x and 3.5x faster time-to-first-token and supported 4.7x to 7.8x more concurrent sessions per GPU. Furthermore, it reduced the total number of floating-point operations (FLOPs) required for prefill by 67% to 79.5%, all while maintaining output quality comparable to the dense baseline. These gains are achieved without requiring any model retraining or fine-tuning.
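To put the FLOP figures in perspective, the short calculation below converts a fractional FLOP reduction into the speedup it would imply if prefill time scaled exactly with FLOP count. That proportionality is a simplifying assumption, not a claim from the evaluation, and `ideal_speedup` is a hypothetical helper name.

```python
def ideal_speedup(flop_reduction):
    """Speedup implied by removing a fraction of FLOPs, assuming time is proportional to FLOPs."""
    if not 0.0 <= flop_reduction < 1.0:
        raise ValueError("flop_reduction must be in [0, 1)")
    return 1.0 / (1.0 - flop_reduction)


for reduction in (0.67, 0.795):
    print(f"{reduction:.1%} fewer FLOPs -> at most {ideal_speedup(reduction):.2f}x")
```

Under this assumption, a 67% reduction corresponds to roughly 3.03x and a 79.5% reduction to roughly 4.88x. Measured time-to-first-token also depends on memory traffic, kernel efficiency, and scheduling, so it need not track these figures.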