TriAttention: 2.5x Faster LLM Throughput via KV Cache Compression

Key Takeaways

  • Enables complex long-chain reasoning on consumer hardware by reducing KV cache memory usage by up to 10.7×.
  • Achieves a 2.5× increase in throughput while maintaining accuracy levels comparable to full attention models.
  • Solves the 'retrieval head' eviction problem, allowing models to retain critical information over long sequences without performance degradation.

Researchers from MIT, NVIDIA, and Zhejiang University have introduced TriAttention, a novel KV cache compression method designed to maintain the performance of full attention while significantly improving efficiency. By addressing the limitations of existing compression techniques, TriAttention achieves a 2.5× increase in throughput and up to 10.7× reduction in memory usage, enabling complex long-chain reasoning tasks to run on more accessible hardware.

The Limitation of Post-RoPE Compression

Standard KV cache compression methods, such as SnapKV, H2O, and R-KV, typically identify and retain important tokens based on attention scores. These methods operate in the post-RoPE space, that is, after Rotary Position Embeddings (RoPE) have rotated the query and key vectors according to their positions. Because this rotation changes a query vector's orientation depending on where it sits in the sequence, only the most recent queries are representative of upcoming ones, so the effective window for estimating token importance is extremely narrow, often peaking at only 25 queries.

This narrow observation window leads to the premature eviction of critical tokens, particularly those used by retrieval heads. In long-chain reasoning tasks, these tokens may remain dormant for thousands of steps before becoming essential. When they are evicted for lack of recent attention, the model can no longer recall the information it needs, and the reasoning process breaks down.
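
The position dependence described above can be illustrated with a minimal sketch. It assumes the common "rotate-half" RoPE layout with base 10000 and a random 64-dimensional vector; the helper names `rope` and `cosine` are illustrative and do not come from the TriAttention code. It applies RoPE to one fixed pre-RoPE query at several positions and prints how far the rotated query drifts from its position-0 orientation.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to an even-length vector x at position pos."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one rotation frequency per 2-D pair
    angles = pos * freqs
    x1, x2 = x[:half], x[half:]                 # pair component i with component i + half
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)])

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
q = rng.normal(size=64)   # one fixed pre-RoPE query (random, for illustration only)

# The same pre-RoPE vector points in a different direction at each position.
for pos in (0, 1, 25, 1000):
    print(pos, round(cosine(rope(q, 0), rope(q, pos)), 3))
```

The rotation preserves the vector's norm, so only its direction changes. That is why attention scores observed from queries at one position say little about how a query at a distant position will score the same key.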

Leveraging Pre-RoPE Q/K Concentration

The research team discovered that Query and Key vectors exhibit a consistent property in the pre-RoPE space: they cluster tightly around fixed, non-zero center points. This phenomenon, termed Q/K concentration, is an intrinsic characteristic of the model's learned weights rather than a result of specific inputs. Measuring it with the Mean Resultant Length, a directional statistic that ranges from 0 for scattered directions to 1 for perfectly aligned ones, the researchers found that approximately 90% of attention heads in models like Qwen3-8B exhibit high concentration, a trait that remains stable across different domains, including math, coding, and chat.
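
As a rough illustration of the concentration measurement, the sketch below uses the textbook definition of Mean Resultant Length: the norm of the average unit vector. This is an assumption; the researchers may compute it differently, for example per head or per frequency band. The data is synthetic, and the center, noise scale, and function name are hypothetical.

```python
import numpy as np

def mean_resultant_length(vectors):
    """Norm of the mean unit vector: near 1 = concentrated, near 0 = spread out."""
    units = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return float(np.linalg.norm(units.mean(axis=0)))

rng = np.random.default_rng(0)
n, d = 2000, 64

# Hypothetical non-zero center of norm 4 (not taken from any real model).
center = rng.normal(size=d)
center = 4.0 * center / np.linalg.norm(center)

clustered = center + 0.2 * rng.normal(size=(n, d))   # vectors concentrated around the center
isotropic = rng.normal(size=(n, d))                  # vectors with no preferred direction

mrl_clustered = mean_resultant_length(clustered)
mrl_isotropic = mean_resultant_length(isotropic)
print(f"clustered: {mrl_clustered:.3f}  isotropic: {mrl_isotropic:.3f}")
```

A head whose pre-RoPE queries or keys behave like the `clustered` sample would count as highly concentrated, while one resembling the `isotropic` sample would not.
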
Because these centers remain fixed in the pre-RoPE space, the attention logit can be mathematically simplified into a trigonometric series that depends solely on the positional distance between queries and keys. This allows TriAttention to score keys offline using calibration data, eliminating the need for live query observations. The method combines a trigonometric series score with a norm-based score to adaptively retain the most salient tokens, ensuring that critical intermediate states are preserved during complex tasks.
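
The simplification can be checked numerically with a small sketch. It relies only on the standard RoPE identity: treating each 2-D pair as a complex number, the logit between a query at position m and a key at position n is the sum over frequencies of A_f cos(ω_f (m − n) + φ_f). The centers below are random stand-ins, the function names are hypothetical, and this is not the scoring rule TriAttention actually uses, which also includes a norm-based term.

```python
import numpy as np

def rope_complex(z, pos, freqs):
    """RoPE on complex pairs: rotate component f by the angle pos * freqs[f]."""
    return z * np.exp(1j * pos * freqs)

def logit_direct(qz, kz, m, n, freqs):
    """Dot product of the rotated query (position m) and rotated key (position n)."""
    q_rot, k_rot = rope_complex(qz, m, freqs), rope_complex(kz, n, freqs)
    return float(np.sum(q_rot * np.conj(k_rot)).real)

def logit_trig_series(qz, kz, delta, freqs):
    """The same logit as sum_f A_f * cos(freqs[f] * delta + phi_f), with delta = m - n."""
    prod = qz * np.conj(kz)
    amp, phase = np.abs(prod), np.angle(prod)
    return float(np.sum(amp * np.cos(freqs * delta + phase)))

half = 32
freqs = 10000.0 ** (-np.arange(half) / half)
rng = np.random.default_rng(0)
q_center = rng.normal(size=half) + 1j * rng.normal(size=half)  # hypothetical Q center
k_center = rng.normal(size=half) + 1j * rng.normal(size=half)  # hypothetical K center

a = logit_direct(q_center, k_center, 1200, 1000, freqs)   # distance 200
b = logit_direct(q_center, k_center, 700, 500, freqs)     # distance 200, other positions
c = logit_trig_series(q_center, k_center, 200, freqs)     # series evaluated at distance 200
print(np.isclose(a, b), np.isclose(a, c))
```

Because the amplitudes and phases depend only on the two centers, they can be computed once from calibration data, after which a key can be scored for any future distance without observing a live query.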

Performance and Generalization

Experimental results show that TriAttention maintains high accuracy on demanding benchmarks. On the AIME25 mathematical reasoning benchmark, the method matches full-attention accuracy while significantly outperforming existing baselines. It also performs robustly on recursive tasks, avoiding the catastrophic accuracy degradation that other methods suffer when intermediate reasoning states are required.

Beyond mathematical benchmarks, TriAttention is effective across a variety of general language tasks. On the LongBench benchmark, it outperforms other compression methods, winning 11 of 16 subtasks. The method also enables a 32B reasoning model to run on a single 24 GB RTX 4090, making it a practical option for deploying large models on consumer hardware.
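
To give a sense of why cache size matters on a 24 GB card, the following back-of-the-envelope sketch estimates KV cache memory for one sequence. The layer count, KV head count, head dimension, and sequence length are hypothetical placeholders, not the configuration of any model mentioned here. Only the 10.7× factor comes from the reported results, and it is a best-case figure.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to store keys and values for one sequence (the factor 2 is K plus V)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical configuration, NOT taken from the article or from a specific model.
full = kv_cache_bytes(num_layers=64, num_kv_heads=8, head_dim=128, seq_len=32_768)

# Apply the article's best-case reduction factor.
compressed = full / 10.7

print(f"full: {full / 2**30:.2f} GiB, compressed: {compressed / 2**30:.2f} GiB")
```

The cache grows linearly with sequence length, so long reasoning traces are exactly where a fixed memory budget is exhausted first and where compression frees the most room.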
