ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

Paper Abstract

Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value (KV) cache growth. Current decoding-time compression methods mitigate this issue via token eviction, but typically assume a uniform budget distribution across all layers and heads. In contrast, existing non-uniform budget allocation methods are predominantly designed for the static prompt prefill phase, and they do not capture the stepwise context demands of autoregressive reasoning. To bridge this gap, we propose ReasonAlloc, a training-free framework that recasts decoding-time KV compression as a hierarchical budget allocation problem. ReasonAlloc operates at two complementary levels: an offline layer-wise preallocation strategy captures an architecture-driven demand pattern, which we call the "Reasoning Wave", while an online head-wise strategy reallocates resources during decoding to information-rich heads based on real-time utility. Evaluations on mathematical reasoning benchmarks (MATH-500, AIME 2024) using DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, and AceReason-14B show that ReasonAlloc outperforms uniform-budget R-KV, SnapKV, and Pyramid-RKV (a baseline enforcing a static, monotonically decreasing layer budget), with the largest gains at small budgets (128-512 tokens). ReasonAlloc is plug-and-play with existing token-eviction policies and introduces negligible inference-time overhead.

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models addresses the memory bottleneck caused by the large key-value (KV) caches that modern reasoning models require. As these models generate long chains of thought, the cache grows rapidly, limiting inference speed and batch size. Existing methods compress the cache by evicting tokens, but they often distribute the memory budget uniformly across all layers and heads, which fails to account for the specific, shifting needs of reasoning tasks. ReasonAlloc introduces a training-free, hierarchical framework that allocates the cache budget according to the importance of individual layers and heads during decoding.

The "Reasoning Wave" and Layer-Wise Allocation

The authors report that reasoning models exhibit a consistent, architecture-driven pattern of memory demand, which they term the "Reasoning Wave." Contrary to the assumption behind pyramid-style schedules, in which the budget decreases monotonically with layer depth, this pattern shows that shallow layers require significant memory for context, middle layers need less as they perform specific logical deductions, and deep layers experience a sudden surge in demand to verify the final output. ReasonAlloc uses an offline calibration process to preallocate budgets to layers based on this stable, architecture-specific wave, so that critical reasoning pathways are not starved of memory.
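
To make the layer-wise idea concrete, here is a minimal sketch of proportional budget preallocation. It is not the paper's calibration procedure: the function name `allocate_layer_budgets`, the per-layer `demand` scores, the `min_budget` floor, and the largest-remainder rounding are all assumptions chosen for illustration. The only property taken from the text is that layers with higher calibrated demand receive a larger share of a fixed total budget.

```python
from typing import List


def allocate_layer_budgets(demand: List[float], avg_budget: int,
                           min_budget: int = 16) -> List[int]:
    """Split avg_budget * num_layers cache slots across layers.

    `demand` holds hypothetical non-negative per-layer scores from an
    offline calibration run. Every layer keeps at least `min_budget`
    slots; the remainder is shared in proportion to demand.
    """
    num_layers = len(demand)
    total = avg_budget * num_layers
    if min_budget * num_layers > total:
        raise ValueError("per-layer floor exceeds the total budget")
    if any(d < 0 for d in demand) or sum(demand) <= 0:
        raise ValueError("demand scores must be non-negative and not all zero")

    spare = total - min_budget * num_layers
    norm = sum(demand)
    shares = [spare * d / norm for d in demand]
    budgets = [min_budget + int(s) for s in shares]

    # Largest-remainder rounding so the integer budgets sum to `total`.
    leftover = total - sum(budgets)
    order = sorted(range(num_layers),
                   key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


# Hypothetical wave-shaped demand: high, dipping in the middle, surging late.
wave = [4.0, 3.0, 1.0, 1.0, 2.0, 5.0]
print(allocate_layer_budgets(wave, avg_budget=128))
```

Because the allocation depends only on calibration statistics, a table like this could be computed once per architecture and reused at inference time, which is consistent with the offline nature of the layer-level step described above.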

Dynamic Head-Wise Routing

Beyond the layer level, the importance of individual attention heads fluctuates significantly as the model generates text. ReasonAlloc manages this through an online routing strategy that refreshes periodically during decoding. By analyzing the "utility" of tokens in real time, the system dynamically shifts memory budgets toward heads that are currently processing high-value information. To prevent "starvation loops", in which a head loses its memory and subsequently fails to process future information, the framework includes a protection mechanism that guarantees every head a minimum baseline of capacity.
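
The periodic refresh and the per-head floor can be sketched as follows. This is an illustrative toy, not ReasonAlloc's actual routing rule: the class `HeadBudgetRouter`, the exponential moving average over utilities, and the parameters `refresh_every`, `decay`, and `floor` are assumptions introduced here. What it demonstrates is the interaction described above: budgets follow recent utility, but no head ever drops below its floor.

```python
from typing import List


class HeadBudgetRouter:
    """Toy online router: re-splits one layer's budget across its heads."""

    def __init__(self, num_heads: int, layer_budget: int, floor: int,
                 refresh_every: int = 64, decay: float = 0.9) -> None:
        if floor * num_heads > layer_budget:
            raise ValueError("per-head floor exceeds the layer budget")
        self.layer_budget = layer_budget
        self.floor = floor
        self.refresh_every = refresh_every
        self.decay = decay
        self.utility = [0.0] * num_heads
        self.step = 0
        # Start from a uniform split until utilities have been observed.
        base, extra = divmod(layer_budget, num_heads)
        self.budgets = [base + (1 if h < extra else 0) for h in range(num_heads)]

    def observe(self, head_utility: List[float]) -> List[int]:
        """Record one decoding step of non-negative per-head utilities."""
        self.utility = [self.decay * u + (1.0 - self.decay) * x
                        for u, x in zip(self.utility, head_utility)]
        self.step += 1
        if self.step % self.refresh_every == 0:
            self.budgets = self._reallocate()
        return self.budgets

    def _reallocate(self) -> List[int]:
        n = len(self.utility)
        spare = self.layer_budget - self.floor * n
        total_u = sum(self.utility)
        if total_u <= 0:
            shares = [spare / n] * n
        else:
            shares = [spare * u / total_u for u in self.utility]
        # The floor is granted first, so a zero-utility head keeps capacity.
        budgets = [self.floor + int(s) for s in shares]
        leftover = self.layer_budget - sum(budgets)
        order = sorted(range(n), key=lambda h: shares[h] - int(shares[h]),
                       reverse=True)
        for h in order[:leftover]:
            budgets[h] += 1
        return budgets


router = HeadBudgetRouter(num_heads=4, layer_budget=256, floor=16,
                          refresh_every=2, decay=0.0)
router.observe([1.0, 0.0, 0.0, 0.0])         # step 1: still uniform
print(router.observe([3.0, 1.0, 0.0, 0.0]))  # step 2: refresh happens
```

In this sketch the two heads with zero utility still hold 16 slots each after the refresh, so they can retain tokens and regain utility later rather than being locked out.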

Performance and Compatibility

ReasonAlloc is designed as a "plug-and-play" framework, meaning it can be integrated with existing token-eviction policies without retraining the model and with negligible inference-time overhead. In tests on the mathematical reasoning benchmarks MATH-500 and AIME 2024, using DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, and AceReason-14B, the framework outperformed the uniform-budget methods R-KV and SnapKV as well as Pyramid-RKV, a baseline with a static, monotonically decreasing layer budget. The improvements were most pronounced at small budgets (128-512 tokens), which suggests that ReasonAlloc is most useful when cache space is scarce.
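
One way to picture the plug-and-play separation is that the allocator only decides how many tokens each head may keep, while an existing eviction policy decides which ones. The sketch below assumes a hypothetical policy that already produces one importance score per cached token; the function `evict_to_budget` and the always-kept `recent` window are illustrative choices, not the interface of R-KV, SnapKV, or ReasonAlloc.

```python
from typing import List


def evict_to_budget(scores: List[float], budget: int, recent: int = 8) -> List[int]:
    """Return the sorted indices of cached tokens to keep for one head.

    `scores` are per-token importance values from some eviction policy
    (hypothetical here). The last `recent` tokens are always kept; the
    rest of the budget goes to the highest-scoring older tokens.
    """
    n = len(scores)
    if n <= budget:
        return list(range(n))
    recent = min(recent, budget)
    protected = list(range(n - recent, n))
    older = range(n - recent)
    top = sorted(older, key=lambda i: scores[i], reverse=True)[: budget - recent]
    return sorted(top + protected)


scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.05]
# The same scores under two different head budgets.
print(evict_to_budget(scores, budget=4, recent=2))
print(evict_to_budget(scores, budget=3, recent=2))
```

Under this framing, changing the allocation strategy only changes the `budget` argument passed per head, which is why a budget allocator can in principle sit on top of an unchanged eviction policy.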
