MiniMax Sparse Attention

MiniMax Sparse Attention (MSA) is a new approach designed to make ultra-long-context Large Language Models (LLMs) more efficient and practical to deploy. As models are increasingly used for complex tasks like code reasoning and agentic workflows, they must process hundreds of thousands to millions of tokens. Standard attention mechanisms become prohibitively expensive at this scale due to their quadratic computational cost. MSA addresses this by using a blockwise sparse attention strategy that significantly reduces compute requirements while maintaining the performance of standard Grouped Query Attention (GQA).

A Two-Branch Architecture

MSA works in two stages, one per branch, separating the selection of information from the computation on it. First, a lightweight "Index Branch" scores blocks of key-value pairs and selects a small subset of the most relevant blocks for each attention group. This branch is designed to be computationally inexpensive. Second, the "Main Branch" performs exact attention only on the tokens within those selected blocks. By restricting attention to a small, high-value subset of the context rather than the entire sequence, MSA avoids the cost of dense softmax attention over every token.
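The select-then-attend flow described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the index branch is assumed to score each block by a dot product with its mean-pooled key (the source does not say how blocks are summarized), and a single query stands in for a whole attention group.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool_blocks(keys, block_size):
    """Assumed block summary: the mean key vector of each block."""
    blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
    return [[sum(col) / len(blk) for col in zip(*blk)] for blk in blocks]

def select_blocks(q_index, block_keys, top_k):
    """Index branch: one cheap score per block, keep the top_k block ids."""
    scores = [dot(q_index, bk) for bk in block_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

def sparse_attention(q, keys, values, block_ids, block_size):
    """Main branch: exact softmax attention over tokens in the selected blocks only."""
    token_ids = [
        t
        for b in block_ids
        for t in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, keys[t]) * scale for t in token_ids])
    dim = len(values[0])
    return [
        sum(w * values[t][d] for w, t in zip(weights, token_ids))
        for d in range(dim)
    ]
```

Within the selected blocks the main branch is still ordinary softmax attention; the savings come from never touching the tokens in blocks the index branch discarded.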

Co-Designing Hardware and Software

To ensure that theoretical efficiency translates into real-world speed, the researchers co-designed MSA with specialized GPU execution paths. This includes an "exp-free" Top-k selection kernel that bypasses unnecessary mathematical operations, allowing the model to identify relevant blocks faster. Additionally, the team implemented a "KV-outer" sparse attention approach that optimizes how data is moved and processed on GPU tensor cores. These optimizations allow the system to handle the granular, block-based access patterns of MSA without sacrificing performance.
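The source does not detail the "exp-free" kernel, but one plausible reading rests on a simple fact: softmax is strictly increasing in each logit, so ranking blocks by raw scores selects the same Top-k set as ranking them by softmax probabilities. The sketch below illustrates only that equivalence in plain Python. The function names are hypothetical, and nothing here reflects the actual GPU kernel.

```python
import heapq
import math

def topk_via_softmax(logits, k):
    """Reference path: normalize with softmax, then take the k largest."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(heapq.nlargest(k, range(len(probs)), key=probs.__getitem__))

def topk_exp_free(logits, k):
    """Same selection without any exp: rank the raw logits directly."""
    return sorted(heapq.nlargest(k, range(len(logits)), key=logits.__getitem__))
```

For distinct logits of moderate magnitude both functions return the same block indices. They can disagree only when the exponentials underflow or round to equal values, which the exp-free path avoids by construction.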

Performance and Scalability

When tested on a 109B-parameter model, MSA matched the quality of standard GQA while drastically improving efficiency. At a context length of 1 million tokens, MSA reduced the per-token attention compute by 28.4 times. In wall-clock tests on H800 GPUs, the system achieved 14.2 times faster prefill and 7.6 times faster decoding. These results suggest that MSA is a viable path for scaling LLMs to very long contexts without the quadratic growth in attention cost that dense attention incurs.
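To see why the savings grow with context length, a rough first-order cost model helps. The sketch below is an illustration under loud assumptions: the function name, the block size, the Top-k value, and the relative cost of scoring one block are all hypothetical, and the model makes no attempt to reproduce the 28.4 times figure reported above. It counts dense attention as one unit of work per context token, and sparse attention as one index score per block plus one unit per token in the selected blocks.

```python
def attention_cost_ratio(context_len, block_size, top_k, index_cost_per_block=1.0):
    """Estimated dense / sparse per-query attention cost, in token-equivalent units."""
    dense = context_len  # dense attention touches every token
    num_blocks = -(-context_len // block_size)  # ceiling division
    selected_tokens = min(top_k, num_blocks) * block_size
    sparse = num_blocks * index_cost_per_block + selected_tokens
    return dense / sparse
```

Under this model the main-branch cost stays fixed once `top_k` and `block_size` are chosen, so the ratio rises as the context grows until the per-block index scoring becomes the dominant term.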

Training for Stability

Because selecting specific blocks is inherently non-differentiable, the researchers developed a dedicated training procedure so that the model learns effectively. They use a KL alignment loss to guide the Index Branch, training it to mimic the attention patterns of the Main Branch. To maintain stability, they also employ gradient detachment, an initial warmup phase in which the model uses full attention, and a requirement that the "local block" (the immediate context of the query) is always included. Together, these measures are intended to keep the model coherent and accurate throughout training.
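A minimal sketch of how such an alignment objective might look is given below, in plain Python without autograd. Several details are assumptions not confirmed by the source: the target is taken to be the main branch's attention mass summed per block, the loss direction is KL(target || index), and all function names are hypothetical. In a real framework the target would be detached so that gradients flow only into the Index Branch; here that is indicated by a comment.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def block_attention_mass(token_probs, block_size):
    """Sum the main branch's per-token attention probabilities within each block."""
    return [
        sum(token_probs[i:i + block_size])
        for i in range(0, len(token_probs), block_size)
    ]

def kl_alignment_loss(index_scores, main_token_probs, block_size, eps=1e-12):
    """KL(target || index) over blocks; the target is treated as a constant."""
    target = block_attention_mass(main_token_probs, block_size)  # detached in practice
    pred = softmax(index_scores)
    return sum(
        t * (math.log(t + eps) - math.log(p + eps))
        for t, p in zip(target, pred)
        if t > 0
    )

def select_with_local_block(index_scores, top_k, local_block):
    """Top-k selection that always keeps the query's own (local) block."""
    others = [i for i in range(len(index_scores)) if i != local_block]
    others.sort(key=lambda i: index_scores[i], reverse=True)
    return sorted([local_block] + others[:max(top_k - 1, 0)])
```

The loss is zero when the index branch's block distribution matches the main branch's block-level attention mass and positive otherwise. Forcing the local block into the selection means the query can always attend to its immediate context, even when the index scores are still poorly trained.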
