MiniMax Sparse Attention

Key Takeaways

  • MiniMax Sparse Attention (MSA) is a new approach designed to make ultra-long-context Large Language Models (LLMs) more efficient and practical to deploy.
  • We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention built upon Grouped Query Attention (GQA).
  • Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs.
  • To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access.
  • On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context.
Paper Abstract

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention built upon Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. Our inference kernel is available at: this https URL . A production-grade natively multimodal model powered by MSA has been publicly released at: this https URL .

As models are increasingly used for complex tasks such as code reasoning and agentic workflows, they must process hundreds of thousands to millions of tokens. Standard softmax attention becomes prohibitively expensive at this scale because its cost grows quadratically with sequence length. MiniMax Sparse Attention (MSA) addresses this with a blockwise sparse attention strategy that significantly reduces attention compute while performing on par with standard Grouped Query Attention (GQA).

A Two-Branch Architecture

MSA works in two stages that separate the selection of information from the actual attention computation. First, a lightweight "Index Branch" scores blocks of key-value pairs and independently selects a Top-k subset of the most relevant blocks for each GQA group. This branch is designed to be computationally inexpensive. Second, the "Main Branch" performs exact attention only over the tokens within those selected blocks. By restricting attention to a limited, high-value subset of the context rather than the entire sequence, MSA avoids the cost of dense softmax attention over every token.
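The two stages can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not say how the Index Branch scores blocks, so the per-block summary vectors (`index_keys`), the dot-product scoring, and all function names here are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def select_blocks(index_q, index_keys, top_k):
    """Index Branch (sketch): score every KV block, keep the top_k block ids.

    index_keys[b] is a hypothetical summary vector for block b.
    """
    scores = [dot(index_q, kb) for kb in index_keys]
    ranked = sorted(range(len(scores)), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])

def sparse_attention(q, keys, values, block_ids, block_size):
    """Main Branch (sketch): exact softmax attention over selected blocks only."""
    token_ids = [t for b in block_ids
                 for t in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(q))
    probs = softmax([dot(q, keys[t]) * scale for t in token_ids])
    dim = len(values[0])
    return [sum(p * values[t][d] for p, t in zip(probs, token_ids)) for d in range(dim)]

def msa_group(index_q, index_keys, head_queries, keys, values, top_k, block_size):
    """One GQA group: a single block selection shared by all query heads in the group."""
    block_ids = select_blocks(index_q, index_keys, top_k)
    outputs = [sparse_attention(q, keys, values, block_ids, block_size)
               for q in head_queries]
    return block_ids, outputs

# Toy usage: 8 tokens, block size 2 (4 blocks); two query heads share one selection.
keys = [[0.0, 0.0] for _ in range(8)]
values = [[float(t)] for t in range(8)]
index_keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, -1.0]]
blocks, outputs = msa_group([1.0, 0.5], index_keys, [[1.0, 0.0], [0.0, 1.0]],
                            keys, values, top_k=2, block_size=2)
```

The point of the split is that the expensive step (softmax attention over tokens) only ever touches `top_k * block_size` tokens, while the cheap step touches one score per block.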

Co-Designing the Algorithm and GPU Kernels

To ensure that theoretical sparsity translates into real-world speed, the researchers co-designed MSA with a specialized GPU execution path. This includes an "exp-free" Top-k selection kernel that avoids computing exponentials during block selection, allowing relevant blocks to be identified faster. Additionally, the team implemented a "KV-outer" sparse attention kernel that improves tensor-core utilization under block-granular access. Together, these optimizations let the system handle MSA's block-based access patterns without giving up the speedups that sparsity promises.
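One plausible reading of "exp-free" selection (an assumption, since the summary does not spell out the kernel) is that Top-k can be taken directly on raw block scores: softmax is strictly increasing in each score, so ranking the raw scores yields the same blocks as ranking the softmax probabilities. The short check below illustrates that property with made-up scores.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def top_k_indices(xs, k):
    # Rank by descending value; ties are broken by the lower index.
    return sorted(range(len(xs)), key=lambda i: (-xs[i], i))[:k]

# Hypothetical block scores for one query group.
block_logits = [0.3, 2.1, -1.0, 1.7, 0.9]

with_exp = top_k_indices(softmax(block_logits), 2)  # normalizes first (needs exp)
exp_free = top_k_indices(block_logits, 2)           # ranks raw scores (no exp)
```

Both calls select the same two blocks, which is why the exponentials can be skipped when only the identity of the top blocks matters.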

Performance and Scalability

When tested on a 109B-parameter model with native multimodal training, MSA performed on par with standard GQA while drastically improving efficiency. At a context length of 1 million tokens, MSA reduced per-token attention compute by 28.4x. In wall-clock tests on H800 GPUs with the co-designed kernel, the system achieved 14.2x faster prefill and 7.6x faster decoding. These results suggest that MSA is a viable path for scaling LLMs to very long contexts without paying the full quadratic cost of dense attention.
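To see why the saving grows with context length, a toy cost model helps. The sketch below is an assumption-laden illustration, not the paper's accounting: it counts dense attention as one unit of work per context token, sparse attention as one unit per selected token plus a hypothetical `index_cost_per_block` for scoring each block. The block size and Top-k values are invented and are not meant to reproduce the reported 28.4x.

```python
def per_token_cost_ratio(context_len, block_size, top_k, index_cost_per_block=1.0):
    """Toy ratio of dense to sparse per-token attention work (hypothetical units)."""
    dense = context_len
    num_blocks = -(-context_len // block_size)  # ceiling division
    selected_tokens = min(min(top_k, num_blocks) * block_size, context_len)
    sparse = selected_tokens + index_cost_per_block * num_blocks
    return dense / sparse

# Hypothetical settings: 1M tokens, blocks of 100 tokens, 100 blocks selected.
ideal = per_token_cost_ratio(1_000_000, 100, 100, index_cost_per_block=0)
with_index = per_token_cost_ratio(1_000_000, 100, 100, index_cost_per_block=1.0)
short_ctx = per_token_cost_ratio(10_000, 100, 100, index_cost_per_block=1.0)
```

In this model the selected-token term is fixed while the dense term grows with context length, so the ratio improves at longer contexts; the index term grows too, which is why the Index Branch has to stay lightweight.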

Training for Stability

Because Top-k block selection is non-differentiable, the researchers developed a dedicated training procedure so that the Index Branch still receives a useful learning signal. They use a KL alignment loss that trains the Index Branch to mimic the attention patterns of the Main Branch. To maintain stability, they also employ gradient detachment, an initial warmup phase in which the model uses full attention, and a requirement that the "local block" (the immediate context of the query) is always included. These measures are intended to keep the model coherent and accurate throughout training.
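A minimal sketch of two of these ingredients follows. It makes several assumptions not stated in the summary: the Main Branch target is taken to be the attention probability mass summed per block, the loss direction is KL(target || index prediction), and the target is treated as a constant as one possible place for gradient detachment. All function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def block_mass(token_probs, block_size):
    """Sum token-level attention probabilities within each block."""
    return [sum(token_probs[i:i + block_size])
            for i in range(0, len(token_probs), block_size)]

def kl_divergence(target, predicted, eps=1e-12):
    return sum(t * (math.log(t + eps) - math.log(p + eps))
               for t, p in zip(target, predicted) if t > 0)

def kl_alignment_loss(main_token_logits, index_block_logits, block_size):
    # Target from the Main Branch; treated as a constant here (assumption).
    # In an autodiff framework this would be a stop-gradient on the target.
    target = block_mass(softmax(main_token_logits), block_size)
    predicted = softmax(index_block_logits)
    return kl_divergence(target, predicted)

def select_with_local(index_block_logits, top_k, local_block):
    """Top-k block selection that always keeps the query's local block."""
    ranked = sorted(range(len(index_block_logits)),
                    key=lambda b: (-index_block_logits[b], b))
    chosen = [local_block] + [b for b in ranked if b != local_block]
    return sorted(chosen[:top_k])

# Toy usage: 4 tokens in 2 blocks; the index scores disagree with the main branch.
loss = kl_alignment_loss([0.0, 0.0, 0.0, 0.0], [0.0, 3.0], block_size=2)
picked = select_with_local([0.1, 5.0, 4.0, 0.2], top_k=2, local_block=3)
```

Forcing the local block into the selection means a poorly trained Index Branch can never drop the query's immediate context, which is one way the always-include rule could help early in training.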
