Vortex: Efficient and Programmable Sparse Attention... | AI Research

Key Takeaways

  • Vortex is a system designed to simplify and accelerate the development of sparse attention algorithms for Large Language Models (LLMs).
  • Sparse attention is becoming increasingly important for serving LLMs as generation lengths continue to grow.
  • Deploying and evaluating new sparse attention algorithms at scale remains highly engineering-intensive, which slows both human researchers and AI agents exploring the sparse attention design space.
  • Vortex enables rapid prototyping, deployment, and evaluation of sparse attention algorithms, effectively translating their theoretical efficiency gains into real-world throughput improvements.
  • As a result, Vortex substantially accelerates the design and iteration of sparse attention algorithms.

Paper Abstract

Sparse attention is becoming increasingly important for serving large language models (LLMs) as generation lengths continue to grow. However, deploying and evaluating new sparse attention algorithms at scale remains highly engineering-intensive, slowing both human researchers and AI agents in exploring the sparse attention design. To address this challenge, we present Vortex, a system that combines a Python-embedded frontend language atop a page-centric tensor abstraction for expressing a broad range of sparse attention algorithms, with an efficient backend tightly integrated into modern LLM serving stacks. Vortex enables rapid prototyping, deployment, and evaluation of sparse attention algorithms, effectively translating their theoretical efficiency gains into real-world throughput improvements. As a result, Vortex substantially accelerates the design and iteration of sparse attention algorithms. First, AI agents use Vortex to automatically generate and refine diverse algorithms, the best reaching up to $3.46\times$ higher throughput than full attention while preserving accuracy. Second, Vortex extends sparse attention to emerging architectures and very large models that are otherwise hard to experiment with, reaching up to $4.7\times$ higher throughput on the MLA-based GLM-4.7-Flash and $1.37\times$ on the 229B-parameter MiniMax-M2.7 on NVIDIA B200 GPUs.

Vortex is a system designed to simplify and accelerate the development of sparse attention algorithms for Large Language Models (LLMs). As models generate longer sequences, the movement of key-value (KV) cache data becomes a major performance bottleneck. Sparse attention, which processes only the most relevant parts of the KV cache, offers a solution, but it has historically been difficult to implement and deploy. Vortex bridges this gap with a programmable framework that lets researchers and AI agents prototype new algorithms quickly while still running them efficiently on modern, high-performance serving systems.
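
To make the idea of selective processing concrete, here is a minimal pure-Python sketch of page-level top-k sparse attention for a single query. It is not Vortex code and not an algorithm from the paper: the function name `sparse_attention`, the mean-key page score, and the parameters `page_size` and `top_k` are all assumptions chosen for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_attention(query, keys, values, page_size, top_k):
    """Attend over only the top_k highest-scoring pages of the KV cache.

    Illustrative only: pages are scored by the dot product between the
    query and the mean key of each page (an assumed heuristic).
    """
    # Group token indices into fixed-size pages; the last page may be short.
    pages = [list(range(start, min(start + page_size, len(keys))))
             for start in range(0, len(keys), page_size)]

    def page_score(page):
        mean_key = [sum(keys[t][d] for t in page) / len(page)
                    for d in range(len(query))]
        return dot(query, mean_key)

    # Keep the top_k pages, then restore their original order.
    ranked = sorted(range(len(pages)),
                    key=lambda p: page_score(pages[p]), reverse=True)
    selected = sorted(ranked[:top_k])

    # Scaled softmax attention restricted to tokens in the selected pages.
    tokens = [t for p in selected for t in pages[p]]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[t]) * scale for t in tokens]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    out = [sum(w * values[t][d] for w, t in zip(weights, tokens)) / total
           for d in range(len(values[0]))]
    return out, selected
```

When `top_k` equals the number of pages, this reduces to full attention. Smaller values read fewer KV pages, at the cost of ignoring the low-scoring ones.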

A Programmable Approach to Sparse Attention

The core of Vortex is a Python-embedded frontend language called vFlow. It allows users to define sparse attention algorithms using a simple, "single-request" mental model, treating tensors as if they were contiguous and easy to manage. By abstracting away the complex, non-contiguous memory layouts used in modern LLM serving systems, vFlow enables users to focus on the logic of the algorithm rather than the low-level engineering required to make it run on a GPU.
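
The "single-request" mental model can be illustrated with a small sketch. This is not vFlow syntax, which the summary does not show. The policy `sink_and_window` and the driver `run_batch` are hypothetical names, and the "keep the first pages plus the most recent pages" rule is just one familiar sparse pattern picked as an example.

```python
def sink_and_window(seq_len, page_size, sink_pages=1, window_pages=2):
    """Single-request policy: keep the first pages and the most recent pages.

    Written as if the request's KV cache were one contiguous sequence of
    seq_len tokens; it knows nothing about batching or memory layout.
    """
    num_pages = (seq_len + page_size - 1) // page_size  # ceiling division
    sinks = range(min(sink_pages, num_pages))
    recent = range(max(0, num_pages - window_pages), num_pages)
    return sorted(set(sinks) | set(recent))


def run_batch(policy, seq_lens, page_size):
    """Hypothetical driver: apply a single-request policy to every request.

    A real serving backend would do this over a batched, non-contiguous
    layout; here it is a plain loop to show the separation of concerns.
    """
    return [policy(n, page_size) for n in seq_lens]


if __name__ == "__main__":
    # Three requests of different lengths share one policy definition.
    print(run_batch(sink_and_window, [10, 64, 3], page_size=16))
```

The point of the split is that the policy author only reasons about one sequence, while everything about batching lives in the driver.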

The vTensor Abstraction

To translate these high-level programs into high-performance code, Vortex uses an underlying page-centric tensor abstraction called vTensor. It acts as an interpreter that understands the "paged" memory layouts used in modern serving stacks. Because each tensor carries explicit layout metadata, the system can perform operations such as selecting specific blocks of data or calculating relevance scores without requiring users to manage memory by hand and without redundant data movement. This makes the framework both modular and compatible with existing infrastructure.
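
As a rough illustration of what "explicit layout metadata" can mean for a paged tensor, the toy class below stores a logical sequence in scattered physical pages and selects pages by rewriting the page table instead of copying tokens. The class `PagedTensor` and its methods are assumptions made for this sketch and do not reflect the actual vTensor interface.

```python
class PagedTensor:
    """Toy paged tensor: a logical sequence stored in scattered physical pages."""

    def __init__(self, pool, page_table, seq_len, page_size):
        self.pool = pool              # physical page id -> list of token entries
        self.page_table = page_table  # logical page index -> physical page id
        self.seq_len = seq_len        # number of valid tokens
        self.page_size = page_size

    def token(self, i):
        """Read logical token i by translating it through the page table."""
        if not 0 <= i < self.seq_len:
            raise IndexError(i)
        page, offset = divmod(i, self.page_size)
        return self.pool[self.page_table[page]][offset]

    def select_pages(self, logical_pages):
        """Return a view over the chosen pages without copying token data.

        Pages are kept in ascending order so that a partially filled
        final page, if selected, stays at the end of the view.
        """
        chosen = sorted(set(logical_pages))
        last = len(self.page_table) - 1
        tail = self.seq_len - last * self.page_size  # tokens in the final page
        new_len = sum(tail if p == last else self.page_size for p in chosen)
        new_table = [self.page_table[p] for p in chosen]
        return PagedTensor(self.pool, new_table, new_len, self.page_size)


if __name__ == "__main__":
    pool = {7: ["a", "b"], 2: ["c", "d"], 9: ["e"]}
    kv = PagedTensor(pool, page_table=[7, 2, 9], seq_len=5, page_size=2)
    view = kv.select_pages([2, 0])
    print([view.token(i) for i in range(view.seq_len)])
```

Only the small page table is rebuilt during selection. The view shares the same physical pool, which is the kind of redundant data movement a paged abstraction is meant to avoid.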

Accelerating AI-Driven Research

Vortex is built to support autonomous experimentation. Because it provides a standardized, programmable interface, AI agents can use it to automatically generate, test, and refine diverse sparse attention algorithms. According to the paper, the best of these agent-generated algorithms reached up to 3.46 times higher throughput than full attention while preserving accuracy.

Real-World Performance and Scalability

Beyond rapid prototyping, Vortex delivers measurable real-world performance gains. It extends sparse attention to emerging architectures and very large models that are otherwise hard to experiment with. On NVIDIA B200 GPUs, Vortex reached up to 4.7 times higher throughput on the MLA-based GLM-4.7-Flash and 1.37 times higher throughput on the 229-billion-parameter MiniMax-M2.7. Because its backend is tightly integrated with existing serving stacks, Vortex helps turn theoretical efficiency gains into practical throughput improvements for LLM inference.
