TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference
Multimodal Large Language Models (MLLMs) have become powerful tools for reasoning, but they are often held back by the sheer number of visual tokens they process. Because the cost of self-attention grows quadratically with the number of tokens, processing high-resolution images or long video sequences creates significant computational and memory bottlenecks. This paper introduces TOPS, an approach to visual token pruning that replaces trial-and-error heuristics with a principled, information-theoretic framework for selecting the most important visual information.

A First-Principles Approach

Existing methods for pruning visual tokens typically rely on simple scoring schemes, such as prioritizing tokens with high attention weights or tokens that differ from one another. However, these methods rarely give a formal justification for why the chosen tokens are optimal. The researchers behind TOPS reframe pruning as an "optimal subset selection" problem. Applying information theory, they identify three fundamental pillars for effective pruning:

Task Relevance: Ensuring the selected tokens are directly useful for answering the user's specific query.
Information Coverage: Ensuring the subset retains enough information to represent the original visual input accurately.
Semantic Diversity: Ensuring the selected tokens are not redundant, preventing the model from wasting capacity on nearly identical visual information.

How TOPS Works

TOPS is designed as a training-free, model-agnostic module, meaning it can be plugged into various existing MLLM architectures without requiring additional training. It operates through a two-stage pipeline. In the first stage, it performs a coarse reduction of visual tokens immediately after the image is processed by the vision encoder. In the second stage, it performs a more refined selection inside the language model layers. By dynamically calculating the relevance, coverage, and diversity of tokens, TOPS greedily builds an "optimal preservation set" that keeps the most critical visual evidence while discarding the rest.
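The greedy construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine-similarity scores, the facility-location style coverage gain, the max-similarity redundancy penalty, the weights `w_cov` and `w_div`, and the function name `greedy_preservation_set` are all assumptions chosen for illustration, since the summary does not give the actual TOPS objective.

```python
import math
from typing import List, Sequence, Set

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def greedy_preservation_set(
    tokens: List[Vector],
    query: Vector,
    budget: int,
    w_cov: float = 1.0,  # hypothetical weight on the coverage term
    w_div: float = 1.0,  # hypothetical weight on the redundancy penalty
) -> List[int]:
    """Greedily pick `budget` token indices, trading off three assumed scores:
    relevance to the query, coverage of all tokens, and diversity."""
    n = len(tokens)
    sim = [[cosine(tokens[i], tokens[j]) for j in range(n)] for i in range(n)]
    relevance = [cosine(t, query) for t in tokens]
    best_cover = [0.0] * n  # how well each token is represented by the current set
    selected: List[int] = []
    picked: Set[int] = set()

    while len(selected) < min(budget, n):
        best_idx, best_gain = -1, -math.inf
        for c in range(n):
            if c in picked:
                continue
            # Coverage: average improvement in how well every token is represented.
            cov_gain = sum(max(sim[c][j] - best_cover[j], 0.0) for j in range(n)) / n
            # Diversity: penalize similarity to the closest already-selected token.
            redundancy = max((sim[c][s] for s in selected), default=0.0)
            gain = relevance[c] + w_cov * cov_gain - w_div * redundancy
            if gain > best_gain:
                best_idx, best_gain = c, gain
        selected.append(best_idx)
        picked.add(best_idx)
        best_cover = [max(best_cover[j], sim[best_idx][j]) for j in range(n)]
    return selected


# Toy input: token 1 is a near-duplicate of token 0.
tokens = [[1.0, 0.0], [1.0, -0.05], [0.0, 1.0], [0.6, 0.8]]
query = [1.0, 0.2]
selected = greedy_preservation_set(tokens, query, budget=2)
print(selected)
```

In this toy input the near-duplicate token is skipped once its twin has been selected, because the redundancy penalty cancels its relevance. A two-stage pipeline like the one described could, under these assumptions, call such a routine once with a generous budget on vision-encoder outputs and again with a tighter budget on language-model hidden states.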

Performance and Efficiency

Experiments across seven MLLM backbones and 14 benchmarks show that TOPS consistently outperforms previous pruning methods. Even after a large share of redundant tokens is removed, the model can maintain or even improve its performance. For example, on the LLaVA-NeXT model, TOPS removed 77.8% of visual tokens while retaining 100% to 100.6% of the original performance. The researchers also note that, by removing redundant visual noise, TOPS can sometimes help mitigate model hallucinations, leading to more accurate and efficient multimodal reasoning.
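To make the 77.8% figure concrete, the short calculation below estimates how many visual tokens remain and how the quadratic attention term among visual tokens shrinks. The starting count of 2,880 visual tokens is an assumption used only for illustration; the summary does not state the token count. The ratio ignores text tokens and the non-attention parts of the model, so it should not be read as an end-to-end speedup.

```python
def tokens_kept(total_tokens: int, pruned_fraction: float) -> int:
    """Number of visual tokens remaining after pruning the given fraction."""
    return round(total_tokens * (1.0 - pruned_fraction))


def relative_attention_cost(kept: int, total: int) -> float:
    """Ratio of pairwise attention scores among visual tokens, after vs. before pruning."""
    return (kept / total) ** 2


total = 2880  # assumed visual token count, not taken from the source
kept = tokens_kept(total, 0.778)
ratio = relative_attention_cost(kept, total)
print(f"kept {kept} of {total} tokens; visual attention term at {ratio:.1%} of original")
```

Under this assumption, roughly a fifth of the tokens remain, and the pairwise attention term among them drops to about 5% of its original size.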

Key Takeaways

The success of TOPS suggests that the quality of visual tokens matters more than their quantity. By focusing on the intrinsic information value of each token rather than just its raw attention score, the method provides a scalable way to handle high-resolution and video-based inputs. Because it is model-agnostic and requires no retraining, it offers a practical option for developers who want lighter, more efficient MLLMs without sacrificing the reasoning capabilities that make these models valuable.
