Paper Abstract

Multimodal large language models (MLLMs) have achieved strong multimodal reasoning capabilities, but their efficiency is limited by the large number of visual tokens, which introduces substantial computational overhead. Visual token pruning offers a natural solution, yet existing methods are imperfect: attention-based criteria tend to retain redundant tokens, while diversity-based criteria are often agnostic to user instructions. Even methods that combine multiple criteria still lack a principled formulation of the intrinsic objective of token pruning. In this paper, we revisit visual token pruning from a first-principles perspective and formulate it as constructing Token Optimal Preservation Sets. Through a top-down information-theoretic analysis, we identify three fundamental principles for effective token selection: Task Relevance, Information Coverage, and Semantic Diversity. Based on these principles, we propose TOPS, a training-free and model-agnostic pruning module that can be applied to various MLLMs. Extensive experiments on 7 MLLM backbones and 14 benchmarks demonstrate that TOPS outperforms prior methods under diverse pruning settings. Notably, on LLaVA-NeXT, TOPS removes 77.8% of visual tokens while preserving 100.0% and 100.6% performance on its 7B and 13B models, respectively, suggesting that pruning redundant visual tokens can sometimes mitigate hallucination and inspire future lightweight MLLM design.

TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference
Multimodal Large Language Models (MLLMs) have become powerful tools for reasoning, but they are often held back by the sheer number of visual tokens they process. Because self-attention mechanisms scale quadratically with the number of tokens, processing high-resolution images or long video sequences creates significant computational and memory bottlenecks. This paper introduces TOPS, a new approach to visual token pruning that moves away from trial-and-error heuristics to a more principled, information-theoretic framework for selecting the most important visual information.

A First-Principles Approach

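Existing methods for pruning visual tokens typically rely on simple scoring schemes, such as keeping tokens with high attention weights or tokens that are dissimilar from one another. According to the paper, each has a weakness: attention-based criteria tend to retain redundant tokens, diversity-based criteria are often agnostic to the user's instruction, and even methods that combine several criteria lack a principled formulation of what pruning should optimize. The researchers behind TOPS reframe the task as an "optimal subset selection" problem, namely constructing a Token Optimal Preservation Set. Through a top-down information-theoretic analysis, they identify three fundamental principles for effective token selection: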

  • Task Relevance: Ensuring the selected tokens are directly useful for answering the user's specific query.

  • Information Coverage: Ensuring the subset retains enough information to represent the original visual input accurately.

  • Semantic Diversity: Ensuring the selected tokens are not redundant, preventing the model from wasting capacity on nearly identical visual information.

How TOPS Works

TOPS is designed as a training-free, model-agnostic module, meaning it can be plugged into various existing MLLM architectures without requiring additional training. It operates through a two-stage pipeline. In the first stage, it performs a coarse reduction of visual tokens immediately after the image is processed by the vision encoder. In the second stage, it performs a more refined selection inside the language model layers. By dynamically calculating the relevance, coverage, and diversity of tokens, TOPS greedily builds an "optimal preservation set" that keeps the most critical visual evidence while discarding the rest.
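The greedy construction described above can be illustrated with a minimal sketch. This is not the paper's algorithm: the summary does not give the scoring formulas, so the cosine-similarity relevance term, the facility-location-style coverage gain, the max-similarity redundancy penalty, the weights `w_rel`, `w_cov`, `w_div`, and the function name `greedy_preservation_set` are all assumptions chosen only to show how the three principles might be combined in one greedy loop.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_preservation_set(tokens, query, k, w_rel=1.0, w_cov=1.0, w_div=1.0):
    """Greedily pick up to k token indices (hypothetical scoring, not the TOPS formulas).

    tokens: list of visual token embeddings (lists of floats)
    query:  embedding standing in for the user's instruction
    """
    n = len(tokens)
    k = min(k, n)
    sim = [[cosine(tokens[i], tokens[j]) for j in range(n)] for i in range(n)]
    # Task relevance (assumed): similarity of each token to the query embedding.
    relevance = [cosine(t, query) for t in tokens]
    # covered[j]: how well token j is already represented by the selected set.
    covered = [0.0] * n
    selected = []
    for _ in range(k):
        best_index, best_score = -1, -math.inf
        for i in range(n):
            if i in selected:
                continue
            # Information coverage (assumed): average improvement in how well
            # every original token is represented if token i is added.
            coverage_gain = sum(max(sim[i][j] - covered[j], 0.0) for j in range(n)) / n
            # Semantic diversity (assumed): penalize similarity to tokens already kept.
            redundancy = max((sim[i][s] for s in selected), default=0.0)
            score = w_rel * relevance[i] + w_cov * coverage_gain - w_div * redundancy
            if score > best_score:
                best_index, best_score = i, score
        selected.append(best_index)
        covered = [max(covered[j], sim[best_index][j]) for j in range(n)]
    return selected


# Toy example: tokens 0 and 1 are near-duplicates that both match the query.
tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.0]
print(greedy_preservation_set(tokens, query, k=2))
```

In this toy input, the first pick is one of the two query-aligned near-duplicates, and the redundancy penalty then keeps the other duplicate out of the set. A real implementation would operate on batched tensors inside the model and, per the description above, would run once after the vision encoder and again inside the language model layers.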

Performance and Efficiency

Experiments across seven MLLM backbones and 14 benchmarks show that TOPS outperforms prior pruning methods under diverse pruning settings. In some cases, removing a large share of redundant tokens preserves or even slightly improves performance. For example, on LLaVA-NeXT, TOPS removes 77.8% of visual tokens while preserving 100.0% of the original performance on the 7B model and 100.6% on the 13B model. The researchers suggest that pruning redundant visual tokens can sometimes mitigate hallucination.
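To make the 77.8% figure concrete, the short calculation below estimates what such a pruning ratio means for token count and for the quadratic part of self-attention. The input of 1,000 visual tokens is a hypothetical round number, not a figure from the paper, and the helper name `pruning_savings` is illustrative. The quadratic estimate covers only attention among the visual tokens themselves; it ignores text tokens and the parts of the model whose cost grows roughly linearly with sequence length.

```python
def pruning_savings(num_visual_tokens, prune_ratio):
    """Return (tokens kept, fraction kept, fraction of visual-to-visual attention pairs kept)."""
    kept = round(num_visual_tokens * (1.0 - prune_ratio))
    kept_fraction = kept / num_visual_tokens
    # Pairwise attention among visual tokens scales with the square of their count.
    return kept, kept_fraction, kept_fraction ** 2


# Hypothetical input size; 0.778 is the pruning ratio reported for LLaVA-NeXT.
kept, linear_share, quadratic_share = pruning_savings(1000, 0.778)
print(kept, linear_share, quadratic_share)
```

Under these assumptions, about 22% of the visual tokens remain, and the pairwise attention work among them falls to roughly 5% of the original. Actual end-to-end speedups depend on the model, the prompt length, and the cost of the pruning step itself.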

Key Takeaways

The results suggest that which visual tokens are kept can matter more than how many are kept. By scoring tokens on task relevance, information coverage, and semantic diversity rather than on raw attention alone, the method may offer a scalable way to handle high-resolution and video-based inputs. Because it is training-free and model-agnostic, it is a practical option for developers who want lighter, more efficient MLLMs without sacrificing the reasoning capabilities that make these models valuable.
