Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection

Paper Abstract

While token-level entropy is commonly recognized as effective for credit assignment in text-only reinforcement learning with verifiable rewards (RLVR), it remains unclear whether this mechanism still holds in visual reasoning. Our controlled study shows that this mechanism collapses in visual reasoning due to the omission of vision-sensitive tokens with naturally low entropy. Although existing multimodal RL methods increasingly acknowledge the importance of visual perception, they struggle to satisfy the inherent demand for interleaving precise perceptual grounding with semantic reasoning, either lacking systematic visual measurements or overlooking that token entropy primarily drives semantic exploration. To address this, we introduce VEPO (Vision-Entropy token-selection for Policy Optimization), an effective RL framework explicitly integrating visual sensitivity with token entropy via a principled multiplicative coupling, where VEPO redirects gradient credit toward tokens which are simultaneously visually grounded and highly informative. Extensive experiments demonstrate VEPO's leading performance, significantly outperforming the entropy-only baseline by 2.28 points at 7B-scale and 3.15 points at 3B-scale. Ablations further substantiate the soundness of our method.

Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection
Reinforcement learning (RL) has become a powerful tool for teaching Large Language Models (LLMs) to reason. In text-only tasks, researchers have found that focusing training on "high-entropy" tokens (those where the model is most uncertain) leads to better performance. However, this paper shows that the strategy fails when applied to visual reasoning. The authors introduce VEPO, a framework that directs training toward tokens that are both visually grounded and informative, rather than relying on token entropy alone.

The Failure of Entropy in Visual Reasoning

In text-based RL, high-entropy tokens act as "forking points" that drive exploration and provide the most useful feedback for learning. The researchers found that this mechanism collapses in visual reasoning. When a model processes images, many critical decisions are tied to tokens with naturally low entropy: the model is confident at those positions because it is correctly grounding its reasoning in the visual data. Because standard entropy-based methods skip these low-entropy, vision-sensitive tokens, they miss the most important parts of the reasoning process and often perform no better than random token selection.
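The entropy-only selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the toy logits, and the `keep_ratio` value are all assumptions made for the example. It computes per-token Shannon entropy and keeps only the highest-entropy positions, which shows how a confident (low-entropy) token is dropped no matter how much it depends on the image.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def top_entropy_mask(entropies, keep_ratio=0.2):
    """Keep the keep_ratio fraction of tokens with the highest entropy.

    keep_ratio is a hypothetical hyperparameter chosen for illustration.
    """
    k = max(1, round(keep_ratio * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(entropies))]

# Toy sequence: 5 token positions over a 4-word vocabulary (made-up logits).
toy_logits = [
    [0.0, 0.0, 0.0, 0.0],   # uniform prediction: maximal entropy
    [10.0, 0.0, 0.0, 0.0],  # very confident prediction: near-zero entropy
    [2.0, 2.0, 0.0, 0.0],
    [8.0, 0.0, 0.0, 0.0],
    [1.0, 0.5, 0.0, 0.0],
]
entropies = [entropy(softmax(row)) for row in toy_logits]
mask = top_entropy_mask(entropies, keep_ratio=0.4)
```

Only the tokens where `mask` is true would receive gradient under an entropy-only scheme. The confident positions (indices 1 and 3 here) are excluded even if, in a visual task, they are exactly the ones that depend on the image.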

How VEPO Works

To solve this, the authors developed VEPO (Vision-Entropy token-selection for Policy Optimization). Instead of looking only at token entropy, VEPO performs a "counterfactual" check: it compares the model’s output when viewing the original image against its output when viewing a noise-perturbed version of the same image.

The framework uses two signals to measure visual sensitivity:

  • Jensen–Shannon Divergence (JSD): Measures how much the model’s prediction distribution shifts when the image is corrupted.
  • Entropy Gap: Measures how much the model’s confidence changes due to visual corruption.

VEPO combines these visual signals with the standard token entropy using a "multiplicative coupling" method. Because the two factors are multiplied, a token scores highly only when it is both highly informative and strongly connected to the visual input. Tokens that are uninformative or insensitive to the image are down-weighted, which focuses learning on the most relevant decision points.
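The two sensitivity signals and the multiplicative coupling can be illustrated with a small sketch. The exact formulas VEPO uses are not given in this summary, so several choices below are assumptions for illustration only: the sensitivity value is taken as JSD plus the absolute entropy gap, the coupled score is token entropy times that sensitivity, and all function names and toy distributions are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def kl(p, q):
    """KL(p || q); assumes q > 0 wherever p > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0.0)

def jsd(p, q):
    """Jensen-Shannon divergence between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def coupled_scores(clean, perturbed):
    """Per-token score = entropy * visual sensitivity.

    ASSUMPTION: sensitivity = JSD + |entropy gap|. The paper's actual
    combination of the two signals may differ.
    """
    scores = []
    for p, q in zip(clean, perturbed):
        entropy_gap = abs(entropy(q) - entropy(p))
        sensitivity = jsd(p, q) + entropy_gap
        scores.append(entropy(p) * sensitivity)
    return scores

# Toy next-token distributions (made up) for three positions, predicted
# with the original image (clean) and with a noise-perturbed image.
clean = [
    [0.5, 0.3, 0.2],    # uncertain, but unaffected by the image
    [0.9, 0.05, 0.05],  # confident, and strongly affected by the image
    [0.4, 0.4, 0.2],    # uncertain, and mildly affected by the image
]
perturbed = [
    [0.5, 0.3, 0.2],
    [0.2, 0.4, 0.4],
    [0.2, 0.2, 0.6],
]
scores = coupled_scores(clean, perturbed)
entropy_only = [entropy(p) for p in clean]
```

In this toy case the first token has high entropy but a coupled score of zero, because corrupting the image does not change its distribution. The second token ranks last by entropy alone but first under the coupled score, which is the kind of vision-sensitive, low-entropy token that entropy-only selection misses.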

Key Results

The researchers tested VEPO using the Qwen2.5-VL model across several benchmarks, including geometry and math-heavy visual tasks. VEPO outperforms the standard entropy-only approach by 2.28 points at the 7B scale and 3.15 points at the 3B scale. The resulting models also match or exceed much larger open-source multimodal models, suggesting that better credit assignment during training can be more effective than simply increasing model size.

Why This Matters

This research highlights a fundamental gap in how we train multimodal AI. By demonstrating that visual reasoning requires a different approach to credit assignment than pure text, the authors provide a more efficient way to train vision-language models. The success of VEPO suggests that for AI to truly "see" and reason about images, training frameworks must explicitly integrate visual sensitivity into the learning loop, ensuring that the model learns from the visual evidence that actually drives its reasoning.
