Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection
Reinforcement learning (RL) has become a powerful tool for teaching Large Language Models (LLMs) to reason. In text-only tasks, researchers have found that focusing training on "high-entropy" tokens, those where the model is most uncertain, leads to better performance. However, this paper reports that the strategy fails when applied to visual reasoning. The authors introduce VEPO, a framework that selects training tokens that are both visually grounded and informative, rather than relying on text-based uncertainty alone.
The Failure of Entropy in Visual Reasoning
In text-based RL, high-entropy tokens act as "forking points" that drive exploration and provide the most useful feedback for learning. The researchers found that this mechanism breaks down in visual reasoning. When a model processes images, many critical decisions fall on tokens with naturally low entropy: the model is confident at those positions because it is correctly grounding its reasoning in the visual data. Because standard entropy-based methods ignore these low-entropy, vision-sensitive tokens, they miss the most important parts of the reasoning process and often perform no better than random token selection.
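For readers unfamiliar with the entropy-only baseline being criticized here, the following is a minimal sketch of how such a selection could look. The function names and the `fraction` parameter are illustrative assumptions, not taken from the paper, and the toy distributions are made up.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def top_entropy_mask(distributions, fraction=0.2):
    """Mark the `fraction` of positions with the highest entropy.

    `fraction` is a hypothetical hyperparameter chosen for illustration.
    """
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(entropies))]

# Toy next-token distributions for five positions in a response.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.70, 0.10, 0.10, 0.10],
    [0.90, 0.05, 0.03, 0.02],
    [0.40, 0.30, 0.20, 0.10],
]
mask = top_entropy_mask(dists, fraction=0.4)
```

In this toy case only positions 1 and 4 would receive a training signal. A confident position such as 0 is always dropped, even if its confidence comes from reading the image correctly, which is the blind spot the article describes.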
How VEPO Works
To solve this, the authors developed VEPO (Vision-Entropy token-selection for Policy Optimization). Instead of looking only at text uncertainty, VEPO performs a "counterfactual" check: it compares the model’s output when viewing the original image against its output when viewing a noise-perturbed version of the same image.
The framework uses two specific signals to measure visual sensitivity:
Jensen–Shannon Divergence (JSD): Measures how much the model’s prediction distribution shifts when the image is corrupted.
Entropy Gap: Measures how much the model’s confidence changes due to visual corruption.
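The two signals can be sketched for a single token position as follows. This is an illustration under stated assumptions: `p_clean` and `p_noisy` stand for the model's next-token distributions given the original and the noise-perturbed image, and the entropy gap is assumed here to be the signed difference in entropy between the two. The paper's exact definitions may differ.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def kl(p, q):
    """KL divergence KL(p || q); terms with p[i] == 0 contribute nothing."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0.0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by ln 2."""
    m = [(x + y) / 2.0 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def entropy_gap(p_clean, p_noisy):
    """Assumed definition: entropy with the noisy image minus entropy with the clean one."""
    return entropy(p_noisy) - entropy(p_clean)

# Hypothetical distributions at one position: confident with the real image,
# close to uniform once the image is corrupted.
p_clean = [0.90, 0.05, 0.05]
p_noisy = [0.34, 0.33, 0.33]

shift = jsd(p_clean, p_noisy)
gap = entropy_gap(p_clean, p_noisy)
```

A token whose distribution barely moves under corruption gets a JSD and gap near zero, which suggests the image played little role in that prediction.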
VEPO combines these visual signals with the standard token entropy using a "multiplicative coupling" method. This ensures that the model prioritizes tokens that are simultaneously highly informative and strongly connected to the visual input, effectively filtering out noise and focusing the learning process on the most relevant decision points.
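One way a multiplicative coupling could behave is sketched below. The scoring formula, the function names, and the `fraction` parameter are all assumptions made for illustration; the article does not give the paper's exact formula. The sketch assumes the per-token entropy, JSD, and entropy gap have already been computed, and the numbers are invented.

```python
def coupled_scores(entropies, jsds, gaps):
    """Assumed form: score_t = H_t * (JSD_t + |gap_t|).

    Because the terms multiply, a token needs both nonzero entropy and
    nonzero visual sensitivity to receive a high score.
    """
    return [h * (d + abs(g)) for h, d, g in zip(entropies, jsds, gaps)]

def select_top(scores, fraction=0.5):
    """Return the sorted indices of the top `fraction` of tokens by score."""
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical per-token signals for a four-token response.
entropies = [1.30, 0.20, 0.90, 0.05]
jsds      = [0.00, 0.30, 0.20, 0.01]
gaps      = [0.00, 0.50, -0.10, 0.02]

scores = coupled_scores(entropies, jsds, gaps)
vision_anchored = select_top(scores)     # picks tokens 1 and 2
entropy_only = select_top(entropies)     # picks tokens 0 and 2
```

In this toy case the entropy-only ranking keeps token 0, which is uncertain but unaffected by the image, while the coupled score replaces it with token 1, which has low entropy but reacts strongly to image corruption.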
Key Results
The researchers tested VEPO using the Qwen2.5-VL model across several benchmarks, including geometry and math-heavy visual tasks. VEPO outperforms the standard entropy-only approach by 2.28 points at the 7B scale and by 3.15 points at the 3B scale. The resulting models were also competitive with, or exceeded, much larger open-source multimodal models, suggesting that better credit assignment during training can be more effective than simply increasing model size.
Why This Matters
This research highlights a fundamental gap in how we train multimodal AI. By demonstrating that visual reasoning requires a different approach to credit assignment than pure text, the authors provide a more efficient way to train vision-language models. The success of VEPO suggests that for AI to truly "see" and reason about images, training frameworks must explicitly integrate visual sensitivity into the learning loop, ensuring that the model learns from the visual evidence that actually drives its reasoning.