Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs investigates a blind spot in omnimodal models, which process video, audio, and text. When such a model is given a question containing a false premise and answers it anyway, is that because it cannot perceive the error, or because it fails to act on what it perceives? Although these models are designed to ground their reasoning in sensory data, the research finds that they often "blindly trust" the text even when it directly contradicts the video or audio they are processing.
The IMAVB Benchmark
To test this, the researchers created IMAVB, a new benchmark consisting of 500 long-form movie clips. The design uses a 2x2 structure, crossing two modalities (vision and audio) with two conditions (standard and misleading). In the misleading version, the researchers swap a single detail, such as the color of a shirt or a sound effect, so that the textual question conflicts with the actual sensory input. This allows the team to measure whether a model can detect these mismatches separately from its general comprehension abilities.
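The 2x2 design can be pictured with a small Python sketch. It is illustrative only and is not the benchmark's actual construction code: the names `build_cells` and `make_misleading` and the example question are hypothetical, and the real IMAVB items may be authored quite differently.

```python
from itertools import product

# The two axes of the design, as described in the text.
MODALITIES = ("vision", "audio")
CONDITIONS = ("standard", "misleading")


def build_cells():
    """Enumerate the four (modality, condition) cells of the 2x2 design."""
    return list(product(MODALITIES, CONDITIONS))


def make_misleading(question, true_detail, false_detail):
    """Swap one detail in a question to create a false premise (hypothetical helper)."""
    if true_detail not in question:
        raise ValueError("the detail to swap must appear in the question")
    # Replace only the first occurrence so exactly one detail changes.
    return question.replace(true_detail, false_detail, 1)


cells = build_cells()

# Made-up example question; it is not taken from the benchmark.
standard_q = "Why does the man in the red shirt leave the room?"
misleading_q = make_misleading(standard_q, "red shirt", "blue shirt")
```

Because the standard and misleading versions differ in exactly one detail, a change in a model's behaviour between them can be attributed to that detail.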
The Representation-Action Gap
The study tested eight open-source omnimodal models and Gemini 3.1 Pro, uncovering a consistent "Representation-Action Gap." Internally, the models' hidden states successfully encode the mismatch between the false premise and the sensory reality. However, this internal awareness rarely translates into the model's final output. Most models fall into an "under-rejection" trap, where they ignore the error and answer the question as if the false premise were true. A few models, like Qwen3-Omni and Gemini 3.1 Pro, exhibit "over-rejection," where they catch more errors but sacrifice accuracy on standard, truthful questions.
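One way to make the two failure modes concrete is to compute a rejection rate separately for each condition. The sketch below is an assumption about how such a tally could look, not the paper's evaluation code: the record format and the function name `rejection_rates` are hypothetical, and the toy records are invented for illustration.

```python
def rejection_rates(records):
    """Return the fraction of rejected questions per condition.

    records: iterable of (condition, rejected) pairs, where condition is
    "standard" or "misleading" and rejected is True when the model
    refused the premise. This record format is hypothetical.
    """
    counts = {"standard": [0, 0], "misleading": [0, 0]}
    for condition, rejected in records:
        counts[condition][0] += int(rejected)  # number rejected
        counts[condition][1] += 1              # number seen
    return {c: (r / n if n else 0.0) for c, (r, n) in counts.items()}


# Invented toy data: the model rejects 1 of 4 misleading questions
# and none of the 4 standard ones.
toy_records = [
    ("misleading", True),
    ("misleading", False),
    ("misleading", False),
    ("misleading", False),
    ("standard", False),
    ("standard", False),
    ("standard", False),
    ("standard", False),
]

rates = rejection_rates(toy_records)
```

In these terms, under-rejection shows up as a low rate on misleading questions, while over-rejection shows up as a high rate on standard questions, where nothing should be rejected.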
Modality Asymmetry and Intervention
The research highlights a clear asymmetry between vision and audio: models are significantly worse at grounding their answers in audio than in vision. This gap is resistant to various prompt-based interventions, suggesting that the issue is deeply embedded in how these models process information. To address this, the authors introduced a diagnostic tool called Probe-Guided Logit Adjustment (PGLA). By re-injecting the internal "mismatch" signal directly into the model's output layer, they were able to improve rejection performance by an average of 15 percentage points. This suggests that the bottleneck for omnimodal grounding is not a lack of perception, but a failure in translating that perception into action.
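The general idea of feeding a probe's signal back into the output can be sketched as follows. This is a minimal illustration under strong assumptions, not the authors' PGLA implementation: it assumes a linear probe over a single hidden-state vector, a single "reject the premise" token, and a scaling factor `alpha`. All of these names and the toy numbers are hypothetical.

```python
import math


def probe_score(hidden, weights, bias):
    """Linear probe: a signed score, assumed positive when a mismatch is encoded."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias


def adjust_logits(logits, reject_id, score, alpha=1.0):
    """Return a copy of logits with alpha * score added to the rejection token."""
    adjusted = list(logits)
    adjusted[reject_id] += alpha * score
    return adjusted


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy values, invented for illustration.
hidden = [0.5, -1.0, 2.0]    # stand-in for a hidden state
weights = [1.0, 0.5, 1.0]    # stand-in for trained probe weights
bias = -0.5
logits = [2.0, 0.0, 1.0]     # index 1 plays the role of the rejection token
REJECT_ID = 1

score = probe_score(hidden, weights, bias)
adjusted = adjust_logits(logits, REJECT_ID, score, alpha=2.0)

p_before = softmax(logits)[REJECT_ID]
p_after = softmax(adjusted)[REJECT_ID]
```

With these toy numbers, the rejection token goes from the least likely option to the most likely one once the probe score is added, which mirrors the claim that the mismatch signal already exists internally and only needs to reach the output.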