Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

Key Takeaways

When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action?
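The Representation-Action Gap (hidden states encode premise-perception mismatches that the outputs rarely reject) is modality-asymmetric (audio grounding underperforms vision) and prompt-resistant across seven variants.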
As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior.
Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception.

Paper Abstract

When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action? Recent omnimodal models are positioned as perception-grounded agents that jointly process video, audio, and text, yet a basic form of grounding remains untested: catching a textual claim that conflicts with the model's own sensory input. We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension. Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise-perception mismatches even when the same models almost never reject the false claim in their outputs. Behaviorally, models fall into two failure modes: under-rejection, in which they answer misleading questions as if the false premise were true; and over-rejection, in which they reject more often but also reject standard questions, sacrificing ordinary comprehension accuracy. The gap is modality-asymmetric (audio grounding underperforms vision) and prompt-resistant across seven variants. As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior. Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception.

Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs investigates a basic blind spot in omnimodal models, which jointly process video, audio, and text: when such a model is given a question containing a false premise, does it fail because it cannot perceive the error, or because it does not act on what it perceives? Although these models are designed to ground their reasoning in sensory data, the research shows that they often "blindly trust" the text even when it directly contradicts the video or audio they are processing.

The IMAVB Benchmark

To test this, the researchers created IMAVB, a curated benchmark of 500 clips drawn from long-form movies. The design is a 2x2 grid that crosses target modality (vision or audio) with premise condition (standard or misleading). In the misleading condition, the researchers swap a single detail, such as the color of a shirt or a sound effect, so that the textual question conflicts with the actual sensory input. Comparing the two conditions lets the team measure conflict detection separately from ordinary multimodal comprehension.
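The 2x2 design can be pictured as four cells, each with its own rejection rate. The sketch below is only an illustration: the `Response` record, the `rejection_rates` helper, and the toy data are all hypothetical and are not taken from the paper or from IMAVB. Low rejection in the misleading cells would correspond to under-rejection, and high rejection in the standard cells to over-rejection.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    """One model response to one benchmark question (hypothetical record)."""
    modality: str    # "vision" or "audio"
    condition: str   # "standard" or "misleading"
    rejected: bool   # did the model reject the question's premise?


def rejection_rates(responses):
    """Return {(modality, condition): fraction of responses that rejected}."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [rejections, total]
    for r in responses:
        cell = (r.modality, r.condition)
        counts[cell][0] += int(r.rejected)
        counts[cell][1] += 1
    return {cell: rej / total for cell, (rej, total) in counts.items()}


# Toy, made-up responses for illustration only (not results from the paper).
toy = [
    Response("vision", "misleading", True),
    Response("vision", "misleading", False),
    Response("vision", "standard", False),
    Response("vision", "standard", False),
    Response("audio", "misleading", False),
    Response("audio", "misleading", False),
    Response("audio", "standard", False),
    Response("audio", "standard", True),
]

rates = rejection_rates(toy)
for cell in sorted(rates):
    print(cell, rates[cell])
```

Reporting all four cells, rather than a single accuracy number, is what makes it possible to tell a model that never rejects from one that rejects indiscriminately.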

The Representation-Action Gap

The study tested eight open-source omnimodal models and Gemini 3.1 Pro, uncovering a consistent "Representation-Action Gap." Internally, the models' hidden states successfully encode the mismatch between the false premise and the sensory reality. However, this internal awareness rarely translates into the model's final output. Most models fall into an "under-rejection" trap, where they ignore the error and answer the question as if the false premise were true. A few models, like Qwen3-Omni and Gemini 3.1 Pro, exhibit "over-rejection," where they catch more errors but sacrifice accuracy on standard, truthful questions.
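To make "hidden states encode the mismatch" concrete, here is a minimal sketch of a linear probe. It assumes a simple mean-difference probe over toy 2-dimensional vectors; the summary does not specify which probe the authors trained, so the function names, the probe type, and the data are illustrative assumptions only.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_mean_difference_probe(mismatch_states, match_states):
    """Fit a linear probe (w, b): w points from the 'match' class mean to the
    'mismatch' class mean, and b places the boundary at their midpoint."""
    mu_pos = mean(mismatch_states)
    mu_neg = mean(match_states)
    w = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    b = -dot(w, midpoint)
    return w, b


def probe_score(w, b, h):
    """Positive score: the probe reads 'premise conflicts with perception'."""
    return dot(w, h) + b


# Toy stand-ins for hidden states (made up; real states are high-dimensional).
mismatch_states = [[1.0, 0.2], [0.8, -0.1]]
match_states = [[-1.0, 0.1], [-0.9, 0.0]]
w, b = fit_mean_difference_probe(mismatch_states, match_states)

# A held-out misleading item where the model answered without rejecting:
held_out_state, model_rejected = [0.7, 0.3], False
probe_detects = probe_score(w, b, held_out_state) > 0
gap_example = probe_detects and not model_rejected
print("probe detects mismatch:", probe_detects, "| gap example:", gap_example)
```

An item counts toward the gap in this sketch when the probe reads a mismatch from the hidden state but the model's output does not reject the premise.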

Modality Asymmetry and Intervention

The research highlights a clear asymmetry between vision and audio: models ground their answers less reliably in audio than in vision. The gap also persists across seven prompt variants, suggesting that prompting alone does not fix it. As an initial diagnostic intervention, the authors introduced Probe-Guided Logit Adjustment (PGLA). By re-injecting the internal "mismatch" signal into decoding, they were able to improve rejection performance by an average of 15 percentage points. This suggests that the bottleneck for omnimodal grounding is not a lack of perception, but a failure in translating that perception into action.
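The general idea of a probe-guided logit adjustment can be sketched as follows. This is not the authors' formulation, which the summary does not give: the additive boost, the `alpha` strength, the sigmoid squashing, and the notion of a fixed set of "rejection" token ids are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adjust_logits(logits, probe_score, reject_ids, alpha=4.0):
    """Add a boost to rejection-token logits, scaled by the probe's confidence.

    alpha and reject_ids are hypothetical knobs. The boost is never exactly
    zero, but it is small when the probe score is strongly negative.
    """
    boost = alpha * sigmoid(probe_score)
    return [z + boost if i in reject_ids else z for i, z in enumerate(logits)]


# Toy 2-token vocabulary: index 0 = "answer as asked", index 1 = "reject premise".
base_logits = [2.0, 0.0]
reject_ids = {1}

# Probe strongly signals a mismatch: rejection becomes the likelier token.
p_mismatch = softmax(adjust_logits(base_logits, 3.0, reject_ids))[1]

# Probe signals no mismatch: the original preference is largely preserved.
p_no_mismatch = softmax(adjust_logits(base_logits, -3.0, reject_ids))[1]

print(round(p_mismatch, 3), round(p_no_mismatch, 3))
```

Scaling the boost by the probe's confidence, rather than always boosting rejection, is one way to raise rejection on misleading questions without pushing the model toward rejecting standard ones.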
