Audio-visual large language models (AVLLMs) process audio, video, and text jointly, which lets them reason about the world in a way that resembles human perception. Because these models are complex, however, it is often unclear how they integrate information across modalities. This paper investigates the internal "information flow" of AVLLMs to determine where and how audio and visual information is combined within the model's internal token representations.
Tracking Information Flow
To trace this flow, the researchers developed a "unimodal dominance" framework. They identified scenarios in which the model relies heavily on one modality, for example identifying a sport from visual cues while the audio remains ambiguous. Using a technique called causal tracing, they "patched" (swapped) specific hidden states within the model to see which tokens carry information from the dominant modality to the other. This let them localize where the model stores integrated cross-modal information.
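The patching idea can be illustrated on a toy model. The sketch below is not the paper's setup: the "model" is a hypothetical mean-pool followed by a linear readout, and the names run_toy_model and trace_tokens are invented for illustration. It only shows the general pattern of copying one token's hidden state from a clean run into a corrupted run and measuring how much of the clean output is restored.

```python
import numpy as np

def run_toy_model(hidden, readout, patch=None):
    """Hypothetical forward pass: mean-pool token states, then a linear readout.
    `patch` maps a token index to a replacement hidden state."""
    hidden = hidden.copy()
    if patch:
        for idx, state in patch.items():
            hidden[idx] = state
    return float(hidden.mean(axis=0) @ readout)

def trace_tokens(clean_hidden, corrupt_hidden, readout):
    """Patch each clean token state into the corrupted run, one at a time,
    and return the fraction of the clean-vs-corrupt output gap restored."""
    clean_out = run_toy_model(clean_hidden, readout)
    corrupt_out = run_toy_model(corrupt_hidden, readout)
    gap = clean_out - corrupt_out
    if gap == 0:
        raise ValueError("clean and corrupted runs give the same output")
    effects = []
    for t in range(clean_hidden.shape[0]):
        patched_out = run_toy_model(
            corrupt_hidden, readout, patch={t: clean_hidden[t]}
        )
        effects.append((patched_out - corrupt_out) / gap)
    return np.array(effects)

# Toy data: only token 2 differs between the clean and corrupted runs.
clean = np.zeros((4, 3))
clean[2] = [1.0, 0.0, 0.0]
corrupt = np.zeros((4, 3))
readout = np.array([1.0, 0.0, 0.0])

effects = trace_tokens(clean, corrupt, readout)
print(effects)
```

In this toy case, only the token whose state differs between the two runs restores the output, which is the kind of signal used to decide which tokens carry the relevant information.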
The Role of Sink Tokens
The study reveals that AVLLMs do not store integrated audio-visual information uniformly across all tokens. Instead, this information is primarily concentrated in "sink tokens": specific tokens that receive disproportionately high attention weights during processing. These sink tokens are not all alike; they function as specialized hubs. Some are "unimodal" sink tokens that hold information only from their own modality, while others are "cross-modal sink tokens" that specialize in storing information derived from the other modality.
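One simple way to make "disproportionately high attention" concrete is to compare the attention each token receives with a uniform share. The following is a minimal sketch under that assumption; the function name find_sink_tokens and the ratio threshold are hypothetical, and the summary does not state the criterion the paper actually uses.

```python
import numpy as np

def find_sink_tokens(attn, ratio=3.0):
    """attn: (num_queries, num_keys) attention matrix whose rows sum to 1.
    Returns indices of key tokens whose mean received attention exceeds
    `ratio` times the uniform share 1 / num_keys (an assumed criterion)."""
    received = attn.mean(axis=0)       # average attention each key receives
    uniform = 1.0 / attn.shape[1]      # share under perfectly even attention
    return np.flatnonzero(received > ratio * uniform)

# Toy attention map: every query puts most of its weight on token 0.
attn = np.tile(np.array([0.85, 0.05, 0.05, 0.05]), (4, 1))

sinks = find_sink_tokens(attn)
print(sinks)
```

In a real model the attention map would typically be averaged over heads and layers first; how to do that, and how to then separate unimodal from cross-modal sink tokens, is left open here.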
Improving Model Reliability
Having identified cross-modal sink tokens as the primary carriers of integrated information, the researchers proposed a simple, training-free method to improve model performance. By steering the model's attention toward these tokens, they were able to enhance the integration of audio and visual inputs. This approach mitigates "object hallucinations," in which a model describes items or events that are not actually present in the input, and so yields more accurate and robust multimodal reasoning.
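The summary does not specify how the attention is steered. One common training-free approach, shown here purely as an assumed illustration, is to add a constant bias to the pre-softmax attention scores of the chosen tokens and renormalize. The names steer_attention, sink_idx, and boost are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def steer_attention(logits, sink_idx, boost=1.0):
    """Add `boost` to the pre-softmax scores of the chosen key tokens,
    then renormalize so every row is still a probability distribution."""
    steered_logits = logits.copy()
    steered_logits[:, sink_idx] += boost
    return softmax(steered_logits)

# Toy scores: 2 queries attending uniformly over 4 keys.
logits = np.zeros((2, 4))
baseline = softmax(logits)
steered = steer_attention(logits, sink_idx=[1], boost=1.0)

print(baseline[0])
print(steered[0])
```

Because the bias is applied before the softmax, the extra weight given to the selected tokens is taken proportionally from all other tokens, and no model parameters change.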