AI Research

Probing Cross-modal Information Hubs in Audio-Visua...

Paper Abstract

Audio-visual large language models (AVLLMs) have recently emerged as a powerful architecture capable of jointly reasoning over audio, visual, and textual modalities. In AVLLMs, the bidirectional interaction between audio and video modalities introduces intricate processing dynamics, necessitating a deeper understanding of their internal mechanisms. However, unlike extensively studied text-only or large vision language models, the internal workings of AVLLMs remain largely unexplored. In this paper, we focus on cross-modal information flow between audio and visual modalities in AVLLMs, investigating where information derived from one modality is encoded within the token representations of the other modality. Through an analysis of multiple recent AVLLMs, we uncover two common findings. First, AVLLMs primarily encode integrated audio-visual information in sink tokens. Second, sink tokens do not uniformly hold cross-modal information. Instead, a distinct subset of sink tokens, which we term cross-modal sink tokens, specializes in storing such information. Based on these findings, we further propose a simple training-free hallucination mitigation method by encouraging reliance on integrated cross-modal information within cross-modal sink tokens. Our code is available at this https URL.

Audio-visual large language models (AVLLMs) are designed to process audio, video, and text simultaneously, allowing them to reason about the world in a way that mirrors human perception. However, because these models are complex, it is often unclear how they actually integrate information from different senses. This paper investigates the internal "information flow" of AVLLMs to understand exactly where and how audio and visual data are combined within the model's internal token representations.

Tracking Information Flow

To understand how these models think, the researchers developed a "unimodal dominance" framework. They identified scenarios where the model relies heavily on one sense—for example, identifying a sport based on visual cues while the audio remains ambiguous. By using a technique called causal tracing, the researchers could "patch" or swap specific hidden states within the model to see which tokens were responsible for carrying information from the dominant sense to the other. This allowed them to pinpoint the exact locations where the model stores integrated cross-modal knowledge.
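The patching idea can be illustrated with a deliberately tiny toy. This is a sketch under stated assumptions, not the paper's implementation: the "model" is a hand-written function, and the names `VIDEO_TO_AUDIO_MIX`, `audio_hidden_states`, `toy_output`, and `patching_effects` are hypothetical. Each audio token absorbs a different share of a video summary. We run a clean input and a corrupted one (video information removed), then copy one audio token's clean hidden state into the corrupted run and measure how much of the clean output comes back.

```python
from typing import Dict, List, Optional

# Hypothetical mixing weights: how much of the video summary each audio
# token absorbs. These values are invented for illustration only.
VIDEO_TO_AUDIO_MIX = [0.0, 0.9, 0.1]


def audio_hidden_states(audio: List[float], video_summary: float) -> List[float]:
    """Toy 'layer': each audio token adds its share of the video summary."""
    return [a + m * video_summary for a, m in zip(audio, VIDEO_TO_AUDIO_MIX)]


def toy_output(audio: List[float], video_summary: float,
               patch: Optional[Dict[int, float]] = None) -> float:
    """Toy readout: sum of audio hidden states, with optional patched positions."""
    hidden = audio_hidden_states(audio, video_summary)
    for position, value in (patch or {}).items():
        hidden[position] = value  # overwrite with a state from another run
    return sum(hidden)


def patching_effects(audio: List[float], clean_video: float,
                     corrupt_video: float) -> List[float]:
    """Fraction of the clean-vs-corrupt output gap restored by patching each token."""
    clean_hidden = audio_hidden_states(audio, clean_video)
    clean_out = toy_output(audio, clean_video)
    corrupt_out = toy_output(audio, corrupt_video)
    gap = clean_out - corrupt_out
    if gap == 0:
        raise ValueError("clean and corrupted runs produce the same output")
    effects = []
    for position in range(len(audio)):
        patched_out = toy_output(audio, corrupt_video,
                                 patch={position: clean_hidden[position]})
        effects.append((patched_out - corrupt_out) / gap)
    return effects


effects = patching_effects([0.5, 0.5, 0.5], clean_video=2.0, corrupt_video=0.0)
most_informative = max(range(len(effects)), key=lambda i: effects[i])
print(effects, most_informative)
```

In this toy, the token with the largest mixing weight restores most of the output gap, which is the kind of signal used to locate where cross-modal information is stored. In a real AVLLM the hidden states would come from the network's layers rather than a closed-form function.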

The Role of Sink Tokens

The study reveals that AVLLMs do not store integrated audio-visual information uniformly across all tokens. Instead, this information is primarily concentrated in "sink tokens"—specific tokens that receive disproportionately high attention weights during processing. The researchers discovered that these sink tokens are not all the same; they function as specialized hubs. Some sink tokens are "unimodal," focusing only on their native sense, while others are "cross-modal sink tokens" that specialize in storing information derived from the other modality.
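As a rough illustration of "disproportionately high attention," the sketch below flags tokens whose average received attention exceeds a multiple of the uniform share. The function name `find_sink_tokens` and the `ratio` threshold are assumptions made for this example. The summary does not state the exact criterion the paper uses to identify sink tokens.

```python
from typing import List


def find_sink_tokens(attention: List[List[float]], ratio: float = 2.0) -> List[int]:
    """Return key positions that receive far more attention than a uniform share.

    `attention[q][k]` is the weight query q places on key k (each row sums to 1).
    `ratio` is a hypothetical threshold, not a value taken from the paper.
    """
    n_queries = len(attention)
    n_keys = len(attention[0])
    # Average attention each key position receives across all queries.
    received = [sum(row[k] for row in attention) / n_queries for k in range(n_keys)]
    uniform_share = 1.0 / n_keys
    return [k for k, r in enumerate(received) if r > ratio * uniform_share]


# Invented attention map for one head: position 0 absorbs most of the weight.
toy_attention = [
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.70, 0.10, 0.10, 0.10],
    [0.80, 0.10, 0.05, 0.05],
]
sinks = find_sink_tokens(toy_attention)
print(sinks)
```

Separating unimodal sinks from cross-modal ones would need an additional signal, such as the patching effect of each sink token. Attention mass alone does not distinguish them.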

Improving Model Reliability

Having identified cross-modal sink tokens as the primary carriers of integrated information, the researchers propose a simple, training-free method for mitigating hallucinations. The method steers the model's attention toward these cross-modal sink tokens, encouraging the model to rely on the integrated audio-visual information they hold. According to the paper, this mitigates "object hallucinations," where a model describes items or events that are not actually present in the input.
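One common way to steer attention without retraining is to add a bias to the pre-softmax attention scores at chosen positions. The sketch below shows that generic technique. It is an assumption about how such steering could work, not the paper's exact procedure, and `steer_attention` and `boost` are hypothetical names.

```python
import math
from typing import Iterable, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_attention(logits: List[float], sink_positions: Iterable[int],
                    boost: float = 1.0) -> List[float]:
    """Add `boost` to the scores of the chosen positions, then renormalize.

    `sink_positions` stands in for the cross-modal sink tokens found earlier;
    `boost` is a hypothetical steering strength with no value from the paper.
    """
    chosen = set(sink_positions)
    biased = [x + boost if i in chosen else x for i, x in enumerate(logits)]
    return softmax(biased)


scores = [0.0, 0.0, 0.0, 0.0]
baseline = softmax(scores)
steered = steer_attention(scores, sink_positions=[2], boost=math.log(3))
print(baseline, steered)
```

Because the bias is applied before the softmax, the weights still sum to one. Attention shifts toward the chosen positions and away from the others in proportion.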
