How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech
Modern text-to-speech (TTS) systems let users control voice characteristics, such as tone, emotion, or pace, through natural language captions. These systems produce high-quality audio, but it has been unclear how individual words in a caption influence the final sound. This paper introduces a method to visualize and measure that influence, providing the first detailed analysis of how natural language instructions shape speech in diffusion-based models.
Tracking Influence with Cross-Attention
To understand how a model "listens" to a caption, the researchers adapted Diffusion Attentive Attribution Maps (DAAM), a technique originally designed for image generation. By intercepting the model's cross-attention mechanisms (the pathways where the model connects text to audio), the team built heatmaps that track how much attention the model pays to each caption word across layers and generation steps. They analyzed 3,600 combinations of style captions and text transcripts, examining 25 transformer layers and 24 generation steps to see how the model builds the final waveform.
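The aggregation idea behind such heatmaps can be sketched in a few lines. This is a simplified illustration, not the paper's exact procedure: the nested-list layout `attn[step][layer][frame][token]` and the function name `token_heatmaps` are assumptions made for the example, and a plain mean over steps and layers stands in for whatever weighting the authors actually use.

```python
from typing import List

# Assumed layout (hypothetical): attn[step][layer][frame][token] is the
# cross-attention weight that one audio frame places on one caption token.
Attn = List[List[List[List[float]]]]


def token_heatmaps(attn: Attn) -> List[List[float]]:
    """Average cross-attention over generation steps and layers.

    Returns maps[token][frame], the mean attention each audio frame
    pays to each caption token.
    """
    n_steps = len(attn)
    n_layers = len(attn[0])
    n_frames = len(attn[0][0])
    n_tokens = len(attn[0][0][0])

    maps = [[0.0] * n_frames for _ in range(n_tokens)]
    for step in attn:
        for layer in step:
            for f, row in enumerate(layer):
                for t, weight in enumerate(row):
                    maps[t][f] += weight

    scale = 1.0 / (n_steps * n_layers)
    return [[value * scale for value in token_map] for token_map in maps]


# Toy input: 2 steps, 2 layers, 3 frames, 2 caption tokens.
toy = [[[[0.25, 0.75] for _ in range(3)] for _ in range(2)] for _ in range(2)]
maps = token_heatmaps(toy)  # maps[0] is the 3-frame heatmap for token 0
```

In the study's setting the outer dimensions would be 24 steps and 25 layers; keeping the step and layer axes separate instead of averaging them is what allows the per-step and per-layer analyses described later.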
Global vs. Local Control
The study reveals that the model treats different types of words in distinct ways. Style-related words, such as "calm" or "harsh," act as global modulators: their attention varies very little over time, so they influence the entire audio output roughly uniformly rather than focusing on a single moment. In contrast, function words (like "the" or "and") show higher temporal variance, suggesting they serve more localized, structural roles. Attention to style words also correlates strongly with measurable acoustic features such as pitch (F0) and energy. For example, attention to the word "loud" is consistently higher where the generated audio reaches its peak energy.
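The two measurements mentioned here, temporal variance of a token's heatmap and its correlation with an acoustic contour, can be illustrated with a minimal sketch. The function names and the choice of population variance and Pearson correlation are assumptions for illustration; the summary does not specify which estimators the authors used.

```python
from math import sqrt
from typing import Sequence


def temporal_variance(heatmap: Sequence[float]) -> float:
    """Population variance of one token's attention across audio frames.

    A value near zero means the token is attended to evenly over time,
    which is the "global modulator" pattern described for style words.
    """
    mean = sum(heatmap) / len(heatmap)
    return sum((v - mean) ** 2 for v in heatmap) / len(heatmap)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between a heatmap and an acoustic contour
    (for example frame-level energy or F0), both sampled per frame."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        # Correlation is undefined for a constant series; report 0.0 here.
        return 0.0
    return cov / (norm_x * norm_y)
```

Note that a perfectly flat heatmap has no defined correlation with anything, so in practice the two statistics answer different questions: variance describes how evenly a word is attended to, while correlation describes whether the small remaining fluctuations follow the acoustics.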
Hierarchical Generation Dynamics
The results show a clear "coarse-to-fine" strategy in how the model processes instructions. Style conditioning is most intense during the early stages of generation (the first ODE steps) and becomes more refined in the deeper transformer layers. The researchers found that the model reaches its peak "selectivity" at layer 17, where it narrows its focus to the most critical style-defining information. As generation nears completion, the model shifts its attention away from broad style descriptors and toward function words, which help finalize the timing and phrasing of the speech.
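Both trends can be made concrete with a small sketch. Everything here is hypothetical: the summary does not say how "selectivity" is defined, so the example assumes one plausible choice (one minus the normalized entropy of the attention distribution over caption tokens), and the function names and input layouts are invented for illustration.

```python
from math import log
from typing import List, Sequence, Set


def style_share_per_step(
    mass: Sequence[Sequence[float]], style_idx: Set[int]
) -> List[float]:
    """Fraction of total attention mass that falls on style tokens.

    mass[step][token] is the attention a caption token receives at one
    generation step; style_idx holds the positions of style words.
    """
    shares = []
    for step_mass in mass:
        total = sum(step_mass)
        style = sum(w for i, w in enumerate(step_mass) if i in style_idx)
        shares.append(style / total if total > 0 else 0.0)
    return shares


def selectivity(weights: Sequence[float]) -> float:
    """Assumed definition: 1 - normalized entropy over caption tokens.

    0.0 means attention is spread uniformly; 1.0 means it sits on one token.
    """
    total = sum(weights)
    n = len(weights)
    if n < 2 or total <= 0:
        raise ValueError("need at least two tokens with positive total mass")
    entropy = -sum((w / total) * log(w / total) for w in weights if w > 0)
    return 1.0 - entropy / log(n)


def peak_layer(per_layer_weights: Sequence[Sequence[float]]) -> int:
    """0-based index of the layer with the highest selectivity score."""
    scores = [selectivity(w) for w in per_layer_weights]
    return max(range(len(scores)), key=scores.__getitem__)
```

Under these assumptions, a style share that falls from early to late steps would match the reported shift toward function words, and `peak_layer` would return the layer at which attention is most concentrated.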
Implications for Expressive TTS
The findings give developers a way to diagnose why a TTS system fails to capture a requested style. Knowing that style tokens act as global controllers whose influence peaks in early generation steps, developers can better understand how to tune these models for improved controllability. The work is the first to map how speech diffusion models internally respond to caption instructions, offering a foundation for building more predictable and expressive voice synthesis technology.