FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS
Modern text-to-speech (TTS) systems are highly capable but suffer from a major limitation: once deployed, they are static. If a model mispronounces a specific proper noun or foreign word, it will continue to do so indefinitely unless the entire system is retrained, which is both expensive and risky. FlowEdit introduces a way to fix these pronunciation errors in frozen, already-deployed models without needing to change the model's internal weights or risk "catastrophic forgetting," where fixing one word accidentally breaks the model's ability to speak other words correctly.
How FlowEdit Works
Instead of modifying the model's core architecture, FlowEdit treats pronunciation correction like a form of speech therapy. When a user provides a correct audio example of a word, the system calculates a small "perturbation" vector: a subtle adjustment to the word's text embedding that guides the frozen model toward the correct pronunciation.
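The idea of learning a perturbation while the model stays frozen can be sketched with a toy example. The summary does not give FlowEdit's actual objective or optimizer, so everything here is an assumption for illustration: the "model" is a fixed linear map W, the loss is a squared error against a target output, and the names frozen_model and learn_perturbation are hypothetical.

```python
# Toy sketch: optimize an input perturbation while the model weights stay fixed.
# Assumption: a linear map stands in for the frozen TTS model.

def frozen_model(W, x):
    """Apply the fixed weight matrix W to input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def learn_perturbation(W, embedding, target, steps=200, lr=0.1):
    """Gradient descent on delta to minimize 0.5 * ||W(e + delta) - target||^2.

    Only delta is updated; W is read but never written.
    """
    delta = [0.0] * len(embedding)
    for _ in range(steps):
        x = [e + d for e, d in zip(embedding, delta)]
        residual = [o - t for o, t in zip(frozen_model(W, x), target)]
        # Gradient with respect to delta is W^T @ residual.
        grad = [
            sum(W[i][j] * residual[i] for i in range(len(W)))
            for j in range(len(delta))
        ]
        delta = [d - lr * g for d, g in zip(delta, grad)]
    return delta

W = [[1.0, 0.0], [0.0, 2.0]]      # frozen weights (illustrative)
embedding = [1.0, 1.0]            # original text embedding (illustrative)
target = [2.0, 1.0]               # output for the "correct" pronunciation
delta = learn_perturbation(W, embedding, target)
```

In this toy setup the learned delta moves the input so that the unchanged model produces the target output, which mirrors the claim that the correction lives in the input space and not in the weights.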
These adjustments are stored in a Modern Hopfield Network, which acts as an external, content-addressable memory. When the model encounters that word again, it retrieves the stored correction and applies it to the input. Because this process happens in the text embedding space rather than the model's weight matrices, the base model remains completely untouched and stable.
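A minimal sketch of content-addressable retrieval in the style of a Modern Hopfield Network follows. It uses the standard softmax update over stored patterns, with keys and values kept separate as in attention. The class name CorrectionMemory, the inverse temperature beta=8.0, and the choice to store keys and perturbations in the same dimension are assumptions, not details from FlowEdit.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class CorrectionMemory:
    """Stores (key, perturbation) pairs and retrieves them by soft attention."""

    def __init__(self, beta=8.0):
        self.beta = beta      # inverse temperature: higher means sharper retrieval
        self.keys = []        # unit-normalized key embeddings
        self.values = []      # stored perturbation vectors

    def store(self, key, perturbation):
        self.keys.append(normalize(key))
        self.values.append(list(perturbation))

    def retrieve(self, query):
        """Return (attention weights, attention-weighted mix of perturbations)."""
        if not self.keys:
            return [], [0.0] * len(query)
        q = normalize(query)
        weights = softmax([self.beta * dot(k, q) for k in self.keys])
        dim = len(self.values[0])
        mixed = [
            sum(w * v[j] for w, v in zip(weights, self.values))
            for j in range(dim)
        ]
        return weights, mixed

memory = CorrectionMemory()
memory.store([1.0, 0.0, 0.0], [0.5, -0.2, 0.0])
memory.store([0.0, 1.0, 0.0], [-0.3, 0.1, 0.4])
weights, correction = memory.retrieve([0.9, 0.1, 0.0])
corrected_embedding = [e + c for e, c in zip([0.9, 0.1, 0.0], correction)]
```

The query is close to the first stored key, so nearly all of the attention weight lands on the first perturbation. Adding a new correction only appends to the memory; nothing in the base model is modified.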
Intelligent Retrieval
A key feature of FlowEdit is its use of "fuzzy morphological matching." Because the system uses soft attention to retrieve corrections, it doesn't require a perfect, word-for-word match. For example, if the system learns the correct pronunciation for "Linux," it can intelligently apply that knowledge to variations like "Linux's" or "Linuxed." A similarity gate ensures that these corrections are only applied when relevant, preventing the system from accidentally altering the pronunciation of unrelated words.
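The gating behaviour can be illustrated with a small sketch. The summary does not say how FlowEdit represents words or sets its gate, so this example makes loud assumptions: words are compared by cosine similarity over character trigrams, the gate is a hard threshold of 0.65, and the function names are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(word, n=3):
    """Count character n-grams of a lowercased word with boundary markers."""
    padded = f"^{word.lower()}$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    shared = sum(a[g] * b[g] for g in a if g in b)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return shared / norm if norm else 0.0

def gated_perturbation(word, stored_word, perturbation, threshold=0.65):
    """Return the stored perturbation only if the word is similar enough."""
    sim = cosine(char_ngrams(word), char_ngrams(stored_word))
    if sim < threshold:
        return [0.0] * len(perturbation), sim  # gate closed: no change
    return list(perturbation), sim             # gate open: apply correction

delta = [0.5, -0.2]  # illustrative stored correction for "Linux"
applied, sim_variant = gated_perturbation("Linux's", "Linux", delta)
skipped, sim_other = gated_perturbation("window", "Linux", delta)
```

Under these assumptions "Linux's" shares most of its trigrams with "Linux" and passes the gate, while "window" shares none and receives a zero perturbation. The threshold would need tuning in practice, since short words with similar spellings can score close to true morphological variants.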
Performance and Efficiency
FlowEdit significantly outperforms traditional methods like fine-tuning or lexicon overrides. On a benchmark of 312 multilingual proper nouns, it reduced the Phoneme Error Rate by 92.7% compared to the baseline. Perhaps most importantly, it maintains "zero forgetting": the model's performance on general speech remains identical to its original state, whereas traditional fine-tuning often degrades the model's overall quality.
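For readers unfamiliar with the metric, Phoneme Error Rate is commonly computed as the edit distance between the reference and predicted phoneme sequences, divided by the reference length. The sketch below shows that common definition; the benchmark's exact scoring procedure is not given in this summary, the phoneme sequences are made up for illustration, and reading the 92.7% figure as a relative reduction is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]

def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)

def relative_reduction(baseline, improved):
    """Fraction of the baseline error that was removed."""
    return (baseline - improved) / baseline

# Illustrative sequences only, not taken from the benchmark.
reference = ["l", "ɪ", "n", "ə", "k", "s"]
mispronounced = ["l", "aɪ", "n", "ə", "k", "s"]
per = phoneme_error_rate(reference, mispronounced)
```

Here one substituted phoneme out of six gives a PER of 1/6 for this single word; a benchmark score would aggregate such counts over all test items.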
The process is also highly efficient. Corrections can be generated in approximately 15 seconds on a single GPU, and the memory-efficient design allows the system to scale to thousands of corrections without significant latency. Because the corrections are learned in a speaker-agnostic text space, a single correction can be applied across different voices, making it a practical solution for multi-speaker TTS deployments.