Key Takeaways

  • Flow-matching text-to-speech systems achieve remarkable zero-shot quality but remain static after deployment: pronunciation errors on out-of-vocabulary proper nouns persist unless the model is retrained.
  • We introduce FlowEdit, a life-long adaptation framework for frozen flow-matching TTS that learns pronunciation corrections as latent conditioning edits rather than weight updates.
  • When corrective feedback is provided, FlowEdit optimizes a token-level perturbation in the text embedding space, then stores the correction in a Modern Hopfield Network serving as content-addressable episodic memory.
  • At inference, corrections are retrieved via soft attention with a similarity gate, enabling fuzzy morphological matching.
Paper Abstract

Flow-matching text-to-speech systems achieve remarkable zero-shot quality but remain static after deployment: pronunciation errors on out-of-vocabulary proper nouns persist unless the model is retrained. We introduce FlowEdit, a life-long adaptation framework for frozen flow-matching TTS that learns pronunciation corrections as latent conditioning edits rather than weight updates. When corrective feedback is provided, FlowEdit optimizes a token-level perturbation in the text embedding space, then stores the correction in a Modern Hopfield Network serving as content-addressable episodic memory. At inference, corrections are retrieved via soft attention with a similarity gate, enabling fuzzy morphological matching. On our curated benchmark of 312 multilingual proper nouns across 18 language families, FlowEdit reduces target-word Phoneme Error Rate by 92.7% relative to the zero-shot baseline while maintaining identical general-speech quality. Corrections complete in approximately 15 seconds on a single GPU.

FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS
Modern text-to-speech (TTS) systems are highly capable but suffer from a major limitation: once deployed, they are static. If a model mispronounces a specific proper noun or foreign word, it will continue to do so indefinitely unless the entire system is retrained, which is both expensive and risky. FlowEdit fixes these pronunciation errors in frozen, already-deployed models without changing the model's weights, and therefore without risking "catastrophic forgetting," where fixing one word accidentally breaks the model's ability to pronounce other words correctly.

How FlowEdit Works

Instead of modifying the model's weights, FlowEdit treats pronunciation correction like a form of speech therapy. When a user provides a correct audio example of a word, the system optimizes a small token-level "perturbation" vector: a subtle adjustment to the word's text embedding that guides the model toward the correct pronunciation.
These adjustments are stored in a Modern Hopfield Network, which acts as an external, content-addressable memory. When the model encounters that word again, it retrieves the stored correction and applies it to the text embedding. Because this process happens in the text embedding space rather than in the model's weight matrices, the base model remains completely untouched and stable.
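The idea of learning an input-side edit while the model stays frozen can be illustrated with a toy sketch. Everything here is an assumption made for illustration: a fixed linear map stands in for the frozen TTS model, a squared-error loss stands in for whatever objective FlowEdit actually uses, and the names `apply_frozen_model` and `learn_perturbation` are hypothetical.

```python
# Toy illustration only: a fixed linear map stands in for the frozen TTS model.
def apply_frozen_model(weights, embedding):
    """Frozen 'model': maps a text embedding to output features."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]


def learn_perturbation(weights, embedding, target, steps=500, lr=0.05):
    """Gradient descent on the perturbation only; the weights are never modified."""
    delta = [0.0] * len(embedding)
    for _ in range(steps):
        shifted = [e + d for e, d in zip(embedding, delta)]
        output = apply_frozen_model(weights, shifted)
        residual = [o - t for o, t in zip(output, target)]
        # Gradient of 0.5 * ||W(e + delta) - target||^2 with respect to delta
        # is W^T applied to the residual.
        grad = [
            sum(weights[i][j] * residual[i] for i in range(len(weights)))
            for j in range(len(delta))
        ]
        delta = [d - lr * g for d, g in zip(delta, grad)]
    return delta


# Hypothetical values, not taken from the paper.
WEIGHTS = [[1.0, 0.5], [0.0, 2.0]]   # frozen parameters
token_embedding = [1.0, 1.0]         # embedding of the mispronounced token
target_features = [2.0, 1.0]         # features of the correct pronunciation

delta = learn_perturbation(WEIGHTS, token_embedding, target_features)
corrected = apply_frozen_model(
    WEIGHTS, [e + d for e, d in zip(token_embedding, delta)]
)
```

Only `delta` changes during optimization, so any input that does not receive the perturbation is processed exactly as before. That is the property the "frozen model" design relies on.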

Intelligent Retrieval

A key feature of FlowEdit is its use of "fuzzy morphological matching." Because the system retrieves corrections via soft attention, it doesn't require an exact match on the stored word. For example, if the system learns the correct pronunciation for "Linux," it can apply the same correction to variants like "Linux's" or "Linuxed." A similarity gate ensures that corrections are applied only when the query is close enough to a stored entry, preventing the system from accidentally altering the pronunciation of unrelated words.
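Gated soft-attention retrieval of this kind might look like the sketch below. It uses the standard Modern Hopfield readout (a softmax over query-key similarities), but the cosine similarity, the `beta` and `gate` values, the three-dimensional keys, and the function names are all assumptions chosen for illustration; the summary does not specify how FlowEdit encodes keys or sets its threshold.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, keys, values, beta=8.0, gate=0.8):
    """Return a softmax-weighted blend of stored corrections, or None if gated off."""
    sims = [cosine(query, k) for k in keys]
    if not sims or max(sims) < gate:
        return None  # nothing similar enough: leave the input unchanged
    peak = max(sims)
    # Softmax over beta-scaled similarities (shifted by the peak for stability).
    weights = [math.exp(beta * (s - peak)) for s in sims]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


# Hypothetical memory with two stored corrections.
keys = [[1.0, 0.0, 0.0],     # key for "Linux"
        [0.0, 1.0, 0.0]]     # key for some other corrected word
values = [[0.75, -0.5],      # perturbation stored for "Linux"
          [0.10, 0.20]]      # perturbation stored for the other word

variant_query = [0.95, 0.05, 0.10]    # stands in for "Linux's"
unrelated_query = [0.10, 0.10, 1.00]  # stands in for an unrelated word

variant_correction = retrieve(variant_query, keys, values)
unrelated_correction = retrieve(unrelated_query, keys, values)
```

In this toy setup the variant query lies close to the "Linux" key, so the readout is dominated by the stored "Linux" perturbation, while the unrelated query falls below the gate and receives no correction.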

Performance and Efficiency

FlowEdit significantly outperforms traditional methods like fine-tuning or lexicon overrides. On a benchmark of 312 multilingual proper nouns spanning 18 language families, it reduced the target-word Phoneme Error Rate by 92.7% relative to the zero-shot baseline. Perhaps most importantly, it maintains "zero forgetting": the model's performance on general speech remains identical to its original state, whereas traditional fine-tuning often degrades the model's overall quality.
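For readers unfamiliar with the metric, Phoneme Error Rate is conventionally the edit distance between the reference and predicted phoneme sequences, divided by the reference length, and a relative reduction compares two such rates. The sketch below shows that arithmetic on one word. The phoneme sequences are hypothetical ARPAbet-style examples, not data from the paper, and the paper's exact scoring pipeline may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Fraction of the baseline error that was removed."""
    return (baseline - improved) / baseline


# Hypothetical phoneme sequences for a single word.
reference = ["L", "IH", "N", "AH", "K", "S"]
zero_shot = ["L", "AY", "N", "UH", "K", "S"]   # two substitutions
corrected = ["L", "IH", "N", "UH", "K", "S"]   # one substitution remains

per_before = phoneme_error_rate(reference, zero_shot)
per_after = phoneme_error_rate(reference, corrected)
reduction = relative_reduction(per_before, per_after)
```

Here the error rate falls from 2/6 to 1/6, a 50% relative reduction. The reported 92.7% is the same kind of ratio, computed over the full benchmark.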
The process is also efficient. A correction can be learned in approximately 15 seconds on a single GPU, and the memory-efficient design allows the system to scale to thousands of corrections without adding significant latency. Because the corrections are learned in a speaker-agnostic text space, a single correction can be applied across different voices, making it a practical solution for multi-speaker TTS deployments.
