Kyutai Releases Hibiki: A 2.7B Real-Time Speech-to-Speech and Speech-to-Text Translation Model with Near-Human Quality and Voice Transfer
Key Takeaways
- Kyutai has introduced Hibiki, a 2.7 billion-parameter decoder model for real-time speech-to-speech (S2ST) and speech-to-text (S2TT) translation, currently supporting French-to-English.
- Hibiki operates at a 12.5 Hz frame rate (one frame per 80 ms) and uses a neural audio codec to compress audio efficiently while maintaining fidelity.
- It employs contextual alignment, leveraging a text translation model's perplexity to optimize timing for speech generation and dynamically adjust translation delays.
- Hibiki achieves an ASR-BLEU score of 30.5, surpassing existing baselines, and human evaluators rate its naturalness at 3.73 out of 5, approaching the ratings given to human interpreters.
- The model, along with a distilled version for smartphones, is open-source under a permissive CC-BY license, potentially advancing multilingual communication.
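
The frame-rate and contextual-alignment takeaways above can be made concrete with a small sketch. The article does not spell out the alignment procedure, so everything here is an assumption for illustration: the `logp` table stands in for scores from a text translation model (log-probability of each target word given a growing prefix of source words), each target word is tied to the source word that raises its log-probability the most, and the source word timestamps are hypothetical inputs. Only the 12.5 Hz frame rate comes from the article.

```python
import math

FRAME_RATE_HZ = 12.5                     # frame rate stated in the article
FRAME_MS = 1000 / FRAME_RATE_HZ          # duration of one frame in milliseconds


def seconds_to_frame(t: float) -> int:
    """Index of the first frame that starts at or after time t (in seconds)."""
    return math.ceil(t * FRAME_RATE_HZ)


def align_targets(logp):
    """Assign each target word to a source word (1-based index).

    logp[j][i] is a hypothetical log-probability of target word j given the
    first i source words (i = 0 means no source context). A target word is
    aligned to the source word whose arrival gives the largest gain, and the
    alignment is forced to be monotonic so output never moves back in time.
    """
    alignment = []
    latest = 1
    for row in logp:
        gains = [row[i] - row[i - 1] for i in range(1, len(row))]
        best = 1 + max(range(len(gains)), key=gains.__getitem__)
        latest = max(latest, best)
        alignment.append(latest)
    return alignment


def emission_frames(alignment, source_word_end_s):
    """Earliest frame at which each target word may be emitted."""
    return [seconds_to_frame(source_word_end_s[i - 1]) for i in alignment]


# Toy scores for 3 target words and 3 source words (made-up numbers).
toy_logp = [
    [-5.0, -0.5, -0.4, -0.4],
    [-6.0, -5.5, -5.0, -0.3],
    [-4.0, -3.9, -0.6, -0.5],
]
toy_source_ends = [0.5, 1.0, 1.5]        # hypothetical word end times, seconds

toy_alignment = align_targets(toy_logp)
toy_frames = emission_frames(toy_alignment, toy_source_ends)
```

With these toy numbers, the second target word only becomes likely once the third source word has been heard, so it waits for that word; the third target word is then held back as well to keep the output order monotonic. A real system would obtain the scores from an actual translation model and could add a safety margin to the delays.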