Kyutai Releases Hibiki: A 2.7B Real-Time Speech-to-Speech and Speech-to-Text Translation Model with Near-Human Quality and Voice Transfer

Key Takeaways

  • Kyutai has introduced Hibiki, a 2.7 billion-parameter decoder-only model for real-time speech-to-speech translation (S2ST) and speech-to-text translation (S2TT), currently supporting French-to-English.
  • Hibiki operates at a 12.5 Hz frame rate and uses a neural audio codec to compress audio efficiently while maintaining fidelity.
  • It employs contextual alignment, leveraging a text translation model's perplexity to optimize timing for speech generation and dynamically adjust translation delays.
  • Hibiki achieves an ASR-BLEU score of 30.5, surpassing existing baselines, and human evaluators rate its naturalness at 3.73/5, approaching the ratings given to human interpreters.
  • The model, along with a distilled version for smartphones, is released as open source under a permissive CC-BY license, potentially advancing multilingual communication.
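
The contextual alignment idea in the takeaways can be illustrated with a minimal sketch. It assumes a text translation model has already scored each target word against growing prefixes of the source sentence. The `logp` matrix, the largest-gain rule, the monotonic constraint, and both function names are assumptions made for illustration, not Kyutai's released code. At 12.5 Hz, each frame covers 80 ms of audio, so a source timestamp maps to a frame index by multiplying by the frame rate.

```python
import math
from typing import List


def contextual_alignment(logp: List[List[float]]) -> List[int]:
    """Pick, for each target word, how many source words must be heard first.

    logp[i][j] is a hypothetical log-probability of target word i given the
    first j source words (j = 0 .. number of source words), as scored by some
    text translation model. This sketch aligns each target word to the source
    word whose arrival raises its log-probability the most.
    """
    alignment = []
    latest = 0
    for row in logp:
        # Gain in log-probability from revealing source word j (1-based).
        gains = [row[j] - row[j - 1] for j in range(1, len(row))]
        best = max(range(len(gains)), key=gains.__getitem__) + 1
        # Assumption: the output stays in order, so never go backwards.
        latest = max(latest, best)
        alignment.append(latest)
    return alignment


def earliest_frames(alignment: List[int],
                    source_word_end_s: List[float],
                    frame_rate_hz: float = 12.5) -> List[int]:
    """Convert aligned source positions into the earliest output frame index."""
    return [math.ceil(source_word_end_s[j - 1] * frame_rate_hz)
            for j in alignment]


# Toy scores: 3 target words, 3 source words (4 prefix lengths each).
toy_logp = [
    [-5.0, -1.0, -0.9, -0.8],  # becomes likely after source word 1
    [-6.0, -5.5, -5.0, -0.5],  # needs source word 3
    [-4.0, -3.8, -0.6, -0.5],  # needs source word 2, held back to 3
]
toy_alignment = contextual_alignment(toy_logp)
toy_frames = earliest_frames(toy_alignment, [0.5, 1.0, 1.5])
```

In this toy case the second target word depends on the last source word, so it and every later word wait until that word has been spoken. This is the kind of per-word, content-dependent delay the takeaway describes.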
