Kyutai Releases Hibiki: A 2.7B Real-Time Speech-to-Speech and Speech-to-Text Translation Model with Near-Human Quality and Voice Transfer

Kyutai has introduced Hibiki, a 2.7 billion-parameter decoder model for real-time speech-to-speech translation (S2ST) and speech-to-text translation (S2TT), currently supporting French-to-English. Hibiki operates at a frame rate of 12.5 Hz and uses a neural audio codec to compress audio efficiently while maintaining fidelity.
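
To make the 12.5 Hz figure concrete, the arithmetic below works out how much audio one frame covers and how many frames a clip of a given length needs. It is a minimal sketch: only the frame rate comes from the article, and the helper names `frame_duration_ms` and `frames_for_audio` are hypothetical, not part of any Kyutai API.

```python
import math

FRAME_RATE_HZ = 12.5  # frame rate stated in the article


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Milliseconds of audio covered by one frame."""
    return 1000.0 / frame_rate_hz


def frames_for_audio(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of frames needed to cover `seconds` of audio (rounded up)."""
    return math.ceil(seconds * frame_rate_hz)


print(frame_duration_ms())     # 80.0 ms per frame
print(frames_for_audio(10.0))  # 125 frames for 10 seconds of audio
```

At this rate each frame spans 80 ms, so a model stepping once per frame handles only 12.5 steps per second of audio, which helps keep real-time generation tractable.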

It employs contextual alignment, which uses the perplexity of a text translation model to decide when each part of the translation should be generated, so that translation delays adjust dynamically to the input. Hibiki achieves an ASR-BLEU score of 30.5, surpassing existing baselines, and human evaluators rate its naturalness at 3.73 out of 5, approaching the ratings given to human interpreters.
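
The article does not spell out the alignment rule, so the following is only one plausible reading of "contextual alignment", sketched under stated assumptions. It assumes a text translation model has already produced `logp[j][i]`, the log-probability of target word `j` given the first `i` source words (column 0 being an empty source prefix). Each target word is then aligned to the source position where its log-probability rises the most, which corresponds to the largest drop in perplexity. The function name, the input layout, and the monotonicity step are all illustrative assumptions, not Kyutai's implementation.

```python
from typing import List


def contextual_alignment(logp: List[List[float]]) -> List[int]:
    """Return, for each target word, how many source words should be
    heard before that word is emitted.

    logp[j][i] is an assumed, precomputed log-probability of target
    word j given the first i source words (i = 0 means no source yet).
    """
    alignment: List[int] = []
    latest = 0  # delays are never allowed to move backward
    for row in logp:
        # Gain in log-probability from adding source word i.
        gains = [row[i] - row[i - 1] for i in range(1, len(row))]
        # Source position (1-based) that helps this target word the most.
        best = 1 + max(range(len(gains)), key=gains.__getitem__)
        latest = max(latest, best)
        alignment.append(latest)
    return alignment


# Toy scores for 3 target words over 3 source words (values are made up).
toy_logp = [
    [-5.0, -1.0, -0.9, -0.8],  # becomes predictable after source word 1
    [-6.0, -5.5, -5.0, -0.5],  # needs source word 3
    [-4.0, -3.8, -0.7, -0.6],  # needs source word 2, but cannot precede word 3
]
print(contextual_alignment(toy_logp))  # [1, 3, 3]
```

In this toy case the second target word depends on the last source word, so it and everything after it must wait. That is the kind of input-dependent delay the paragraph above describes.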

The model, along with a distilled version for smartphones, is open-source under a permissive CC-BY license, potentially advancing multilingual communication.