Kyutai has introduced Hibiki, a 2.7-billion-parameter decoder model for real-time speech-to-speech (S2ST) and speech-to-text (S2TT) translation, currently supporting French-to-English. Hibiki operates at a frame rate of 12.5 Hz and uses a neural audio codec to compress audio efficiently while maintaining fidelity.
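To make the 12.5 Hz figure concrete, the short sketch below works out what it implies for latency granularity and sequence length. Only the frame rate comes from the announcement; the function names and the 10-second example are illustrative assumptions, not part of Hibiki's code.

```python
import math

FRAME_RATE_HZ = 12.5  # frame rate stated for Hibiki


def frame_duration_ms(frame_rate_hz: float) -> float:
    """Duration of one model frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def num_frames(audio_seconds: float, frame_rate_hz: float) -> int:
    """Number of frames needed to cover a clip (last partial frame rounded up)."""
    return math.ceil(audio_seconds * frame_rate_hz)


print(frame_duration_ms(FRAME_RATE_HZ))  # 80.0 ms per frame
print(num_frames(10.0, FRAME_RATE_HZ))   # 125 frames for a 10-second clip (hypothetical length)
```

At this rate the model takes one step for every 80 ms of audio, so a 10-second utterance corresponds to only 125 steps.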
It employs contextual alignment, which uses a text translation model's perplexity to decide when each part of the translation can be spoken and to adjust the translation delay dynamically. Hibiki achieves an ASR-BLEU score of 30.5, surpassing existing baselines, and human evaluators rate its naturalness at 3.73/5, approaching that of human interpreters.
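The idea behind contextual alignment can be sketched roughly as follows. This is a simplified illustration, not Kyutai's implementation: it assumes we already have, for each target word, the log-probability a text translation model assigns to it after seeing 0, 1, 2, ... source words, and it aligns each target word to the point where that log-probability jumps the most. The function name `contextual_alignment`, the monotonicity step, and the toy numbers are all assumptions made for illustration.

```python
from typing import List, Sequence


def contextual_alignment(logp: Sequence[Sequence[float]]) -> List[int]:
    """Align each target word to a source prefix length.

    logp[j][i] is the (hypothetical) log-probability of target word j given
    the previous target words and the first i source words, for i = 0..n.
    Each target word is assigned the prefix length i whose newest source word
    produced the largest gain in log-probability. Alignments are then forced
    to be non-decreasing so the output never waits on an earlier point than
    the previous word did (an assumption of this sketch).
    """
    alignment: List[int] = []
    latest = 0
    for row in logp:
        # Gain from adding source word i (1-based) to the visible context.
        gains = [row[i] - row[i - 1] for i in range(1, len(row))]
        best = max(range(len(gains)), key=gains.__getitem__) + 1
        latest = max(latest, best)
        alignment.append(latest)
    return alignment


# Toy scores for 3 target words and 3 source words (made-up values).
toy_logp = [
    [-5.0, -1.0, -0.9, -0.9],  # word 0 becomes predictable after source word 1
    [-6.0, -5.5, -5.0, -0.5],  # word 1 needs source word 3
    [-4.0, -3.5, -1.0, -0.8],  # word 2 alone would need 2, but must follow word 1
]
print(contextual_alignment(toy_logp))  # [1, 3, 3]
```

In this toy case the second target word cannot be predicted confidently until the third source word has been heard, so the translation delay grows at that point and the following word inherits it.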
The model, along with a distilled version for smartphones, is open-source under a permissive CC-BY license, potentially advancing multilingual communication.