Tencent AI Lab has officially released Covo-Audio, a 7B-parameter Large Audio Language Model (LALM) designed to unify speech processing and language intelligence. Using a single end-to-end architecture, the system processes continuous audio inputs and generates high-fidelity audio outputs directly, removing the need for traditional cascaded pipelines built from separate ASR, LLM, and TTS components.
Architecture and Hierarchical Interleaving
The Covo-Audio framework builds on the Qwen2.5-7B-Base model, adapted to handle interleaved sequences of acoustic features and textual tokens. A Whisper-large-v3 encoder processes audio at 50 Hz, and a specialized audio adapter uses downsampling modules to reduce the frame rate to 6.25 Hz (an 8x reduction) for the LLM backbone. For output, the model employs a WavLM-large-based tokenizer and a Flow-Matching-based decoder with a BigVGAN vocoder to reconstruct 24 kHz waveforms.
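The frame-rate figures above imply a fixed reduction ratio between the encoder and the LLM backbone. The sketch below works through that arithmetic. The function names are hypothetical, and mean pooling is only an assumed stand-in for the adapter's downsampling modules, whose actual design the article does not describe.

```python
ENCODER_RATE_HZ = 50.0  # encoder output frame rate stated in the article
LLM_RATE_HZ = 6.25      # frame rate seen by the LLM backbone, per the article


def downsample_factor(src_hz: float, dst_hz: float) -> int:
    """Return the integer ratio between two frame rates."""
    factor = src_hz / dst_hz
    if factor != int(factor):
        raise ValueError("frame rates are not an integer multiple of each other")
    return int(factor)


def mean_pool(frames: list[list[float]], factor: int) -> list[list[float]]:
    """Average each full group of `factor` consecutive frames.

    Assumption: mean pooling is illustrative only; the real adapter may use
    learned layers. Trailing frames that do not fill a group are dropped.
    """
    pooled = []
    for start in range(0, len(frames) - factor + 1, factor):
        group = frames[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / factor for d in range(dim)])
    return pooled


factor = downsample_factor(ENCODER_RATE_HZ, LLM_RATE_HZ)
# Two seconds of dummy one-dimensional features at the encoder rate.
encoder_frames = [[float(i)] for i in range(int(2 * ENCODER_RATE_HZ))]
llm_frames = mean_pool(encoder_frames, factor)
print(factor, len(encoder_frames), len(llm_frames))  # prints: 8 100 12
```

At this ratio, each position the backbone sees summarizes 160 ms of audio, which keeps long utterances within a manageable sequence length.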
A core innovation of the model is its Hierarchical Tri-modal Speech-Text Interleaving strategy. This approach aligns continuous acoustic features, discrete speech tokens, and natural language text at both the phrase and sentence levels. Combining sequential interleaving with parallel integration patterns lets the model maintain structural coherence and preserve global semantic integrity during long-form utterances.
Full-Duplex Interaction and Intelligence Decoupling
To support real-time, simultaneous communication, the research team developed the Covo-Audio-Chat-FD variant. This system manages complex conversational dynamics through specific architectural tokens: the THINK token for listening, the SHIFT token for speaking, and the BREAK token for detecting barge-ins. The model processes audio in a chunk-streaming manner, with user and model streams interleaved in a 1:4 ratio to ensure smooth, full-duplex interaction.
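The control tokens and the 1:4 interleaving can be pictured with a small sketch. Everything here beyond the token names and the ratio is an assumption: the article does not say what unit the ratio counts, so the code treats it as one user chunk followed by four model-side slots, and the state transitions are one plausible reading of "listening", "speaking", and "barge-in". The names `State`, `step`, and `interleave` are hypothetical.

```python
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


# Assumed transition table: each control token forces a state.
CONTROL_TRANSITIONS = {
    "THINK": State.LISTENING,  # model keeps listening
    "SHIFT": State.SPEAKING,   # model takes the turn
    "BREAK": State.LISTENING,  # barge-in detected, model yields
}


def step(state: State, token: str) -> State:
    """Return the next state; non-control tokens leave the state unchanged."""
    return CONTROL_TRANSITIONS.get(token, state)


def interleave(user_chunks, model_chunks, user_per_block=1, model_per_block=4):
    """Merge two streams into blocks of `user_per_block` user items
    followed by `model_per_block` model items (assumed reading of 1:4)."""
    merged = []
    u = m = 0
    while u < len(user_chunks) or m < len(model_chunks):
        merged += [("user", c) for c in user_chunks[u:u + user_per_block]]
        u += user_per_block
        merged += [("model", c) for c in model_chunks[m:m + model_per_block]]
        m += model_per_block
    return merged


# Trace a short token stream through the state machine.
state = State.LISTENING
trace = []
for tok in ["THINK", "SHIFT", "a1", "a2", "BREAK", "THINK"]:
    state = step(state, tok)
    trace.append(state.value)
print(trace)

print(interleave(["u0", "u1"], [f"m{i}" for i in range(8)]))
```

In this reading, a BREAK emitted while the model is speaking returns it to the listening state, which is how an interruption would cut off the current response.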
Furthermore, the team introduced an Intelligence-Speaker Decoupling strategy to allow for flexible voice customization. By reformatting high-quality TTS recordings into pseudo-conversations with masked text loss, the model can inherit the naturalness of a specific speaker without requiring extensive, speaker-specific dialogue datasets. This method preserves the model's reasoning capabilities while significantly reducing the costs associated with building large-scale dialogue data.
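A masked text loss can be illustrated as follows. This is a minimal sketch, assuming that text positions in the pseudo-conversation are excluded from the training loss while speech positions still contribute; the article does not spell out the exact masking scheme, and `masked_nll` and the `"text"`/`"speech"` labels are hypothetical.

```python
import math


def masked_nll(token_log_probs, token_kinds, masked_kind="text"):
    """Mean negative log-likelihood over positions whose kind is not masked.

    token_log_probs: log-probability the model assigned to each target token.
    token_kinds: a label per position, e.g. "text" or "speech" (assumed labels).
    """
    kept = [lp for lp, kind in zip(token_log_probs, token_kinds)
            if kind != masked_kind]
    if not kept:
        return 0.0  # nothing contributes to the loss
    return -sum(kept) / len(kept)


log_probs = [math.log(0.5), math.log(0.25), math.log(0.5)]
kinds = ["text", "speech", "speech"]

# Only the two speech positions contribute; the text position is ignored.
loss = masked_nll(log_probs, kinds)
print(round(loss, 4))
```

Because the text positions carry no gradient in this setup, the TTS data would shape how the model sounds without pulling its text-side behavior toward the TTS transcripts.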
Reasoning and Performance
Covo-Audio incorporates Chain-of-Thought reasoning and Group Relative Policy Optimization to enhance its performance on complex tasks. The model is optimized using a verifiable composite reward function that accounts for accuracy, format adherence, consistency, and reasoning depth. In evaluations, the 7B-scale model achieved a 75.30% average score on the MMAU benchmark and a 66.64% average accuracy on the MMSU benchmark. Additionally, the Covo-Audio-Chat variant demonstrated state-of-the-art results for empathetic interaction in Mandarin on the VStyle benchmark.
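The core idea of Group Relative Policy Optimization is to score each sampled response against the other responses drawn for the same prompt. The sketch below shows that normalization together with a composite reward. The equal weights, the function names, and the use of the population standard deviation are assumptions for illustration; the article names the four reward components but not how they are combined.

```python
from statistics import mean, pstdev


def composite_reward(accuracy, format_ok, consistency, depth,
                     weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four components named in the article.

    The weights are hypothetical; each component is assumed to lie in [0, 1].
    """
    parts = (accuracy, format_ok, consistency, depth)
    return sum(w * p for w, p in zip(weights, parts))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to the same prompt, scored on each component.
group = [
    composite_reward(1.0, 1.0, 1.0, 0.5),
    composite_reward(1.0, 1.0, 0.0, 0.5),
    composite_reward(0.0, 1.0, 0.0, 0.0),
    composite_reward(0.0, 0.0, 0.0, 0.0),
]
advantages = group_relative_advantages(group)
print([round(a, 3) for a in advantages])
```

Responses that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged, so no separate value model is needed to provide a baseline.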
