Microsoft AI has released MAI-Transcribe-1.5, the second iteration of its in-house speech-to-text model family. Designed for production-grade transcription workloads, the model adds expanded language support, faster inference on long-form audio, and a new keyword biasing feature. It is available through Microsoft’s Azure AI Foundry platform and is being integrated into enterprise products including Copilot, Teams, GitHub, and Dynamics 365 Contact Center.
Enhanced Performance and Language Coverage
MAI-Transcribe-1.5 supports 43 languages within a single system, up from the 25 languages covered by its predecessor. The expansion includes 10 South Asian languages (such as Bengali, Telugu, and Tamil) and eight European languages (including Greek, Ukrainian, and Catalan). Microsoft reports that these additions were achieved without compromising the model's overall accuracy.
The model achieves a 2.4% word error rate (WER) on the Artificial Analysis leaderboard, where it currently ranks third. Microsoft also claims best-in-class accuracy across all 43 languages on the FLEURS multilingual transcription benchmark.
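For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition. The function name and the simple lowercase-and-split normalization are assumptions made for this example and do not describe how the Artificial Analysis leaderboard or FLEURS scores are actually computed.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization for this sketch: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"WER: {wer:.1%}")
```

On this definition, a 2.4% WER corresponds to roughly 24 word errors per 1,000 reference words.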
Speed and Efficiency in Long-Form Transcription
A primary focus of this release is long-form audio processing. MAI-Transcribe-1.5 can transcribe an hour of audio in under 15 seconds, which Microsoft says is up to 5.7 times faster than the previous MAI-Transcribe-1 model. Compared with other industry models, such as Gemini 3.1, Scribe v2, and GPT-4o-Transcribe, Microsoft reports up to 5 times faster inference on long audio files.
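To put the headline figure in perspective, the arithmetic below converts "an hour of audio in under 15 seconds" into a speed factor relative to real time. This is a back-of-the-envelope sketch: the function and variable names are made up for illustration, and because the article quotes bounds ("under 15 seconds", "up to 5.7 times"), the derived numbers are rough estimates rather than measurements.

```python
def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are processed per second of compute time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


AUDIO_SECONDS = 60 * 60        # one hour of audio
PROCESSING_SECONDS = 15        # upper bound quoted in the article
PREDECESSOR_SPEEDUP = 5.7      # "up to" figure quoted for MAI-Transcribe-1

factor = speed_factor(AUDIO_SECONDS, PROCESSING_SECONDS)
print(f"At least {factor:.0f}x faster than real time")

# If the 5.7x figure applied to this same workload (an assumption), the
# previous model would have needed roughly this long for the same hour.
implied_predecessor_seconds = PROCESSING_SECONDS * PREDECESSOR_SPEEDUP
print(f"Implied predecessor time: about {implied_predecessor_seconds:.1f} s")
```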
Domain-Aware Keyword Biasing
To address the challenges of transcribing niche vocabulary, Microsoft has introduced entity biasing. Users can supply a list of up to 200 domain-specific keywords, such as product names, medical terms, or internal acronyms. The model uses shared context to determine when to apply these biases, rather than forcing matches blindly. Microsoft reports that this feature leads to a 30% WER reduction on the FLEURS benchmark.
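As a rough illustration of working within the 200-keyword cap, the sketch below cleans and de-duplicates a keyword list before it would be submitted. It is a hypothetical client-side helper, not part of any Microsoft API: the names `prepare_keywords` and `relative_wer_reduction` are invented for this example, and the article does not describe how keywords are actually passed to the service. The second function assumes the 30% figure is a relative reduction, which the article does not state explicitly.

```python
MAX_KEYWORDS = 200  # cap quoted in the article


def prepare_keywords(raw_terms, limit=MAX_KEYWORDS):
    """Hypothetical helper: trim, de-duplicate, and size-check a keyword list."""
    seen = set()
    cleaned = []
    for term in raw_terms:
        term = term.strip()
        key = term.casefold()  # treat "Contoso" and "contoso" as the same entry
        if not term or key in seen:
            continue
        seen.add(key)
        cleaned.append(term)
    if len(cleaned) > limit:
        raise ValueError(f"{len(cleaned)} keywords supplied, limit is {limit}")
    return cleaned


def relative_wer_reduction(baseline_wer: float, biased_wer: float) -> float:
    """Fraction by which WER drops relative to the baseline."""
    return (baseline_wer - biased_wer) / baseline_wer


keywords = prepare_keywords(["Contoso CRM", " contoso crm ", "", "HbA1c", "SKU"])
print(keywords)

# Illustrative numbers only: a drop from 10.0% to 7.0% WER is a 30% relative reduction.
print(f"{relative_wer_reduction(10.0, 7.0):.0%}")
```

Rejecting an oversized list outright, rather than silently truncating it, keeps the caller aware of which terms would have been dropped.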
The model has two notable limitations. It does not yet support speaker diarization, so it cannot provide speaker labels, and it lacks a native streaming API, which limits its use in real-time scenarios. Despite these constraints, the model is positioned to support enterprise applications including video captioning, accessibility tools, call center analytics, and voice agent development. It also includes automatic language identification, so it can detect the spoken language without manual configuration.