Automatic speech recognition (ASR) systems often struggle to accurately transcribe the speech of individuals with dysarthria or other atypical speech patterns. This paper investigates whether modern audio-language models transcribe such speech more accurately when given additional clinical context, such as diagnosis labels, speech severity ratings, and detailed clinical descriptions, at inference time. The authors introduce a new benchmark based on the Speech Accessibility Project (SAP) to test whether these models can effectively leverage this information to produce more accurate transcriptions.
The Challenge of Multimodal Context
The researchers tested nine different audio-language models to see if providing them with clinical context would improve their transcription accuracy. Surprisingly, the study found that these models do not meaningfully utilize this information. In many cases, providing diagnosis-informed or clinically detailed prompts resulted in negligible improvements and sometimes even degraded the system's performance, as measured by the word error rate (WER).
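Word error rate, the metric used throughout the study, is the minimum number of word-level substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. A minimal sketch (not the authors' evaluation code) using the standard Levenshtein dynamic program:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length,
    computed as Levenshtein edit distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why even small absolute changes in WER can matter for atypical speech.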
Improving Performance with Fine-Tuning
To address these shortcomings, the authors moved beyond simple prompting and explored context-dependent fine-tuning. By using LoRA (Low-Rank Adaptation) with a mixture of clinical prompt formats, they achieved a significant improvement in accuracy. This approach resulted in a word error rate of 0.066, representing a 52% relative reduction compared to the frozen baseline models. Notably, this method maintained high performance even when clinical context was not available.
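LoRA keeps the pretrained weight matrix W frozen and learns only a low-rank update: the effective weight becomes W + (alpha / r) * B @ A, where A and B have rank r far smaller than the layer dimensions. A minimal NumPy sketch of this idea (an illustration of the general technique, not the paper's training code):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r):
    """Linear layer with a LoRA update.

    W: frozen pretrained weights, shape (d_out, d_in)
    A: trainable, shape (r, d_in); B: trainable, shape (d_out, r)
    Only A and B receive gradients during fine-tuning.
    """
    delta = (alpha / r) * (B @ A)  # low-rank update, same shape as W
    return x @ (W + delta).T

# B is initialized to zero, so fine-tuning starts exactly at the
# frozen model's behavior and only drifts as A and B are trained.
rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = np.zeros((d_out, r))
x = rng.normal(size=(3, d_in))
y = lora_forward(x, W, A, B, alpha=16, r=r)
```

Because only A and B are updated, the same backbone can be fine-tuned on a mixture of prompt formats (with and without clinical context) cheaply, which is consistent with the robustness the authors report when context is missing.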
Key Findings and Impact
The study’s subgroup analyses showed that the gains were largest for speakers with Down syndrome and those with mild speech severity. By demonstrating that current models fail to naturally incorporate clinical context, the authors have established a new testbed for future research. This work provides a clearer understanding of the current limitations in ASR technology and offers a path forward for developing more inclusive and accessible speech recognition systems.