When Audio-Language Models Fail to Leverage Multimodal Context

Key Takeaways

  • Automatic speech recognition (ASR) systems remain brittle on dysarthric and other atypical speech.
  • Recent audio-language models raise the possibility of improving performance by conditioning on additional clinical context at inference time, but across nine models, diagnosis-informed and clinically detailed prompts yield negligible improvements and often degrade word error rate (WER).
  • Context-dependent LoRA fine-tuning with a mixture of clinical prompt formats achieves a WER of 0.066, a 52% relative reduction over the frozen baseline, while preserving performance when context is unavailable.
  • Subgroup analyses reveal significant gains for Down syndrome and mild-severity speakers.
  • These results clarify where current models fall short and provide a testbed for measuring progress toward more inclusive ASR.
Paper Abstract

Automatic speech recognition (ASR) systems remain brittle on dysarthric and other atypical speech. Recent audio-language models raise the possibility of improving performance by conditioning on additional clinical context at inference time, but it is unclear whether these models can make use of such information. We introduce a benchmark built on the Speech Accessibility Project (SAP) dataset that tests whether diagnosis labels, clinician-derived speech ratings, and progressively richer clinical descriptions improve transcription accuracy for dysarthric speech. Across matched comparisons on nine models, we find that current models do not meaningfully use this context: diagnosis-informed and clinically detailed prompts yield negligible improvements and often degrade word error rate. We complement the prompting analysis with context-dependent fine-tuning, showing that LoRA adaptation with a mixture of clinical prompt formats achieves a WER of 0.066, a 52% relative reduction over the frozen baseline, while preserving performance when context is unavailable. Subgroup analyses reveal significant gains for Down syndrome and mild-severity speakers. These results clarify where current models fall short and provide a testbed for measuring progress toward more inclusive ASR.

Automatic speech recognition (ASR) systems often struggle to accurately transcribe the speech of individuals with dysarthria or other atypical speech patterns. This paper investigates whether modern audio-language models can improve their performance by using additional clinical context—such as diagnosis labels, speech severity ratings, and detailed clinical descriptions—provided during inference. The authors introduce a new benchmark based on the Speech Accessibility Project (SAP) to test if these models can effectively leverage this information to produce more accurate transcriptions.
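To make "progressively richer clinical descriptions" concrete, the sketch below shows one plausible way the context tiers could be rendered as prompts. The field names and wording are illustrative assumptions, not the benchmark's actual prompt formats.

```python
# Illustrative sketch of progressively richer clinical-context prompts.
# Field names and wording are assumptions; the benchmark's actual
# prompt formats may differ.
BASE_INSTRUCTION = "Transcribe the following speech."

def build_prompt(level: str, meta: dict) -> str:
    """Build a transcription prompt at one of three context levels."""
    if level == "none":
        return BASE_INSTRUCTION
    if level == "diagnosis":
        return f"{BASE_INSTRUCTION} The speaker's diagnosis is {meta['diagnosis']}."
    if level == "clinical":
        return (
            f"{BASE_INSTRUCTION} The speaker's diagnosis is {meta['diagnosis']}, "
            f"with clinician-rated severity {meta['severity']}. "
            f"Clinical notes: {meta['notes']}"
        )
    raise ValueError(f"unknown context level: {level}")
```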

The Challenge of Multimodal Context

The researchers tested nine different audio-language models to see if providing them with clinical context would improve their transcription accuracy. Surprisingly, the study found that these models do not meaningfully use this information: diagnosis-informed and clinically detailed prompts yielded negligible improvements and often degraded performance, as measured by word error rate (WER).
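A matched comparison of this kind could be scored along the lines below, pairing each utterance with a prompt per condition and computing corpus-level WER. The `transcribe` callable is a hypothetical wrapper around an audio-language model, not an API from the paper; only the WER metric, computed here with the jiwer package, is standard.

```python
# Sketch of a matched prompted-vs-unprompted WER comparison.
# `transcribe(audio_path, prompt)` is a hypothetical model wrapper;
# WER comes from the `jiwer` package.
import jiwer

def score_conditions(utterances, prompts_by_condition, transcribe):
    """Compute corpus-level WER for each prompting condition.

    utterances: iterable of (audio_path, reference_text, clinical_meta).
    prompts_by_condition: maps a condition name to a prompt builder.
    """
    wers = {}
    for condition, make_prompt in prompts_by_condition.items():
        refs, hyps = [], []
        for audio_path, reference, meta in utterances:
            refs.append(reference)
            hyps.append(transcribe(audio_path, make_prompt(meta)))
        # jiwer aggregates edit operations over the whole list of pairs.
        wers[condition] = jiwer.wer(refs, hyps)
    return wers
```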

Improving Performance with Fine-Tuning

To address these shortcomings, the authors moved beyond simple prompting and explored context-dependent fine-tuning. Using LoRA (Low-Rank Adaptation) with a mixture of clinical prompt formats, they reduced the word error rate to 0.066, a 52% relative reduction over the frozen baseline. Notably, this method preserved performance even when clinical context was unavailable.
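A rough sketch of how such context-dependent LoRA adaptation might be set up with the Hugging Face peft library is shown below. The LoRA rank, target modules, and prompt templates are assumptions for illustration, not the paper's configuration; the "mixture of clinical prompt formats" is approximated by sampling one template per training example, including an empty-context template so the model stays usable when no clinical information is available.

```python
# Sketch of context-dependent LoRA fine-tuning with Hugging Face `peft`.
# Hyperparameters and prompt templates below are illustrative assumptions,
# not the configuration used in the paper.
import random
from peft import LoraConfig, get_peft_model

# One template is sampled per example; the no-context template keeps the
# model robust when clinical information is missing at inference time.
PROMPT_TEMPLATES = [
    "Transcribe this speech.",                          # no context
    "Transcribe this speech. Diagnosis: {diagnosis}.",  # label only
    "Transcribe this speech. Diagnosis: {diagnosis}. "
    "Severity: {severity}. Notes: {notes}",             # rich context
]

def format_example(example: dict) -> str:
    """Fill a randomly chosen template with the example's clinical fields."""
    template = random.choice(PROMPT_TEMPLATES)
    return template.format(**example)

def add_lora_adapters(base_model):
    """Wrap a frozen base model with trainable low-rank adapters."""
    config = LoraConfig(
        r=16,                                 # low-rank dimension (assumed)
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    )
    return get_peft_model(base_model, config)
```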

Key Findings and Impact

The study’s subgroup analyses highlighted that these improvements were particularly effective for speakers with Down syndrome and those with mild speech severity. By identifying that current models fail to naturally incorporate clinical context, the authors have established a new testbed for future research. This work provides a clearer understanding of the current limitations in ASR technology and offers a path forward for developing more inclusive and accessible speech recognition systems.
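A subgroup breakdown of this kind could be computed along the lines below; the column names and per-utterance results layout are assumptions for illustration.

```python
# Sketch of a per-subgroup WER comparison with pandas. Column names
# (wer_baseline, wer_finetuned, and the grouping column) are assumed.
import pandas as pd

def subgroup_deltas(results: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Mean WER per subgroup, plus the relative reduction from fine-tuning."""
    summary = results.groupby(group_col)[["wer_baseline", "wer_finetuned"]].mean()
    summary["relative_reduction"] = (
        1.0 - summary["wer_finetuned"] / summary["wer_baseline"]
    )
    return summary.sort_values("relative_reduction", ascending=False)

# Usage: subgroup_deltas(df, "diagnosis") or subgroup_deltas(df, "severity")
```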
