Key Takeaways

  • We investigate whether acoustic emotion recognition models can serve as proxies for the Pathos dimension in political speech analysis, as operationalised by the TRUST multi-agent large language model (LLM) pipeline.
  • Spearman rank correlations reveal that Gemini Valence correlates strongly with TRUST-Pathos (rho = +0.664, p < 0.001), whereas emotion2vec Valence does not (rho = +0.097, p = 0.499).
  • Future work will extend this approach to video-based analysis incorporating facial expression and gaze.
Paper Abstract

We investigate whether acoustic emotion recognition models can serve as proxies for the Pathos dimension in political speech analysis, as operationalised by the TRUST multi-agent large language model (LLM) pipeline. Using a Bundestag plenary speech by Felix Banaszak (51 segments, 245 s) as a case study, we compare three analysis modalities: (1) emotion2vec_plus_large, an acoustic speech emotion recognition (SER) model whose continuous Arousal and Valence values are derived via post-hoc Russell Circumplex projection; (2) Gemini 2.5 Flash, an LLM analysing the full speech audio together with its transcript in an open-ended, context-aware fashion; and (3) TRUST-Pathos scores from a three-advocate LLM supervisor ensemble. Spearman rank correlations reveal that Gemini Valence correlates strongly with TRUST-Pathos (rho = +0.664, p < 0.001), whereas emotion2vec Valence does not (rho = +0.097, p = 0.499). We further demonstrate, via a systematic quality evaluation of the Berlin Database of Emotional Speech (EMO-DB) using Gemini in an open-ended annotation paradigm, that standard SER benchmark corpora suffer from acted speech, cultural bias, and category incompatibility. Our results suggest that LLM-based multimodal analysis captures semantically defined political emotion substantially better than acoustic models alone, while acoustic features remain informative for low-level Arousal estimation. Future work will extend this approach to video-based analysis incorporating facial expression and gaze.

Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models
This research investigates how to better measure "Pathos", the emotional appeal in political speech, using modern technology. Traditional acoustic models focus on the sound of a voice (such as pitch and tone); this study explores whether Large Language Models (LLMs) can provide a more accurate analysis by understanding the meaning, context, and rhetorical strategies behind a politician's words. By comparing both approaches against TRUST-Pathos, a specialized LLM-based political scoring pipeline, the paper highlights the limitations of relying solely on sound-based emotion recognition for complex political discourse.

Comparing Analysis Methods

The study uses a 245-second Bundestag plenary speech by German politician Felix Banaszak, split into 51 segments, as a case study to compare three ways of measuring emotion:

  • Acoustic Emotion Recognition (SER): Using the emotion2vec_plus_large model, which analyzes the audio signal alone. Its continuous arousal (intensity) and valence (positivity/negativity) values are derived post hoc via a Russell Circumplex projection.

  • Multimodal LLM Analysis: Using Gemini 2.5 Flash, which processes both the audio and the transcript to understand the speech in a context-aware, open-ended way.

  • TRUST-Pathos Scores: A specialized framework that uses a three-advocate LLM supervisor ensemble to score how unifying or divisive a political statement is.
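
The abstract mentions a post-hoc Russell Circumplex projection but does not spell it out. One common way to do such a projection is to take a probability-weighted average of fixed (valence, arousal) anchor points, one per emotion category. The sketch below illustrates that idea only; the anchor coordinates, the label names, and the function name `project_to_circumplex` are assumptions made for this example and are not taken from the paper or from the emotion2vec library.

```python
# Hypothetical (valence, arousal) anchors on Russell's circumplex, each in [-1, 1].
# These coordinates are illustrative assumptions, not values from the paper.
CIRCUMPLEX = {
    "angry": (-0.6, 0.8),
    "fearful": (-0.7, 0.6),
    "disgusted": (-0.7, 0.3),
    "sad": (-0.7, -0.5),
    "neutral": (0.0, 0.0),
    "happy": (0.8, 0.5),
    "surprised": (0.3, 0.8),
}


def project_to_circumplex(probs):
    """Map categorical emotion probabilities to a (valence, arousal) pair.

    `probs` maps emotion labels to probabilities for one speech segment.
    Labels without an anchor are ignored and the rest are renormalised.
    """
    known = {label: p for label, p in probs.items() if label in CIRCUMPLEX}
    total = sum(known.values())
    if total == 0:
        # No usable probability mass: fall back to the neutral origin.
        return 0.0, 0.0
    valence = sum(p * CIRCUMPLEX[label][0] for label, p in known.items()) / total
    arousal = sum(p * CIRCUMPLEX[label][1] for label, p in known.items()) / total
    return valence, arousal
```

Under these assumed anchors, a segment classified as mostly "angry" lands at high arousal and negative valence regardless of what the words mean, which is why a projection of this kind can only reflect how the voice sounds.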

Why Meaning Matters More Than Sound

The results show a clear gap between sound-based analysis and semantic understanding. Gemini's valence ratings correlated strongly with the TRUST-Pathos scores (Spearman rho = +0.664, p < 0.001), meaning the LLM captured the political "flavor" of the speech. In contrast, the valence values from the acoustic model (emotion2vec) showed no statistically significant correlation with the political scores (rho = +0.097, p = 0.499).
The researchers explain this by pointing to the difference between "acoustic valence" and "political-rhetorical valence." For example, a politician might use a high-energy, aggressive tone to deliver a sarcastic remark. An acoustic model might mistake this for "happiness" because of the high energy, whereas an LLM correctly identifies the statement as disapproval.
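
For readers unfamiliar with the statistic, Spearman's rho is the Pearson correlation computed on ranks rather than raw values, so it measures whether two per-segment score series rise and fall together without assuming a linear relationship. The following is a minimal standard-library sketch of that calculation; the function names are chosen for this example, and it does not reproduce the paper's analysis code or compute p-values (in practice, `scipy.stats.spearmanr` returns both the coefficient and a p-value).

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

Applied to the study's setting, `x` would hold one model's valence value per segment and `y` the TRUST-Pathos score for the same segments.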

Evaluating Standard Benchmarks

The authors also ran a systematic quality evaluation of the Berlin Database of Emotional Speech (EMO-DB), a standard SER benchmark corpus, using Gemini in an open-ended annotation paradigm. They found that the corpus has significant structural issues: it relies on acted speech rather than natural conversation, it carries cultural bias, and its emotion categories are hard to apply consistently (for example, "disgust" is difficult to distinguish without visual cues). These findings suggest that many current emotion-recognition models are trained and evaluated on data that does not reflect how people actually speak in real-world political settings.

Future Directions

The study concludes that while acoustic models remain useful for measuring low-level physical intensity (arousal), they are insufficient for capturing the nuances of political communication. The researchers suggest that future work should combine LLM-based semantic analysis with acoustic features and, eventually, video analysis such as tracking facial expressions and gaze, to create a more complete picture of how political figures communicate with their audience.
