ProSarc: Prosody-Aware Sarcasm Recognition Framework via Temporal Prosodic Incongruity
ProSarc is an audio-only artificial intelligence framework designed to identify sarcasm in spoken language by analyzing how prosody (the rhythm, stress, and intonation of speech) changes over time. While many existing systems rely on text or visual cues to detect sarcasm, ProSarc focuses exclusively on the acoustic signal. It operates on the hypothesis that sarcasm is characterized by a "temporal prosodic incongruity": a mismatch between the speaker's local, moment-to-moment vocal dynamics and their overall emotional baseline for the entire utterance.
How the Framework Works
The model processes audio through two parallel paths to capture different aspects of speech. The "Global Emotion Encoder" calculates utterance-level statistics, such as average pitch, energy, and speaking rate, to establish a baseline for the speaker's emotional tone. Simultaneously, the "Temporal Prosody Encoder" uses a self-supervised model (such as WavLM or HuBERT) combined with a bidirectional LSTM to track how prosodic features shift frame-by-frame.
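The contrast between the two paths can be illustrated with a minimal sketch. It is not the ProSarc implementation: the real temporal path uses a self-supervised model and a bidirectional LSTM, whereas this example uses hand-computed statistics on a toy pitch contour. The function names `global_baseline` and `local_deviation` and all numeric values are assumptions made for illustration, and speaking rate is omitted for brevity.

```python
from statistics import mean, pstdev

def global_baseline(pitch_hz, energy):
    """Utterance-level summary: one mean and one spread per feature."""
    return {
        "pitch_mean": mean(pitch_hz),
        "pitch_std": pstdev(pitch_hz),
        "energy_mean": mean(energy),
        "energy_std": pstdev(energy),
    }

def local_deviation(pitch_hz, baseline):
    """Per-frame z-score of pitch against the utterance-level baseline."""
    std = baseline["pitch_std"] or 1.0  # fall back to 1.0 on a flat contour
    return [(p - baseline["pitch_mean"]) / std for p in pitch_hz]

# Toy frame-level values, invented for this example (not real data).
pitch = [200.0, 210.0, 190.0, 200.0]
energy = [0.5, 0.6, 0.4, 0.5]

base = global_baseline(pitch, energy)
z = local_deviation(pitch, base)
```

Here `base` plays the role of the global path (one summary per utterance) and `z` plays the role of the temporal path (one value per frame), which is the pair of views the framework compares.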
A "Prosodic Incongruity Analyzer" then compares these two paths. It produces a scalar score that measures the degree of mismatch between the local dynamics and the global baseline. This score acts as a gate, adaptively weighting the information from both paths to make a final classification. Additionally, the model uses an attention-based mechanism to identify the specific moments in an utterance where sarcastic cues are most likely occurring, without needing frame-by-frame human labels.
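The summary does not give the formula behind the incongruity score or the gate, so the following is only a sketch under stated assumptions: the two paths are taken to produce vectors of equal length, the mismatch is measured with cosine distance rescaled to [0, 1], and a higher score shifts weight toward the local path. All function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def incongruity_score(local_vec, global_vec):
    """Scalar in [0, 1]: 0 when the paths agree, 1 when they are opposed."""
    return (1.0 - cosine_similarity(local_vec, global_vec)) / 2.0

def gated_fusion(local_vec, global_vec):
    """Blend the two paths, using the incongruity score as the gate (assumed)."""
    g = incongruity_score(local_vec, global_vec)
    fused = [g * lv + (1.0 - g) * gv for lv, gv in zip(local_vec, global_vec)]
    return g, fused

# Orthogonal toy vectors give a gate of 0.5 and an even blend.
g, fused = gated_fusion([1.0, 0.0], [0.0, 1.0])
```

In a trained model the score and the gate would be learned rather than fixed, but the shape of the computation, one scalar steering a weighted combination of two representations, matches the description above.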
Measuring Uncertainty and Confidence
ProSarc incorporates Monte Carlo dropout to provide uncertainty estimates alongside its predictions. By running the model several times at inference with a different random subset of units dropped on each pass, the system can calculate the variance of its predictions. High variance indicates that the model is uncertain about its classification, which the researchers found aligns with human-perceived ambiguity in speech. This allows the system to signal when a sarcastic utterance is particularly difficult to interpret, providing a layer of transparency to its decision-making process.
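The idea behind Monte Carlo dropout can be shown on a toy model. This sketch applies dropout to the inputs of a single logistic unit, whereas a real network would apply it inside its hidden layers; the weights, features, dropout rate, and number of passes are all invented for illustration and are not taken from ProSarc.

```python
import math
import random
from statistics import mean, pvariance

def predict_with_dropout(features, weights, bias, drop_p, rng):
    """One stochastic forward pass: randomly zero inputs, rescale the rest."""
    keep = 1.0 - drop_p
    z = bias
    for x, w in zip(features, weights):
        if rng.random() < keep:
            z += (x / keep) * w  # inverted-dropout scaling
    return 1.0 / (1.0 + math.exp(-z))

def mc_dropout(features, weights, bias, drop_p=0.3, passes=50, seed=0):
    """Mean prediction and variance over repeated stochastic passes."""
    rng = random.Random(seed)
    probs = [
        predict_with_dropout(features, weights, bias, drop_p, rng)
        for _ in range(passes)
    ]
    return mean(probs), pvariance(probs)

# Hypothetical feature vector and weights.
mean_p, var_p = mc_dropout([1.0, -0.5, 2.0], [0.8, 0.4, -0.3], 0.1)
```

The mean serves as the prediction and the variance as the uncertainty signal; with `drop_p=0.0` every pass is identical and the variance collapses to zero.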
Performance and Generalization
The framework was evaluated on four datasets: scripted television dialogue (MUStARD and MUStARD++), spontaneous podcast speech (PodSarc), and cross-lingual German speech (MuSaG). ProSarc consistently outperformed previous audio-only methods across all benchmarks. Statistical validation, including a Wilcoxon signed-rank test, indicated that the model’s focus on incongruity significantly improves detection accuracy. Exploratory analysis also revealed that the model tends to identify sarcastic onsets in the latter portion of an utterance, suggesting a temporal pattern that distinguishes sarcastic speech from sincere utterances.
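For readers unfamiliar with the Wilcoxon signed-rank test, the sketch below computes its rank sums for paired samples. The summary does not say what was paired (for example folds, seeds, or datasets), so the inputs here are made-up integers and not reported results; the function name is hypothetical.

```python
def wilcoxon_rank_sums(scores_a, scores_b):
    """Return (W_plus, W_minus) for paired samples, dropping zero differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

# Made-up paired scores for two systems (illustration only).
w_plus, w_minus = wilcoxon_rank_sums([5, 7, 3, 9], [4, 5, 6, 5])
```

The test statistic is usually the smaller of the two sums, and a p-value is then read from its null distribution; in practice a library routine such as `scipy.stats.wilcoxon` handles that step.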