Towards Speaker-Unknown Emotion Recognition in Conversation via Progressive Contrastive Deep Supervision

  • Siyuan Shen
  • Feng Liu*
  • Hanyang Wang
  • Aimin Zhou*

*Corresponding author for this work

Research output: Contribution to journal > Article > peer-review

5 Scopus citations

Abstract

Emotion recognition in conversation (ERC) has attracted increasing attention for perceiving user emotion in practical conversational applications. Because conversational utterances are spoken alternately by different speakers, most studies leverage speaker information based on gold speaker labels. In this work, we challenge the existing paradigm of relying on available speaker labels with a more realistic scenario in which the speaker identity of each utterance is unknown during inference. We propose Progressive Contrastive Deep Supervision (PCDS) for multimodal emotion recognition in conversation, which incorporates speaker diarization and emotion recognition into one unified framework. To facilitate joint task learning, we inject speaker and emotion bias into the network progressively via contrastive deep supervision, with the task-irrelevant contrast serving as the intermediate transition. To obtain explicit speaker dependency, we propose a speaker contrast and clustering (SCC) module that can partition speakers into groups even when neither the speaker labels nor the number of speakers is known a priori. Experiments on two ERC benchmarks, IEMOCAP and MELD, demonstrate the effectiveness of the proposed method. We also show that progressive contrastive deep supervision helps reconcile the underlying tension between speaker diarization and emotion recognition.
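
The abstract does not state the loss formulation, so the following is only a rough illustration of what a label-driven contrastive objective can look like. It is a generic supervised contrastive loss over utterance embeddings, not the PCDS or SCC implementation. The function names, the temperature value, and the toy embeddings are all assumptions made for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (hypothetical helper).

    For each anchor, utterances sharing its label are positives and all
    other utterances are contrasted against. Anchors without any positive
    are skipped. Returns the mean loss over the remaining anchors.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled similarities to every other utterance.
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits.values())
        log_denom = peak + math.log(
            sum(math.exp(v - peak) for v in logits.values())
        )
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


if __name__ == "__main__":
    # Toy 2-D embeddings: the first two and the last two point the same way.
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(supervised_contrastive_loss(toy, [0, 0, 1, 1]))  # labels match geometry
    print(supervised_contrastive_loss(toy, [0, 1, 0, 1]))  # labels cut across it
```

In this toy setting the loss is lower when the labels agree with the geometry of the embeddings, which is the property a contrastive objective of this kind rewards. Grouping speakers when their number is unknown, as the SCC module is described as doing, would need a further clustering step that this sketch does not cover.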

Original language: English
Pages (from-to): 2261-2273
Number of pages: 13
Journal: IEEE Transactions on Affective Computing
Volume: 16
Issue number: 3
State: Published - 2025

Keywords

  • Emotion recognition in conversations
  • contrastive learning
  • deep supervision
  • speaker diarization
