TY - JOUR
T1 - Towards Speaker-Unknown Emotion Recognition in Conversation via Progressive Contrastive Deep Supervision
AU - Shen, Siyuan
AU - Liu, Feng
AU - Wang, Hanyang
AU - Zhou, Aimin
N1 - Publisher Copyright:
© 2010-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Emotion recognition in conversation (ERC) has attracted increasing attention for perceiving user emotion in practical conversational applications. Because conversational utterances are spoken alternately by different speakers, most studies leverage speaker information based on gold speaker labels. In this work, we challenge the existing paradigm of relying on available speaker labels with a more realistic scenario in which the speaker identity of each utterance is unknown during inference. We propose Progressive Contrastive Deep Supervision (PCDS) for multimodal emotion recognition in conversation, incorporating speaker diarization and emotion recognition into one unified framework. To facilitate joint task learning, we inject speaker and emotion bias into the network progressively via contrastive deep supervision, with a task-irrelevant contrast serving as the intermediate transition. To obtain explicit speaker dependency, we propose a speaker contrast and clustering (SCC) module that can partition speakers into groups even when neither the speaker labels nor the number of speakers is known a priori. Experiments on two ERC benchmarks, IEMOCAP and MELD, demonstrate the effectiveness of the proposed method. We also show that progressive contrastive deep supervision helps reconcile the underlying tension between speaker diarization and emotion recognition.
AB - Emotion recognition in conversation (ERC) has attracted increasing attention for perceiving user emotion in practical conversational applications. Because conversational utterances are spoken alternately by different speakers, most studies leverage speaker information based on gold speaker labels. In this work, we challenge the existing paradigm of relying on available speaker labels with a more realistic scenario in which the speaker identity of each utterance is unknown during inference. We propose Progressive Contrastive Deep Supervision (PCDS) for multimodal emotion recognition in conversation, incorporating speaker diarization and emotion recognition into one unified framework. To facilitate joint task learning, we inject speaker and emotion bias into the network progressively via contrastive deep supervision, with a task-irrelevant contrast serving as the intermediate transition. To obtain explicit speaker dependency, we propose a speaker contrast and clustering (SCC) module that can partition speakers into groups even when neither the speaker labels nor the number of speakers is known a priori. Experiments on two ERC benchmarks, IEMOCAP and MELD, demonstrate the effectiveness of the proposed method. We also show that progressive contrastive deep supervision helps reconcile the underlying tension between speaker diarization and emotion recognition.
KW - Emotion recognition in conversations
KW - contrastive learning
KW - deep supervision
KW - speaker diarization
UR - https://www.scopus.com/pages/publications/105002119225
U2 - 10.1109/TAFFC.2025.3558222
DO - 10.1109/TAFFC.2025.3558222
M3 - Article
AN - SCOPUS:105002119225
SN - 1949-3045
VL - 16
SP - 2261
EP - 2273
JO - IEEE Transactions on Affective Computing
JF - IEEE Transactions on Affective Computing
IS - 3
ER -