Temporal Shift Module with Pretrained Representations for Speech Emotion Recognition

  • Siyuan Shen
  • Feng Liu
  • Hanyang Wang
  • Yunlong Wang
  • Aimin Zhou*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

9 Scopus citations

Abstract

Recent advances in self-supervised models have led to effective pretrained speech representations in downstream speech emotion recognition tasks. However, previous research has primarily focused on exploiting pretrained representations by simply adding a linear head on top of the pretrained model, while overlooking the design of the downstream network. In this paper, we propose a temporal shift module with pretrained representations to integrate channel-wise information without introducing additional parameters or floating-point operations. By incorporating the temporal shift module, we developed corresponding shift variants for 3 baseline building blocks: ShiftCNN, ShiftLSTM, and Shiftformer. Furthermore, we propose 2 technical strategies, placement and proportion of shift, to balance the trade-off between mingling and misalignment. Our family of temporal shift models outperforms state-of-the-art methods on the benchmark Interactive Emotional Dyadic Motion Capture dataset in fine-tuning and feature-extraction scenarios. In addition, through comprehensive experiments using wav2vec 2.0 and Hidden-Unit Bidirectional Encoder Representations from Transformers representations, we identified the behavior of the temporal shift module in downstream models, which may serve as an empirical guideline for future exploration of channel-wise shift and downstream network design.
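
The abstract does not spell out the shift operation itself, so the following is only a minimal sketch of a generic channel-wise temporal shift, assuming the common bidirectional design in which one group of channels is delayed by one frame, another group is advanced by one frame, and the rest stay in place. The function name `temporal_shift`, the `shift_ratio` parameter (standing in for the "proportion of shift"), and the zero padding at the sequence ends are illustrative assumptions, not the authors' implementation. It is written in plain Python so that it runs without dependencies.

```python
def temporal_shift(frames, shift_ratio=0.25):
    """Shift a fraction of channels along the time axis (illustrative sketch).

    frames: list of T frames, each a list of C channel values.
    shift_ratio: assumed total fraction of channels to shift; half of them
        move forward in time and half move backward.
    Vacated positions at the sequence ends are zero-filled (an assumption).
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = int(num_channels * shift_ratio / 2)  # channels per direction

    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t - 1  # forward shift: frame t takes the value of frame t - 1
            elif c < 2 * fold:
                src = t + 1  # backward shift: frame t takes the value of frame t + 1
            else:
                src = t      # remaining channels are left untouched
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
    return shifted


# Toy input: 3 frames with 4 channels each.
frames = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
]
shifted = temporal_shift(frames, shift_ratio=0.5)
```

The operation only moves existing values between neighbouring frames, so it has no learnable weights and performs no arithmetic on the features. A larger `shift_ratio` mixes more information across time steps but leaves fewer channels aligned with their original frame, which is one way to read the mingling versus misalignment trade-off mentioned in the abstract.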

Original language: English
Article number: 0073
Journal: Intelligent Computing
Volume: 3
State: Published - 2024
