TY - JOUR
T1 - Temporal Shift Module with Pretrained Representations for Speech Emotion Recognition
AU - Shen, Siyuan
AU - Liu, Feng
AU - Wang, Hanyang
AU - Wang, Yunlong
AU - Zhou, Aimin
N1 - Publisher Copyright:
© 2024 Siyuan Shen et al.
PY - 2024
Y1 - 2024
N2 - Recent advances in self-supervised models have led to effective pretrained speech representations in downstream speech emotion recognition tasks. However, previous research has primarily focused on exploiting pretrained representations by simply adding a linear head on top of the pretrained model, while overlooking the design of the downstream network. In this paper, we propose a temporal shift module with pretrained representations to integrate channel-wise information without introducing additional parameters or floating-point operations per second. By incorporating the temporal shift module, we developed corresponding shift variants for 3 baseline building blocks: ShiftCNN, ShiftLSTM, and Shiftformer. Furthermore, we propose 2 technical strategies, placement and proportion of shift, to balance the trade-off between mingling and misalignment. Our family of temporal shift models outperforms state-of-the-art methods on the benchmark Interactive Emotional Dyadic Motion Capture dataset in fine-tuning and feature-extraction scenarios. In addition, through comprehensive experiments using wav2vec 2.0 and Hidden-Unit Bidirectional Encoder Representations from Transformers representations, we identified the behavior of the temporal shift module in downstream models, which may serve as an empirical guideline for future exploration of channel-wise shift and downstream network design.
UR - https://www.scopus.com/pages/publications/85199612576
DO - 10.34133/icomputing.0073
M3 - Article
AN - SCOPUS:85199612576
SN - 2771-5892
VL - 3
JO - Intelligent Computing
JF - Intelligent Computing
M1 - 0073
ER -