Abstract
By leveraging the abundant speech data available for pretraining, current models generalize well across diverse tasks. Nevertheless, real-world challenges emerge in unfamiliar speech scenarios far from the pretraining speech, owing to the domain shift between the pretraining (source) and fine-tuning (target) data. To overcome this barrier, we propose a variational Bayesian adapter (VB-Adapter) for cross-domain speech representation learning during fine-tuning. First, we establish a latent variable model that constructs the desired posterior distribution after incorporating domain-specific knowledge, bridging the gap between the source and target domains. Then, an adaptive objective is presented that maximizes the mutual information between the latent variables with and without domain-specific knowledge to facilitate model adaptation. Finally, we introduce contrastive learning over samples to optimize a lower bound of this adaptive objective. Our experiments apply the VB-Adapter to transformers for dysarthric speech recognition (DSR) and to the integration of a Whisper encoder and Llama for Mandarin speech recognition (MSR). The results reveal the effectiveness of the VB-Adapter in modeling the uncertainties arising from domain shift and in enhancing the robustness of speech representations.
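The abstract's final step, contrastive learning as a lower bound on the mutual information between paired latents, is commonly realized with an InfoNCE-style bound. The sketch below is a minimal, hedged illustration of that general idea (not the paper's actual objective): given paired latent representations `z_src` (without domain-specific knowledge) and `z_tgt` (with it), it computes the InfoNCE bound `I(z_src; z_tgt) >= log N - L_NCE`, where positives sit on the diagonal of the similarity matrix. The function name, temperature value, and cosine-similarity choice are all assumptions for illustration.

```python
import numpy as np

def info_nce_lower_bound(z_src, z_tgt, temperature=0.1):
    """InfoNCE lower bound on the mutual information I(z_src; z_tgt).

    Illustrative sketch only (not the paper's exact objective).
    z_src, z_tgt: (N, D) arrays of paired latent representations,
    e.g. latents without / with domain-specific knowledge.
    Returns a scalar lower bound in nats: log N minus the NCE loss.
    """
    # L2-normalize so the dot product is cosine similarity (assumed choice)
    z_src = z_src / np.linalg.norm(z_src, axis=1, keepdims=True)
    z_tgt = z_tgt / np.linalg.norm(z_tgt, axis=1, keepdims=True)

    logits = z_src @ z_tgt.T / temperature           # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # log-sum-exp stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    nce_loss = -np.mean(np.diag(log_probs))          # positives on the diagonal
    n = z_src.shape[0]
    return np.log(n) - nce_loss                      # bound: I >= log N - L_NCE
```

Minimizing the NCE loss tightens the bound, so well-aligned pairs (latents that agree with and without domain knowledge) score higher than mismatched ones; the bound can never exceed `log N`.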
| Field | Value |
|---|---|
| Original language | English |
| Pages (from-to) | 18300-18311 |
| Number of pages | 12 |
| Journal | IEEE Transactions on Neural Networks and Learning Systems |
| Volume | 36 |
| Issue number | 10 |
| DOIs | |
| State | Published - 2025 |
Keywords
- Domain adaptation
- Speech recognition
- Variational Bayesian transformer