Abstract
The body movements accompanying speech help speakers express their ideas, and co-speech motion generation is an important approach to synthesizing realistic avatars. Owing to the intricate correspondence between speech and motion, generating realistic and synchronous motion is challenging. In this paper, we propose MMoFusion, a Multi-modal co-speech Motion generation framework based on the difFusion model that ensures both the authenticity and diversity of generated motion. We propose progressive fusion to enhance inter-modal and intra-modal interaction and to integrate multi-modal information efficiently. Specifically, we employ a masked style matrix based on emotion and identity information to control the generation of different motion styles. Temporal modeling of speech and motion is partitioned into style-guided specific feature encoding and shared feature encoding, aiming to learn both inter-modal and intra-modal features. In addition, we propose a geometric loss that enforces coherence of joint velocities and accelerations across frames. Our framework generates vivid, diverse, and style-controllable motion of arbitrary length from input speech, with editable identity and emotion. Extensive experiments demonstrate that our method outperforms current co-speech motion generation methods on both the upper body and the more challenging full body. Our code and model will be released on our website.
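The geometric loss mentioned in the abstract penalizes frame-to-frame incoherence in joint velocities and accelerations. The sketch below is a minimal, hypothetical PyTorch illustration of such a term, not the paper's exact formulation: it assumes motion is stored as joint positions of shape (batch, frames, joints, 3), approximates velocity and acceleration with finite differences, and matches them to the ground truth with an L1 penalty.

```python
import torch
import torch.nn.functional as F

def geometric_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Illustrative velocity/acceleration coherence loss.

    pred, gt: joint positions of shape (batch, frames, joints, 3).
    """
    # First-order finite differences approximate per-frame joint velocities.
    vel_pred = pred[:, 1:] - pred[:, :-1]
    vel_gt = gt[:, 1:] - gt[:, :-1]
    # Second-order differences approximate joint accelerations.
    acc_pred = vel_pred[:, 1:] - vel_pred[:, :-1]
    acc_gt = vel_gt[:, 1:] - vel_gt[:, :-1]
    # Penalizing both terms encourages generated motion to stay smooth
    # and temporally consistent with the target sequence.
    return F.l1_loss(vel_pred, vel_gt) + F.l1_loss(acc_pred, acc_gt)
```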
| Original language | English |
|---|---|
| Article number | 111774 |
| Journal | Pattern Recognition |
| Volume | 169 |
| DOIs | |
| State | Published - Jan 2026 |
Keywords
- Diffusion model
- Human motion synthesis
- Multi-modal learning