MMoFusion: Multi-modal co-speech motion generation with diffusion model

  • Sen Wang
  • Jiangning Zhang
  • Xin Tan*
  • Zhifeng Xie
  • Chengjie Wang
  • Lizhuang Ma

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Body movements that accompany speech help speakers express their ideas. Co-speech motion generation is an important approach to synthesizing realistic avatars. Because the correspondence between speech and motion is intricate, generating realistic, synchronized motion is challenging. In this paper, we propose MMoFusion, a Multi-modal co-speech Motion generation framework based on a difFusion model, to ensure both the authenticity and the diversity of the generated motion. We propose progressive fusion to strengthen inter-modal and intra-modal interaction and to integrate multi-modal information efficiently. Specifically, we employ a masked style matrix based on emotion and identity information to control the generation of different motion styles. Temporal modeling of speech and motion is partitioned into style-guided specific feature encoding and shared feature encoding, with the aim of learning both inter-modal and intra-modal features. In addition, we propose a geometric loss that enforces coherence of the joints' velocity and acceleration across frames. Given input speech and editable identity and emotion settings, our framework generates vivid, diverse, and style-controllable motion of arbitrary length. Extensive experiments demonstrate that our method outperforms current co-speech motion generation methods on both upper-body and the more challenging full-body generation. Our code and model will be released at our website.
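The abstract does not give the exact form of the geometric loss, so the following is only a minimal sketch of one common way to penalize velocity and acceleration mismatch between frames, using first and second finite differences. The function names, the flattened per-frame joint layout, the mean-squared-error metric, and the weights `w_vel` and `w_acc` are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence

# One frame = flattened joint coordinates (assumed layout, for illustration only).
Frame = Sequence[float]


def finite_difference(seq: Sequence[Frame]) -> List[List[float]]:
    """Frame-to-frame difference: out[t] = seq[t + 1] - seq[t]."""
    return [[b - a for a, b in zip(prev, nxt)] for prev, nxt in zip(seq, seq[1:])]


def mean_squared_error(pred: Sequence[Frame], target: Sequence[Frame]) -> float:
    """Mean of squared element-wise differences; 0.0 for empty input."""
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


def geometric_loss(
    pred: Sequence[Frame],
    target: Sequence[Frame],
    w_vel: float = 1.0,  # hypothetical weight on the velocity term
    w_acc: float = 1.0,  # hypothetical weight on the acceleration term
) -> float:
    """Assumed form: weighted MSE on joint velocities plus MSE on accelerations."""
    pred_vel, target_vel = finite_difference(pred), finite_difference(target)
    pred_acc, target_acc = finite_difference(pred_vel), finite_difference(target_vel)
    return (
        w_vel * mean_squared_error(pred_vel, target_vel)
        + w_acc * mean_squared_error(pred_acc, target_acc)
    )
```

Under these assumptions, a constant positional offset between the predicted and reference motion contributes nothing to the loss, because only frame-to-frame changes are compared; a term of this kind would normally be combined with a position or reconstruction loss.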

Original language: English
Article number: 111774
Journal: Pattern Recognition
Volume: 169
DOIs
State: Published - Jan 2026

Keywords

  • Diffusion model
  • Human motion synthesis
  • Multi-modal learning
