Temporal-Conditioned Symbolic Alignment for Controllable Text-to-Music Generation

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In recent years, Text-to-Music (T2M) generation models have rapidly emerged as powerful tools for content creation across fields. While existing models have made notable progress in sound quality, instrument identification, and stylistic alignment, they still exhibit clear limitations in modeling musical structure and musicality, particularly in terms of harmonic coherence and rhythmic alignment. To address these issues, we propose Temporal-Conditioned Symbolic Alignment for Controllable Text-to-Music Generation (TCSA), which introduces explicit local condition controls to enhance structural fidelity in music generation. Specifically, we design a music theory enrichment strategy based on GPT-2 that transforms input text into detailed descriptions with embedded music theory knowledge, from which accurate chord progressions and rhythmic patterns are extracted as generation conditions. To synchronize these local features effectively, we develop a temporal alignment feature fusion mechanism. Additionally, we propose a layer-skipping fine-tuning strategy to avoid overfitting and enable fine-grained structural modeling. Finally, we introduce a perception-driven loss function based on Mel spectrograms to optimize the harmonic consistency and structural coherence of the generated music. Experimental results demonstrate that TCSA achieves competitive generation quality while offering significantly improved controllability over musical structure, making it well-suited for professional music production and refined content creation.
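The abstract's perception-driven loss on Mel spectrograms could look roughly like the following sketch. The triangular Mel filter-bank construction, the STFT parameters, and the L1 distance are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a perception-driven Mel-spectrogram loss.
# Filter-bank design, STFT parameters, and L1 distance are assumptions
# for illustration; the paper's exact loss is not specified here.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular Mel filters mapping |STFT| bins to n_mels bands."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope
            if center > left:
                fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):         # falling slope
            if right > center:
                fb[i - 1, k] = (right - k) / (right - center)
    return fb

def mel_spectrogram(x, sr=16000, n_fft=512, hop=128, n_mels=64):
    """Log-Mel spectrogram of a mono waveform via a Hann-windowed STFT."""
    win = np.hanning(n_fft)
    frames = np.stack([x[s:s + n_fft] * win
                       for s in range(0, len(x) - n_fft + 1, hop)])
    mag = np.abs(np.fft.rfft(frames, axis=1))            # (T, n_fft//2+1)
    mel = mag @ mel_filterbank(n_mels, n_fft, sr).T      # (T, n_mels)
    return np.log(mel + 1e-6)

def mel_loss(generated, reference, **kw):
    """Mean L1 distance between log-Mel spectrograms of two waveforms."""
    return float(np.mean(np.abs(mel_spectrogram(generated, **kw)
                                - mel_spectrogram(reference, **kw))))
```

Compared on log-Mel features rather than raw samples, such a loss penalizes perceptually salient spectral mismatches (e.g. a wrong chord tone) more directly than a waveform-domain distance would.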

Original language: English
Title of host publication: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
Publisher: Association for Computing Machinery, Inc
Pages: 10728-10737
Number of pages: 10
ISBN (Electronic): 9798400720352
DOIs
State: Published - 27 Oct 2025
Event: 33rd ACM International Conference on Multimedia, MM 2025 - Dublin, Ireland
Duration: 27 Oct 2025 – 31 Oct 2025

Publication series

Name: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025

Conference

Conference: 33rd ACM International Conference on Multimedia, MM 2025
Country/Territory: Ireland
City: Dublin
Period: 27/10/25 – 31/10/25

Keywords

  • AIGC
  • diffusion model
  • multi-modal learning
  • text-to-music

