TY - GEN
T1 - Temporal-Conditioned Symbolic Alignment for Controllable Text-to-Music Generation
AU - Zhang, Zihao
AU - Wu, Xingjiao
AU - Xu, Junjie
AU - Ma, Tianlong
AU - Yao, Tangren
AU - Wu, Wen
AU - He, Liang
N1 - Publisher Copyright:
© 2025 ACM.
PY - 2025/10/27
Y1 - 2025/10/27
AB - In recent years, Text-to-Music (T2M) generation models have rapidly emerged as powerful tools for content creation across fields. While existing models have made notable progress in sound quality, instrument identification, and stylistic alignment, they still exhibit clear limitations in modeling musical structure and musicality, particularly in terms of harmonic coherence and rhythmic alignment. To address these issues, we propose Temporal-Conditioned Symbolic Alignment for Controllable Text-to-Music Generation (TCSA), which introduces explicit local condition controls to enhance structural fidelity in music generation. Specifically, we design a music theory enrichment strategy based on GPT-2 that transforms input text into detailed descriptions with embedded music theory knowledge, from which accurate chord progressions and rhythmic patterns are extracted as generation conditions. To synchronize these local features effectively, we develop a temporal alignment feature fusion mechanism. Additionally, we propose a layer-skipping fine-tuning strategy to avoid overfitting and enable fine-grained structural modeling. Finally, we introduce a perception-driven loss function based on Mel spectrograms to optimize the harmonic consistency and structural coherence of the generated music. Experimental results demonstrate that TCSA achieves competitive generation quality while offering significantly improved controllability over musical structure, making it well-suited for professional music production and refined content creation.
KW - aigc
KW - diffusion model
KW - multi-modal learning
KW - text-to-music
UR - https://www.scopus.com/pages/publications/105024061180
U2 - 10.1145/3746027.3754812
DO - 10.1145/3746027.3754812
M3 - Conference contribution
AN - SCOPUS:105024061180
T3 - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
SP - 10728
EP - 10737
BT - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
PB - Association for Computing Machinery, Inc
T2 - 33rd ACM International Conference on Multimedia, MM 2025
Y2 - 27 October 2025 through 31 October 2025
ER -