TY - GEN
T1 - ExVideo
T2 - 34th Internationa Joint Conference on Artificial Intelligence, IJCAI 2025
AU - Duan, Zhongjie
AU - Zhang, Hong
AU - Zhou, Wenmeng
AU - Chen, Cen
AU - Li, Yaliang
AU - Zhang, Yu
AU - Chen, Yingda
N1 - Publisher Copyright:
© 2025 International Joint Conferences on Artificial Intelligence. All rights reserved.
PY - 2025
Y1 - 2025
N2 - Recently, advancements in video synthesis have attracted significant attention. Video synthesis models have demonstrated the practical applicability of diffusion models in creating dynamic visual content. Despite these advancements, the extension of video lengths remains constrained by computational resources. Most existing video synthesis models are limited to generating short video clips. In this paper, we propose a novel post-tuning methodology for video synthesis models, called ExVideo. This approach is designed to enhance the capability of current video synthesis models, allowing them to produce content over extended temporal durations while incurring lower training expenditures. In particular, we design extension strategies across common temporal model architectures respectively, including 3D convolution, temporal attention, and positional embedding. To evaluate the efficacy of our proposed post-tuning approach, we trained ExSVD, an extended model based on Stable Video Diffusion model. Our approach enhances the model's capacity to generate up to 5× its original number of frames, requiring only 1.5k GPU hours of training on a dataset comprising 40k videos. Importantly, the substantial increase in video length doesn't compromise the model's innate generalization capabilities, and the model showcases its advantages in generating videos of diverse styles and resolutions. We have released the source code and the enhanced model publicly.
AB - Recently, advancements in video synthesis have attracted significant attention. Video synthesis models have demonstrated the practical applicability of diffusion models in creating dynamic visual content. Despite these advancements, the extension of video lengths remains constrained by computational resources. Most existing video synthesis models are limited to generating short video clips. In this paper, we propose a novel post-tuning methodology for video synthesis models, called ExVideo. This approach is designed to enhance the capability of current video synthesis models, allowing them to produce content over extended temporal durations while incurring lower training expenditures. In particular, we design extension strategies across common temporal model architectures respectively, including 3D convolution, temporal attention, and positional embedding. To evaluate the efficacy of our proposed post-tuning approach, we trained ExSVD, an extended model based on Stable Video Diffusion model. Our approach enhances the model's capacity to generate up to 5× its original number of frames, requiring only 1.5k GPU hours of training on a dataset comprising 40k videos. Importantly, the substantial increase in video length doesn't compromise the model's innate generalization capabilities, and the model showcases its advantages in generating videos of diverse styles and resolutions. We have released the source code and the enhanced model publicly.
UR - https://www.scopus.com/pages/publications/105021803493
U2 - 10.24963/ijcai.2025/1118
DO - 10.24963/ijcai.2025/1118
M3 - 会议稿件
AN - SCOPUS:105021803493
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 10063
EP - 10071
BT - Proceedings of the 34th International Joint Conference on Artificial Intelligence, IJCAI 2025
A2 - Kwok, James
PB - International Joint Conferences on Artificial Intelligence
Y2 - 16 August 2025 through 22 August 2025
ER -