TY - GEN
T1 - AgentStory
T2 - 2025 International Conference on Multimedia Retrieval, ICMR 2025
AU - Zhou, Tianchen
AU - Duan, Zhongjie
AU - Chen, Cen
AU - Zhou, Wenmeng
AU - Wang, Yanhao
AU - Li, Yaliang
N1 - Publisher Copyright:
© 2025 ACM.
PY - 2025/6/30
Y1 - 2025/6/30
N2 - Story visualization aims to create visual content, such as images and videos, that is consistent, coherent, and complete with respect to a given story. Despite significant advances in applying diffusion models to general text-to-image generation tasks, these models still encounter difficulties when used directly to produce consistent visual content that accurately aligns with the narrative text. In this paper, we propose a novel training-free automated story visualization framework called AgentStory that can generate image illustrations based on a story synopsis provided by users. Specifically, the framework employs multiple agents empowered by Large Language Models (LLMs) to create detailed descriptions of each subject and scene in the entire story. Then, it integrates a masking mechanism with a fine-grained consistency refinement adapter to incorporate different subjects in a scene. Furthermore, it utilizes the visual understanding capabilities of multimodal LLMs to include detailed features of different subjects in the refinement adapter, thus improving the consistency of each subject across multiple scenes. Finally, we compare the AgentStory framework with state-of-the-art baselines for story visualization on the DS-500 dataset and demonstrate its superior performance in terms of subject consistency, text-image alignment, and aesthetic quality. Our code is publicly available at https://github.com/tc2000731/AgentStory.
AB - Story visualization aims to create visual content, such as images and videos, that is consistent, coherent, and complete with respect to a given story. Despite significant advances in applying diffusion models to general text-to-image generation tasks, these models still encounter difficulties when used directly to produce consistent visual content that accurately aligns with the narrative text. In this paper, we propose a novel training-free automated story visualization framework called AgentStory that can generate image illustrations based on a story synopsis provided by users. Specifically, the framework employs multiple agents empowered by Large Language Models (LLMs) to create detailed descriptions of each subject and scene in the entire story. Then, it integrates a masking mechanism with a fine-grained consistency refinement adapter to incorporate different subjects in a scene. Furthermore, it utilizes the visual understanding capabilities of multimodal LLMs to include detailed features of different subjects in the refinement adapter, thus improving the consistency of each subject across multiple scenes. Finally, we compare the AgentStory framework with state-of-the-art baselines for story visualization on the DS-500 dataset and demonstrate its superior performance in terms of subject consistency, text-image alignment, and aesthetic quality. Our code is publicly available at https://github.com/tc2000731/AgentStory.
KW - diffusion models
KW - llm agents
KW - multimodal llms
KW - story visualization
KW - text-to-image generation
UR - https://www.scopus.com/pages/publications/105011591311
U2 - 10.1145/3731715.3733271
DO - 10.1145/3731715.3733271
M3 - Conference contribution
AN - SCOPUS:105011591311
T3 - ICMR 2025 - Proceedings of the 2025 International Conference on Multimedia Retrieval
SP - 1894
EP - 1902
BT - ICMR 2025 - Proceedings of the 2025 International Conference on Multimedia Retrieval
PB - Association for Computing Machinery, Inc
Y2 - 30 June 2025 through 3 July 2025
ER -