AgentStory: A Multi-Agent System for Story Visualization with Multi-Subject Consistent Text-to-Image Generation

  • Tianchen Zhou
  • Zhongjie Duan
  • Cen Chen*
  • Wenmeng Zhou
  • Yanhao Wang
  • Yaliang Li

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Story visualization aims to create visual content, such as images and videos, that is consistent, coherent, and complete with respect to a given story. Despite significant advances in applying diffusion models to general text-to-image generation, these models still struggle when used directly to produce consistent visual content that accurately aligns with the narrative text. In this paper, we propose AgentStory, a novel training-free automated story visualization framework that generates image illustrations from a story synopsis provided by users. Specifically, the framework employs multiple agents powered by Large Language Models (LLMs) to create detailed descriptions of each subject and scene in the entire story. It then integrates a masking mechanism with a fine-grained consistency refinement adapter to incorporate different subjects in a scene. Furthermore, it uses the visual understanding capabilities of multimodal LLMs to include detailed features of different subjects in the refinement adapter, thus improving the consistency of each subject across multiple scenes. Finally, we compare the AgentStory framework with state-of-the-art baselines for story visualization on the DS-500 dataset and demonstrate its superior performance in terms of subject consistency, text-image alignment, and aesthetic quality. Our code is publicly available at https://github.com/tc2000731/AgentStory.

Original language: English
Title of host publication: ICMR 2025 - Proceedings of the 2025 International Conference on Multimedia Retrieval
Publisher: Association for Computing Machinery, Inc
Pages: 1894-1902
Number of pages: 9
ISBN (Electronic): 9798400718779
State: Published - 30 Jun 2025
Event: 2025 International Conference on Multimedia Retrieval, ICMR 2025 - Chicago, United States
Duration: 30 Jun 2025 to 3 Jul 2025

Publication series

Name: ICMR 2025 - Proceedings of the 2025 International Conference on Multimedia Retrieval

Conference

Conference: 2025 International Conference on Multimedia Retrieval, ICMR 2025
Country/Territory: United States
City: Chicago
Period: 30/06/25 to 3/07/25

Keywords

  • diffusion models
  • llm agents
  • multimodal llms
  • story visualization
  • text-to-image generation
