Semantic Attention and LLM-based Layout Guidance for Text-to-Image Generation

Yuxiang Song, Zhaoguang Long, Man Lan, Changzhi Sun, Aimin Zhou, Yuefeng Chen, Hao Yuan, Fei Cao

Research output: Contribution to journal › Conference article › peer-review

Abstract

Diffusion models have substantially advanced text-to-image generation, achieving remarkable performance in creating high-quality images from textual prompts. However, they often struggle to accurately render the spatial locations described or implied in a prompt. To address this, we introduce SALT, a training-free method that leverages semantic attention and layout guidance from Large Language Models (LLMs) for text-to-image generation. The method guides both the cross-attention and self-attention layers within diffusion models, steering generation toward the regions of high attention values specified by the layout guidance. During the denoising process, the image features in the latent space are iteratively refined based on a loss computed from the desired attention maps. We evaluate our approach on two benchmarks, providing detailed qualitative examples and comprehensive quantitative analyses. The results demonstrate that SALT outperforms existing training-free methods in controlling object layouts and generating correct attributes.
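The abstract does not give SALT's exact update rule, but the mechanism it describes, iteratively nudging the latent so that attention mass concentrates inside LLM-proposed layout regions, follows a common training-free guidance pattern. Below is a minimal PyTorch sketch of that general pattern; `unet_cross_attention_maps`, `layout_loss`, and `refine_latent` are hypothetical stand-ins for illustration, not the paper's actual implementation.

```python
import torch

# Hypothetical stand-in for extracting per-token cross-attention maps from a
# diffusion UNet at timestep t; a real implementation would hook the model's
# attention layers (e.g., in Stable Diffusion). Input: latent (C, H, W);
# output: (num_tokens, H, W).
def unet_cross_attention_maps(latent: torch.Tensor, t: int) -> torch.Tensor:
    return torch.sigmoid(latent.mean(dim=0, keepdim=True).expand(4, -1, -1) + 0.01 * t)

# Assumed loss: for each token, penalize attention mass that falls outside
# its LLM-provided layout mask (masks: (num_tokens, H, W), values in {0, 1}).
def layout_loss(attn: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    inside = (attn * masks).sum(dim=(1, 2))
    total = attn.sum(dim=(1, 2)) + 1e-8
    return (1.0 - inside / total).mean()

# One guidance step per denoising iteration: shift the latent down the
# gradient of the layout loss before the scheduler's usual update.
def refine_latent(latent: torch.Tensor, masks: torch.Tensor, t: int,
                  step_size: float = 0.1) -> torch.Tensor:
    latent = latent.detach().requires_grad_(True)
    loss = layout_loss(unet_cross_attention_maps(latent, t), masks)
    grad, = torch.autograd.grad(loss, latent)
    return (latent - step_size * grad).detach()

# Toy usage: place token 0's object in the top-left quarter of the canvas.
masks = torch.zeros(4, 64, 64)
masks[0, :32, :32] = 1.0
latent = torch.randn(4, 64, 64)
for t in reversed(range(50)):
    latent = refine_latent(latent, masks, t)
    # ...followed by the diffusion scheduler's denoising step on `latent`
```

Because the guidance acts only on the latent during sampling, no model weights are touched, which is what makes this family of methods training-free.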

Keywords

  • attention mechanism
  • diffusion models
  • text-to-image generation
  • training-free method
