TY - JOUR
T1 - Semantic Attention and LLM-based Layout Guidance for Text-to-Image Generation
AU - Song, Yuxiang
AU - Long, Zhaoguang
AU - Lan, Man
AU - Sun, Changzhi
AU - Zhou, Aimin
AU - Chen, Yuefeng
AU - Yuan, Hao
AU - Cao, Fei
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Diffusion models have substantially advanced text-to-image generation, achieving remarkable performance in creating high-quality images from textual prompts. However, they often struggle to accurately generate images that reflect spatial locations described or implied in the prompts. To address this, we introduce SALT, a training-free method leveraging semantic attention and layout guidance from Large Language Models (LLMs) for text-to-image generation. This method guides both cross-attention and self-attention layers within diffusion models, steering generation toward the high-attention regions specified by the layout guidance. During the denoising process of the diffusion model, image features in the latent space are iteratively refined using a loss function computed from the desired attention maps. We evaluate our approach on two benchmarks, providing detailed qualitative examples and comprehensive quantitative analyses. Results demonstrate that SALT outperforms existing training-free methods in controlling object layouts and generating attributes.
AB - Diffusion models have substantially advanced text-to-image generation, achieving remarkable performance in creating high-quality images from textual prompts. However, they often struggle to accurately generate images that reflect spatial locations described or implied in the prompts. To address this, we introduce SALT, a training-free method leveraging semantic attention and layout guidance from Large Language Models (LLMs) for text-to-image generation. This method guides both cross-attention and self-attention layers within diffusion models, steering generation toward the high-attention regions specified by the layout guidance. During the denoising process of the diffusion model, image features in the latent space are iteratively refined using a loss function computed from the desired attention maps. We evaluate our approach on two benchmarks, providing detailed qualitative examples and comprehensive quantitative analyses. Results demonstrate that SALT outperforms existing training-free methods in controlling object layouts and generating attributes.
KW - attention mechanism
KW - diffusion models
KW - text-to-image generation
KW - training-free method
UR - https://www.scopus.com/pages/publications/105009596761
U2 - 10.1109/ICASSP49660.2025.10890155
DO - 10.1109/ICASSP49660.2025.10890155
M3 - Conference article
AN - SCOPUS:105009596761
SN - 0736-7791
JO - Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
JF - Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
T2 - 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025
Y2 - 6 April 2025 through 11 April 2025
ER -