TY - GEN
T1 - Responsible Diffusion Models via Constraining Text Embeddings within Safe Regions
AU - Li, Zhiwen
AU - Chen, Die
AU - Fan, Mingyuan
AU - Chen, Cen
AU - Li, Yaliang
AU - Wang, Yanhao
AU - Zhou, Wenmeng
N1 - Publisher Copyright:
© 2025 Copyright held by the owner/author(s).
PY - 2025/4/28
Y1 - 2025/4/28
N2 - The remarkable ability of diffusion models to generate high-fidelity images has led to their widespread adoption. However, concerns have also arisen regarding their potential to produce Not Safe for Work (NSFW) content and exhibit social biases, hindering their practical use in real-world applications. In response to this challenge, prior work has focused on employing security filters to identify and exclude toxic text, or alternatively, fine-tuning pre-trained diffusion models to erase sensitive concepts. Unfortunately, existing methods struggle to achieve satisfactory performance in the sense that they can have a significant impact on the normal model output while still failing to prevent the generation of harmful content in some cases. In this paper, we propose a novel self-discovery approach to identifying a semantic direction vector in the embedding space to restrict text embedding within a safe region. Our method circumvents the need for correcting individual words within the input text and steers the entire text prompt towards a safe region in the embedding space, thereby enhancing model robustness against all possibly unsafe prompts. In addition, we employ Low-Rank Adaptation (LoRA) for semantic direction vector initialization to reduce the impact on the model performance for other semantics. Furthermore, our method can also be integrated with existing methods to improve their social responsibility. Extensive experiments on benchmark datasets demonstrate that our method can effectively reduce NSFW content and mitigate social bias generated by diffusion models compared to several state-of-the-art baselines.
AB - The remarkable ability of diffusion models to generate high-fidelity images has led to their widespread adoption. However, concerns have also arisen regarding their potential to produce Not Safe for Work (NSFW) content and exhibit social biases, hindering their practical use in real-world applications. In response to this challenge, prior work has focused on employing security filters to identify and exclude toxic text, or alternatively, fine-tuning pre-trained diffusion models to erase sensitive concepts. Unfortunately, existing methods struggle to achieve satisfactory performance in the sense that they can have a significant impact on the normal model output while still failing to prevent the generation of harmful content in some cases. In this paper, we propose a novel self-discovery approach to identifying a semantic direction vector in the embedding space to restrict text embedding within a safe region. Our method circumvents the need for correcting individual words within the input text and steers the entire text prompt towards a safe region in the embedding space, thereby enhancing model robustness against all possibly unsafe prompts. In addition, we employ Low-Rank Adaptation (LoRA) for semantic direction vector initialization to reduce the impact on the model performance for other semantics. Furthermore, our method can also be integrated with existing methods to improve their social responsibility. Extensive experiments on benchmark datasets demonstrate that our method can effectively reduce NSFW content and mitigate social bias generated by diffusion models compared to several state-of-the-art baselines.
KW - AI Fairness
KW - AI Safety
KW - Diffusion Model
KW - Responsible Generative AI
KW - Text-to-Image Generation
UR - https://www.scopus.com/pages/publications/105005141402
U2 - 10.1145/3696410.3714912
DO - 10.1145/3696410.3714912
M3 - 会议稿件
AN - SCOPUS:105005141402
T3 - WWW 2025 - Proceedings of the ACM Web Conference
SP - 1588
EP - 1601
BT - WWW 2025 - Proceedings of the ACM Web Conference
PB - Association for Computing Machinery, Inc
T2 - 34th ACM Web Conference, WWW 2025
Y2 - 28 April 2025 through 2 May 2025
ER -