Perception-Guided Jailbreak Against Text-to-Image Models

  • Yihao Huang
  • Le Liang*
  • Tianlin Li
  • Xiaojun Jia*
  • Run Wang
  • Weikai Miao
  • Geguang Pu
  • Yang Liu

*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review

6 Scopus citations

Abstract

In recent years, Text-to-Image (T2I) models have garnered significant attention for their remarkable advancements. However, security concerns have emerged over their potential to generate inappropriate or Not-Safe-For-Work (NSFW) images. In this paper, inspired by the observation that texts with different semantics can lead to similar human perceptions, we propose an LLM-driven, perception-guided jailbreak method, termed PGJ. It is a black-box jailbreak method that requires no knowledge of the target T2I model (model-free) and generates highly natural attack prompts. Specifically, we propose identifying a safe phrase that is similar in human perception yet inconsistent in text semantics with the target unsafe word, and using it as a substitution. Experiments conducted on six open-source models and commercial online services, with thousands of prompts, verify the effectiveness of PGJ. Warning: this paper contains NSFW and disturbing imagery, including adult, violent, and illegality-related content. We have masked images deemed unsafe; nevertheless, reader discretion is advised.
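The abstract's core mechanism, replacing an unsafe word with a phrase that is perceptually similar to it but semantically unrelated, can be sketched in a few lines. The code below is a minimal illustration of that substitution loop, not the authors' released implementation: the function names, the instruction text, and the model name are assumptions chosen for concreteness, with the LLM accessed through the OpenAI Python SDK purely as an example. Because the method is black-box and model-free, the sketch never touches a T2I model; the resulting prompt can be submitted to any service unchanged.

```python
# Minimal sketch of the perception-guided substitution idea from the abstract.
# NOT the authors' implementation: helper names, prompt wording, and the model
# name below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def perception_guided_substitute(unsafe_word: str) -> str:
    """Ask an LLM for a safe phrase that humans would perceive as visually
    similar to the unsafe word, while being semantically unrelated in text."""
    instruction = (
        "Suggest a short, policy-safe phrase that a person would perceive as "
        f"visually similar to '{unsafe_word}' in an image, but whose textual "
        "semantics are unrelated. Reply with the phrase only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": instruction}],
    )
    return response.choices[0].message.content.strip()


def build_attack_prompt(t2i_prompt: str, unsafe_word: str) -> str:
    """Replace the unsafe word in the original T2I prompt with the
    perception-preserving substitute, yielding a natural-looking prompt."""
    substitute = perception_guided_substitute(unsafe_word)
    return t2i_prompt.replace(unsafe_word, substitute)
```

Since the substitution is computed purely from the text, one LLM query per unsafe word suffices, which is what makes the attack transferable across open-source models and commercial services alike.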

Original language: English
Pages (from-to): 26238-26247
Number of pages: 10
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Volume: 39
Issue number: 25
State: Published - 11 Apr 2025
Event: 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025 - Philadelphia, United States
Duration: 25 Feb 2025 - 4 Mar 2025
