TY - GEN
T1 - AutoBreach
T2 - 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, NAACL 2025
AU - Chen, Jiawei
AU - Yang, Xiao
AU - Fang, Zhengwei
AU - Tian, Yu
AU - Dong, Yinpeng
AU - Yin, Zhaoxia
AU - Su, Hang
N1 - Publisher Copyright:
©2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
N2 - Recent studies show that large language models (LLMs) are vulnerable to jailbreak attacks, which can bypass their defense mechanisms. However, existing jailbreak research often exhibits limitations in universality, validity, and efficiency. Therefore, we rethink jailbreaking LLMs and define three key properties to guide the design of effective jailbreak methods. We introduce AutoBreach, a novel black-box approach that uses wordplay-guided mapping rule sampling to create universal adversarial prompts. By leveraging LLMs’ summarization and reasoning abilities, AutoBreach minimizes manual effort. To boost jailbreak success rates, we further suggest sentence compression and chain-of-thought-based mapping rules to correct errors and wordplay misinterpretations in target LLMs. Also, we propose a two-stage mapping rule optimization that initially optimizes mapping rules before querying target LLMs to enhance efficiency. Experimental results indicate AutoBreach efficiently identifies security vulnerabilities across various LLMs (Claude-3, GPT-4, etc.), achieving an average success rate of over 80% with fewer than 10 queries. Notably, the adversarial prompts generated by AutoBreach for GPT-4 can directly bypass the defenses of the advanced commercial LLM GPT o1-preview, demonstrating strong transferability and universality.
AB - Recent studies show that large language models (LLMs) are vulnerable to jailbreak attacks, which can bypass their defense mechanisms. However, existing jailbreak research often exhibits limitations in universality, validity, and efficiency. Therefore, we rethink jailbreaking LLMs and define three key properties to guide the design of effective jailbreak methods. We introduce AutoBreach, a novel black-box approach that uses wordplay-guided mapping rule sampling to create universal adversarial prompts. By leveraging LLMs’ summarization and reasoning abilities, AutoBreach minimizes manual effort. To boost jailbreak success rates, we further suggest sentence compression and chain-of-thought-based mapping rules to correct errors and wordplay misinterpretations in target LLMs. Also, we propose a two-stage mapping rule optimization that initially optimizes mapping rules before querying target LLMs to enhance efficiency. Experimental results indicate AutoBreach efficiently identifies security vulnerabilities across various LLMs (Claude-3, GPT-4, etc.), achieving an average success rate of over 80% with fewer than 10 queries. Notably, the adversarial prompts generated by AutoBreach for GPT-4 can directly bypass the defenses of the advanced commercial LLM GPT o1-preview, demonstrating strong transferability and universality.
UR - https://www.scopus.com/pages/publications/105028711494
U2 - 10.18653/v1/2025.findings-naacl.378
DO - 10.18653/v1/2025.findings-naacl.378
M3 - 会议稿件
AN - SCOPUS:105028711494
T3 - 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Proceedings of the Conference Findings, NAACL 2025
SP - 6792
EP - 6813
BT - 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics
A2 - Chiruzzo, Luis
A2 - Ritter, Alan
A2 - Wang, Lu
PB - Association for Computational Linguistics (ACL)
Y2 - 29 April 2025 through 4 May 2025
ER -