Skip to main navigation Skip to search Skip to main content

AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization via Multi-LLMs

  • Jiawei Chen
  • , Xiao Yang
  • , Zhengwei Fang
  • , Yu Tian
  • , Yinpeng Dong
  • , Zhaoxia Yin*
  • , Hang Su*
  • *Corresponding author for this work
  • East China Normal University
  • Tsinghua University
  • RealAI
  • Zhongguancun Laboratory

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

Recent studies show that large language models (LLMs) are vulnerable to jailbreak attacks, which can bypass their defense mechanisms. However, existing jailbreak research often exhibits limitations in universality, validity, and efficiency. Therefore, we rethink jailbreaking LLMs and define three key properties to guide the design of effective jailbreak methods. We introduce AutoBreach, a novel black-box approach that uses wordplay-guided mapping rule sampling to create universal adversarial prompts. By leveraging LLMs’ summarization and reasoning abilities, AutoBreach minimizes manual effort. To boost jailbreak success rates, we further suggest sentence compression and chain-of-thought-based mapping rules to correct errors and wordplay misinterpretations in target LLMs. Also, we propose a two-stage mapping rule optimization that initially optimizes mapping rules before querying target LLMs to enhance efficiency. Experimental results indicate AutoBreach efficiently identifies security vulnerabilities across various LLMs (Claude-3, GPT-4, etc.), achieving an average success rate of over 80% with fewer than 10 queries. Notably, the adversarial prompts generated by AutoBreach for GPT-4 can directly bypass the defenses of the advanced commercial LLM GPT o1-preview, demonstrating strong transferability and universality.

Original languageEnglish
Title of host publication2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics
Subtitle of host publicationProceedings of the Conference Findings, NAACL 2025
EditorsLuis Chiruzzo, Alan Ritter, Lu Wang
PublisherAssociation for Computational Linguistics (ACL)
Pages6792-6813
Number of pages22
ISBN (Electronic)9798891761957
DOIs
StatePublished - 2025
Event2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, NAACL 2025 - Albuquerque, United States
Duration: 29 Apr 20254 May 2025

Publication series

Name2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Proceedings of the Conference Findings, NAACL 2025

Conference

Conference2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, NAACL 2025
Country/TerritoryUnited States
CityAlbuquerque
Period29/04/254/05/25

Fingerprint

Dive into the research topics of 'AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization via Multi-LLMs'. Together they form a unique fingerprint.

Cite this