AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization via Multi-LLMs

  • Jiawei Chen
  • Xiao Yang
  • Zhengwei Fang
  • Yu Tian
  • Yinpeng Dong
  • Zhaoxia Yin*
  • Hang Su*
  • *Corresponding author for this work
  • East China Normal University
  • Tsinghua University
  • RealAI
  • Zhongguancun Laboratory

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

Recent studies show that large language models (LLMs) are vulnerable to jailbreak attacks, which can bypass their defense mechanisms. However, existing jailbreak research often exhibits limitations in universality, validity, and efficiency. Therefore, we rethink jailbreaking LLMs and define three key properties to guide the design of effective jailbreak methods. We introduce AutoBreach, a novel black-box approach that uses wordplay-guided mapping rule sampling to create universal adversarial prompts. By leveraging LLMs’ summarization and reasoning abilities, AutoBreach minimizes manual effort. To boost jailbreak success rates, we further suggest sentence compression and chain-of-thought-based mapping rules to correct errors and wordplay misinterpretations in target LLMs. Also, we propose a two-stage mapping rule optimization that initially optimizes mapping rules before querying target LLMs to enhance efficiency. Experimental results indicate AutoBreach efficiently identifies security vulnerabilities across various LLMs (Claude-3, GPT-4, etc.), achieving an average success rate of over 80% with fewer than 10 queries. Notably, the adversarial prompts generated by AutoBreach for GPT-4 can directly bypass the defenses of the advanced commercial LLM GPT o1-preview, demonstrating strong transferability and universality.

Original language: English
Title of host publication: 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics
Subtitle of host publication: Proceedings of the Conference Findings, NAACL 2025
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Publisher: Association for Computational Linguistics (ACL)
Pages: 6792-6813
Number of pages: 22
ISBN (electronic): 9798891761957
DOI
Publication status: Published - 2025
Event: 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, NAACL 2025 - Albuquerque, United States
Duration: 29 Apr 2025 to 4 May 2025

Publication series

Name: 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Proceedings of the Conference Findings, NAACL 2025

Conference

Conference: 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, NAACL 2025
Country/Territory: United States
City: Albuquerque
Period: 29/04/25 to 4/05/25
