Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks

Xin Yi, Yue Li, Dongsheng Shi, Linlin Wang*, Xiaoling Wang, Liang He

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Ensuring safety alignment is a critical requirement for large language models (LLMs), particularly as they are increasingly deployed in real-world applications. Despite considerable advancements, LLMs remain vulnerable to jailbreak attacks that bypass safety measures and elicit harmful outputs. Adversarial training has shown potential as a defense but often leads to over-defense, where benign inputs are excessively refused, thereby compromising model usability. To tackle these dual challenges, we propose LATPC, a novel framework that integrates Latent-space Adversarial Training with Post-aware Calibration. LATPC selectively identifies safety-critical latent dimensions by contrasting harmful and benign queries, enabling mask-based refusal-feature removal attacks followed by adversarial training. During inference, an efficient embedding-level calibration mechanism mitigates over-defense by aligning pseudo-harmful embeddings with their harmless counterparts. Experiments on representative jailbreak attack types show that LATPC achieves a superior balance between safety and utility compared to existing defense frameworks. Notably, it reduces the attack success rate to 0% on HumanJailbreaks and GPTFUZZER. Compared to the state-of-the-art adversarial training baseline, LATPC further reduces the over-refusal rate from 29.2% to 26.2%. Our code is publicly available at https://github.com/xinykou/Against_Jailbreak.
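The dimension-selection and refusal-removal steps described above can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch reconstruction, assuming pooled per-query hidden states for a harmful and a benign query set; all names here (select_safety_mask, remove_refusal_features, top_k, refusal_dir) are illustrative placeholders, not the authors' released implementation, which is available at the repository linked in the abstract.

import torch

def select_safety_mask(harmful_h: torch.Tensor,
                       benign_h: torch.Tensor,
                       top_k: int) -> torch.Tensor:
    """harmful_h, benign_h: (num_queries, hidden_dim) pooled hidden states."""
    # Mean activation difference per latent dimension between query sets.
    diff = harmful_h.mean(dim=0) - benign_h.mean(dim=0)
    # Treat the top-k most divergent dimensions as safety-critical.
    idx = diff.abs().topk(top_k).indices
    mask = torch.zeros_like(diff)
    mask[idx] = 1.0
    return mask  # (hidden_dim,) binary mask

def remove_refusal_features(h: torch.Tensor,
                            mask: torch.Tensor,
                            refusal_dir: torch.Tensor) -> torch.Tensor:
    """One plausible mask-based removal attack: project the refusal
    direction out of the safety-critical dimensions only.
    h: (batch, hidden_dim); mask, refusal_dir: (hidden_dim,)."""
    h_masked = h * mask
    coef = (h_masked @ refusal_dir) / refusal_dir.norm().clamp_min(1e-8) ** 2
    return h - coef.unsqueeze(-1) * (refusal_dir * mask)

In the full framework, the model would then be adversarially trained against such masked-removal perturbations, and the inference-time calibration step would analogously shift pseudo-harmful embeddings toward their harmless counterparts at the embedding level.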

Original language: English
Article number: 129101
Journal: Expert Systems with Applications
Volume: 296
State: Published - 15 Jan 2026

Keywords

  • Jailbreak attacks
  • Large language model
  • Safety alignment
