TY - JOUR
T1 - Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks
AU - Yi, Xin
AU - Li, Yue
AU - Shi, Dongsheng
AU - Wang, Linlin
AU - Wang, Xiaoling
AU - He, Liang
N1 - Publisher Copyright:
© 2025 Elsevier Ltd
PY - 2026/1/15
Y1 - 2026/1/15
AB - Ensuring safety alignment is a critical requirement for large language models (LLMs), particularly as they are increasingly deployed in real-world applications. Despite considerable advancements, LLMs remain vulnerable to jailbreak attacks that bypass safety measures and elicit harmful outputs. Adversarial training has shown potential as a defense but often leads to over-defense, where benign inputs are excessively refused, thereby compromising model usability. To tackle these dual challenges, we propose LATPC, a novel framework that integrates Latent-space Adversarial Training with Post-aware Calibration. LATPC selectively identifies safety-critical latent dimensions by contrasting harmful and benign queries, enabling mask-based refusal feature removal attacks followed by adversarial training. During inference, an efficient embedding-level calibration mechanism mitigates over-defense by aligning pseudo-harmful embeddings with their harmless counterparts. Experiments on representative jailbreak attack types show that LATPC achieves a superior balance between safety and utility compared to existing defense frameworks. Notably, it reduces the attack success rate to 0% on HumanJailbreaks and GPTFUZZER. Compared to the state-of-the-art adversarial training baseline, LATPC further reduces the over-refusal rate from 29.2% to 26.2%. Our code is publicly available at https://github.com/xinykou/Against_Jailbreak.
KW - Jailbreak attacks
KW - Large language model
KW - Safety alignment
UR - https://www.scopus.com/pages/publications/105011857206
U2 - 10.1016/j.eswa.2025.129101
DO - 10.1016/j.eswa.2025.129101
M3 - Article
AN - SCOPUS:105011857206
SN - 0957-4174
VL - 296
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 129101
ER -