TY - JOUR
T1 - Fine-grained detoxification framework via instance-level prefixes for large language models
AU - Yi, Xin
AU - Wang, Linlin
AU - Wang, Xiaoling
AU - He, Liang
N1 - Publisher Copyright:
© 2024 Elsevier B.V.
PY - 2025/1/1
Y1 - 2025/1/1
N2 - Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, their practical usability is often compromised by a propensity to generate toxic content, such as insults, threats, and profanity, particularly in response to adversarial prompts. Several fine-tuning and decoding approaches have been employed to mitigate toxicity. Nevertheless, these methods typically necessitate additional resources, such as high-quality training data or auxiliary models, thereby incurring extra costs. In this paper, we propose a novel detoxification framework, Fine-Grained Detoxification via Instance-Level Prefixes (FGDILP), which effectively mitigates the generation of toxic text without incurring additional training costs. Specifically, FGDILP leverages contextualized representations in attention space by contrasting a positive prefix-prepended prompt against multiple negative prefix-prepended prompts at the instance level. This methodology facilitates the construction of fine-grained subtoxicity vectors, which are subsequently fused to adjust the original generation pathway when responding to raw prompts. Our results demonstrate that FGDILP enables controlled text generation concerning detoxification at both the utterance and context levels. While our method slightly impacts generation fluency and diversity, it significantly outperforms prompt-based baselines regarding detoxification effectiveness. Our code is available at https://github.com/xinykou/FGDILP.
AB - Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, their practical usability is often compromised by a propensity to generate toxic content, such as insults, threats, and profanity, particularly in response to adversarial prompts. Several fine-tuning and decoding approaches have been employed to mitigate toxicity. Nevertheless, these methods typically necessitate additional resources, such as high-quality training data or auxiliary models, thereby incurring extra costs. In this paper, we propose a novel detoxification framework, Fine-Grained Detoxification via Instance-Level Prefixes (FGDILP), which effectively mitigates the generation of toxic text without incurring additional training costs. Specifically, FGDILP leverages contextualized representations in attention space by contrasting a positive prefix-prepended prompt against multiple negative prefix-prepended prompts at the instance level. This methodology facilitates the construction of fine-grained subtoxicity vectors, which are subsequently fused to adjust the original generation pathway when responding to raw prompts. Our results demonstrate that FGDILP enables controlled text generation concerning detoxification at both the utterance and context levels. While our method slightly impacts generation fluency and diversity, it significantly outperforms prompt-based baselines regarding detoxification effectiveness. Our code is available at https://github.com/xinykou/FGDILP.
KW - Detoxification framework
KW - Large language model
KW - Safety and security
UR - https://www.scopus.com/pages/publications/85205736862
U2 - 10.1016/j.neucom.2024.128684
DO - 10.1016/j.neucom.2024.128684
M3 - Article
AN - SCOPUS:85205736862
SN - 0925-2312
VL - 611
JO - Neurocomputing
JF - Neurocomputing
M1 - 128684
ER -