Fine-grained detoxification framework via instance-level prefixes for large language models

Research output: Contribution to journal › Article › peer-review

Abstract

Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, their practical usability is often compromised by a propensity to generate toxic content, such as insults, threats, and profanity, particularly in response to adversarial prompts. Several fine-tuning and decoding approaches have been employed to mitigate toxicity. Nevertheless, these methods typically require additional resources, such as high-quality training data or auxiliary models, thereby incurring extra costs. In this paper, we propose a novel detoxification framework, Fine-Grained Detoxification via Instance-Level Prefixes (FGDILP), which effectively mitigates the generation of toxic text without incurring additional training costs. Specifically, FGDILP leverages contextualized representations in attention space by contrasting a positive prefix-prepended prompt against multiple negative prefix-prepended prompts at the instance level. This contrast yields fine-grained subtoxicity vectors, which are then fused to adjust the original generation pathway when the model responds to raw prompts. Our results demonstrate that FGDILP enables controlled text generation with respect to detoxification at both the utterance and context levels. While our method slightly degrades generation fluency and diversity, it significantly outperforms prompt-based baselines in detoxification effectiveness. Our code is available at https://github.com/xinykou/FGDILP.
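To make the contrastive steering idea in the abstract concrete, below is a minimal, illustrative Python sketch. It assumes the attention-space representations have already been extracted (e.g., via forward hooks on a decoder-only model); the function names, the uniform fusion rule, and the scale alpha are assumptions made for illustration, not the paper's exact formulation (see the released code at the URL above for the actual method).

    import torch

    def subtoxicity_vectors(pos_state, neg_states):
        # One fine-grained vector per negative prefix: the displacement
        # from each negative-prefix representation toward the
        # positive-prefix representation.
        return [pos_state - neg for neg in neg_states]

    def fuse(vectors, weights=None):
        # Fuse the per-prefix vectors into a single steering direction.
        # Uniform weighting is an assumption; the paper's fusion rule
        # may differ.
        stacked = torch.stack(vectors)               # (n_neg, hidden_dim)
        if weights is None:
            weights = torch.full((len(vectors),), 1.0 / len(vectors))
        return (weights.unsqueeze(-1) * stacked).sum(dim=0)

    def steer(raw_state, direction, alpha=1.0):
        # Shift the raw prompt's representation along the fused
        # detoxification direction; alpha is an illustrative scale.
        return raw_state + alpha * direction

    # Toy usage with random stand-ins for attention-space states.
    hidden_dim = 768
    pos = torch.randn(hidden_dim)                        # positive-prefix prompt
    negs = [torch.randn(hidden_dim) for _ in range(3)]   # negative-prefix prompts
    direction = fuse(subtoxicity_vectors(pos, negs))
    steered = steer(torch.randn(hidden_dim), direction, alpha=0.5)

In this reading, each negative prefix isolates one facet of toxicity, so contrasting against several of them yields finer-grained directions than a single positive/negative pair would.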

Original language: English
Article number: 128684
Journal: Neurocomputing
Volume: 611
DOIs
State: Published - 1 Jan 2025

Keywords

  • Detoxification framework
  • Large language model
  • Safety and security
