A safety realignment framework via subspace-oriented model fusion for large language models

Xin Yi, Shunfan Zheng, Linlin Wang, Xiaoling Wang, Liang He

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

To improve the performance of large language models (LLMs) on specific tasks, task-specific instruction fine-tuning is essential. However, this process can easily compromise a task-specific model's safety, making it susceptible to obeying malicious instructions and generating harmful content. Current defenses against fine-tuning attacks usually interfere with the original fine-tuning objectives or require substantial amounts of data to realign the compromised model. To address these two challenges, we propose reusing the initially aligned model and realigning the task-specific model within a safety subspace. In this paper, we introduce a safety realignment framework based on subspace-oriented model fusion (SOMF), which aims to transfer the safeguard capabilities of the initially aligned model to the current task-specific model. Our approach begins by disentangling the task vectors from the parameters of each task-specific model. We then identify safety-critical regions within these vectors using subspace masking techniques. Finally, we fuse the initially aligned LLM with all task vectors, guided by the identified safety subspace, to restore the model's safety properties. Our experiments confirm that the framework satisfies the safety requirements both of an independent task-specific model and of traditional multitask models during fusion. Our findings show that SOMF preserves safety without notably compromising task performance, while exhibiting higher data efficiency. The code is publicly available at https://github.com/xinykou/safety_realignment.
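The abstract's pipeline (extract task vectors, mask safety-critical regions, fuse back onto the aligned model) can be illustrated with a toy sketch. All names here (`task_vector`, `safety_mask`, `fuse`) and the magnitude-based masking heuristic are illustrative assumptions for exposition, not the paper's actual SOMF implementation, which learns its safety subspace rather than thresholding magnitudes.

```python
# Toy sketch of subspace-masked model fusion on flat weight dicts.
# Assumption: a "model" is a dict mapping parameter names to numpy arrays.
import numpy as np

def task_vector(aligned, finetuned):
    """Task vector = fine-tuned weights minus initially aligned weights."""
    return {k: finetuned[k] - aligned[k] for k in aligned}

def safety_mask(vector, keep_ratio=0.5):
    """Illustrative safety-subspace mask: keep only the smaller-magnitude
    entries of the task vector, on the assumption that the largest updates
    are the ones most likely to overwrite safety-critical regions."""
    masks = {}
    for k, v in vector.items():
        thresh = np.quantile(np.abs(v), keep_ratio)
        masks[k] = (np.abs(v) <= thresh).astype(v.dtype)
    return masks

def fuse(aligned, vectors, masks):
    """Add the masked task vectors back onto the aligned model's weights."""
    fused = {k: v.copy() for k, v in aligned.items()}
    for vec, mask in zip(vectors, masks):
        for k in fused:
            fused[k] += mask[k] * vec[k]
    return fused

# Minimal example with a single task-specific model.
aligned = {"w": np.array([0.0, 0.0, 0.0, 0.0])}
finetuned = {"w": np.array([0.1, -0.2, 2.0, -3.0])}
vec = task_vector(aligned, finetuned)
mask = safety_mask(vec, keep_ratio=0.5)
fused = fuse(aligned, [vec], [mask])
print(fused["w"])  # the two largest updates are masked out of the fusion
```

Passing several task vectors (one per task-specific model) to `fuse` gives the multitask-fusion case the abstract mentions: each vector contributes only within its masked region, while the aligned model supplies the safety-critical parameters.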

Original language: English
Article number: 112701
Journal: Knowledge-Based Systems
Volume: 306
DOIs
State: Published - 20 Dec 2024

Keywords

  • Large language model
  • Model fusion
  • Safety alignment
