Supervisor Alignment Framework: Enhancing LLM Alignment with Query-Ignoring Strategy and Multi-Agent Interaction

Ziqun Bao, Yu Ji, Wen Wu, Xi Chen, Liang He

Research output: Contribution to journal › Conference article › peer-review

Abstract

The growing focus on value alignment in Large Language Models (LLMs) underscores the need to keep model outputs consistent with human morals and free of biased or harmful content. However, LLMs aligned with existing methods remain vulnerable to adversarial prompt attacks. Inspired by psychology, this paper introduces a Supervisor Alignment framework that incorporates a novel query-ignoring strategy: the supervisor never receives the user query and therefore cannot be influenced by adversarial prompts embedded in it. The study also compares the efficacy of a single supervisor against a team of supervisors on value alignment tasks. Our single-agent approach uses a standalone supervisor, optionally augmented with Retrieval-Augmented Generation (RAG), while our team approach relies on multi-agent collaboration through voting, cooperation, and debate strategies. Extensive experiments demonstrate that the Supervisor Alignment framework, combining the query-ignoring strategy with multi-agent collaboration, effectively defends against adversarial prompts and improves performance on value alignment tasks.
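To make the mechanism concrete, the sketch below shows one plausible way the query-ignoring strategy and a majority-vote supervisor team could fit together. It is a minimal illustration written against a generic chat-completion callable; all names (query_llm stand-ins, SUPERVISOR_PROMPT, guarded_answer) are assumptions for exposition, not the authors' implementation.

```python
from collections import Counter
from typing import Callable, List

# Stand-in type for any chat-completion call (e.g., an API or local model client).
LLMFn = Callable[[str], str]

# The supervisor is shown ONLY the candidate response, never the user query,
# so adversarial instructions hidden in the query cannot reach it.
SUPERVISOR_PROMPT = (
    "You are a safety supervisor. Judge ONLY the candidate response below; "
    "you are deliberately not shown the user's query. "
    "Reply with exactly SAFE or UNSAFE.\n\nCandidate response:\n{response}"
)

def supervise(response: str, supervisor: LLMFn) -> str:
    """Single query-ignoring supervisor: judges the response in isolation."""
    verdict = supervisor(SUPERVISOR_PROMPT.format(response=response))
    return "UNSAFE" if "UNSAFE" in verdict.upper() else "SAFE"

def supervise_by_vote(response: str, team: List[LLMFn]) -> str:
    """Team variant: independent supervisors vote; the majority verdict wins."""
    votes = Counter(supervise(response, member) for member in team)
    return votes.most_common(1)[0][0]

def guarded_answer(query: str, assistant: LLMFn, team: List[LLMFn]) -> str:
    """The assistant sees the (possibly adversarial) query; supervisors do not."""
    draft = assistant(query)
    if supervise_by_vote(draft, team) == "UNSAFE":
        return "I can't help with that request."
    return draft
```

An odd-sized team avoids ties in the simple majority vote; the paper's cooperation and debate strategies would replace this voting step with richer inter-agent interaction.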

Keywords

  • AI Alignment
  • Adversarial prompts
  • Large Language Model
  • Security
