
Supervisor Alignment Framework: Enhancing LLM Alignment with Query-Ignoring Strategy and Multi-Agent Interaction

  • East China Normal University
  • Shanghai University of Engineering Science
  • Shanghai Key Laboratory of Mental Health and Psychological Crisis Intervention

Research output: Contribution to journal › Conference article › Peer-reviewed

Abstract

The growing focus on value alignment in Large Language Models (LLMs) underscores the need to keep model outputs consistent with human morals and free of biased or harmful content. However, LLMs aligned with existing methods remain vulnerable to adversarial prompt attacks. Inspired by psychology, this paper introduces a Supervisor Alignment framework built around a query-ignoring strategy: the supervisor never receives the user query, so potential adversarial prompts cannot influence it. The study also compares the efficacy of a single supervisor with that of a team of supervisors in value alignment tasks. The single-supervisor approach uses either a standalone agent or an agent integrated with Retrieval-Augmented Generation (RAG), while the team approach relies on multi-agent collaboration through voting, cooperation, and debate strategies. Extensive experiments demonstrate that the Supervisor Alignment framework, combining the query-ignoring strategy with multi-agent collaboration, effectively defends against adversarial prompts and improves performance in value alignment tasks.
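
To make the query-ignoring idea and the voting strategy concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract gives no code, prompts, or model details, so every name here (`keyword_supervisor`, `majority_vote`, `supervised_reply`, the fallback message) is a hypothetical placeholder, and the keyword check stands in for what would presumably be an LLM-based supervisor. The one property the sketch is meant to show is that supervisors are called with the model's response only, never with the user query.

```python
from collections import Counter
from typing import Callable, List

Verdict = str                        # "accept" or "reject"
Supervisor = Callable[[str], Verdict]  # sees the response only, never the query


def keyword_supervisor(banned: List[str]) -> Supervisor:
    """Toy stand-in for an LLM supervisor (assumption, not from the paper)."""
    def judge(response: str) -> Verdict:
        text = response.lower()
        return "reject" if any(word in text for word in banned) else "accept"
    return judge


def majority_vote(supervisors: List[Supervisor], response: str) -> Verdict:
    """Team-of-supervisors voting; ties resolve conservatively to 'reject'."""
    votes = Counter(s(response) for s in supervisors)
    return "accept" if votes["accept"] > votes["reject"] else "reject"


def supervised_reply(
    query: str,
    generate: Callable[[str], str],
    supervisors: List[Supervisor],
    fallback: str = "I can't help with that.",
) -> str:
    """Generate a response, then let supervisors judge it without the query."""
    response = generate(query)
    # Query-ignoring step: only `response` is passed on, so adversarial
    # instructions embedded in `query` never reach the supervisors.
    verdict = majority_vote(supervisors, response)
    return response if verdict == "accept" else fallback
```

Any callable that maps a query to a response can be passed as `generate`. The cooperation and debate strategies mentioned in the abstract would replace `majority_vote` with a different aggregation step, which this sketch does not attempt to model.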
