Abstract
The growing focus on value alignment for Large Language Models (LLMs) underscores the need to keep model outputs consistent with human values and free of biased or harmful content. However, LLMs aligned with existing methods remain vulnerable to adversarial prompt attacks. Inspired by psychology, this paper introduces a Supervisor Alignment framework that incorporates a novel query-ignoring strategy: the supervisor never receives the user query, so it cannot be influenced by potential adversarial prompts. The study also compares the efficacy of a single supervisor against a team of supervisors in value alignment tasks. The single-agent approach uses a standalone supervisor, optionally integrated with Retrieval-Augmented Generation (RAG), while the team approach relies on multi-agent collaboration through voting, cooperation, and debate strategies. Extensive experiments demonstrate that the Supervisor Alignment framework, combining the query-ignoring strategy with multi-agent collaboration, effectively defends against adversarial prompts and improves performance on value alignment tasks.
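The paper's abstract does not include an implementation, but the two core ideas (the query-ignoring strategy and voting-based supervisor teams) can be sketched in a few lines. The following minimal Python sketch is an illustrative assumption, not the authors' code: the function `supervise_by_vote`, the prompt template, and the stand-in supervisor agents are all hypothetical names introduced here. The key point it demonstrates is that supervisors are shown only the model's candidate response, never the user query.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical prompt for a supervisor agent. It deliberately contains
# only the candidate response, never the user's query, so adversarial
# prompt content cannot reach the supervisor (the query-ignoring strategy).
SUPERVISOR_PROMPT = (
    "You are a safety supervisor. Judge ONLY the response below.\n"
    "Reply with exactly SAFE or UNSAFE.\n\nResponse:\n{response}"
)


def supervise_by_vote(
    response: str,
    supervisors: List[Callable[[str], str]],
) -> bool:
    """Return True if a majority of supervisors judge the response safe.

    Each supervisor is a callable mapping a prompt to a verdict string;
    in practice each would wrap a separate LLM agent.
    """
    prompt = SUPERVISOR_PROMPT.format(response=response)
    votes = Counter(s(prompt).strip().upper() for s in supervisors)
    return votes["SAFE"] > votes["UNSAFE"]


if __name__ == "__main__":
    # Trivial keyword-based stand-in for a real LLM supervisor agent.
    def naive_supervisor(prompt: str) -> str:
        return "UNSAFE" if "bomb" in prompt.lower() else "SAFE"

    team = [naive_supervisor] * 3
    print(supervise_by_vote("Here is a recipe for pancakes.", team))  # True
    print(supervise_by_vote("Step 1: build a bomb ...", team))        # False
```

The cooperation and debate strategies mentioned in the abstract would replace the independent vote with message exchange among supervisors before a final verdict; only the simplest voting variant is sketched here.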
| Field | Value |
|---|---|
| Original language | English |
| Journal | Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing |
| State | Published - 2025 |
| Event | 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025), Hyderabad, India, 6 Apr 2025 → 11 Apr 2025 |
Keywords
- AI Alignment
- Adversarial prompts
- Large Language Model
- Security