TY - GEN
T1 - BELLE
T2 - 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
AU - Zhang, Taolin
AU - Li, Dongyang
AU - Chen, Qizhou
AU - Wang, Chengyu
AU - He, Xiaofeng
N1 - Publisher Copyright:
© 2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
N2 - Multi-hop question answering (QA) involves finding multiple relevant passages and performing step-by-step reasoning to answer complex questions. Previous works on multi-hop QA employ specific methods from different modeling perspectives based on large language models (LLMs), regardless of question type. In this paper, we first conduct an in-depth analysis of public multi-hop QA benchmarks, categorizing questions into four types and evaluating five types of cutting-edge methods: Chain-of-Thought (CoT), Single-step, Iterative-step, Sub-step, and Adaptive-step. We find that different types of multi-hop questions exhibit varying degrees of sensitivity to different types of methods. Thus, we propose a Bi-levEL muLti-agEnt reasoning (BELLE) framework to address multi-hop QA by specifically focusing on the correspondence between question types and methods, with each type of method regarded as an “operator” implemented by prompting LLMs differently. The first level of BELLE includes multiple agents that debate to formulate an executable plan of combined “operators” to address the multi-hop QA task comprehensively. During the debate, in addition to the basic roles of affirmative debater, negative debater, and judge, at the second level we further leverage fast and slow debaters to monitor whether changes in viewpoints are reasonable. Extensive experiments demonstrate that BELLE significantly outperforms strong baselines on various datasets. Additionally, BELLE's model consumption is more cost-effective than that of single models in more complex multi-hop QA scenarios.
AB - Multi-hop question answering (QA) involves finding multiple relevant passages and performing step-by-step reasoning to answer complex questions. Previous works on multi-hop QA employ specific methods from different modeling perspectives based on large language models (LLMs), regardless of question type. In this paper, we first conduct an in-depth analysis of public multi-hop QA benchmarks, categorizing questions into four types and evaluating five types of cutting-edge methods: Chain-of-Thought (CoT), Single-step, Iterative-step, Sub-step, and Adaptive-step. We find that different types of multi-hop questions exhibit varying degrees of sensitivity to different types of methods. Thus, we propose a Bi-levEL muLti-agEnt reasoning (BELLE) framework to address multi-hop QA by specifically focusing on the correspondence between question types and methods, with each type of method regarded as an “operator” implemented by prompting LLMs differently. The first level of BELLE includes multiple agents that debate to formulate an executable plan of combined “operators” to address the multi-hop QA task comprehensively. During the debate, in addition to the basic roles of affirmative debater, negative debater, and judge, at the second level we further leverage fast and slow debaters to monitor whether changes in viewpoints are reasonable. Extensive experiments demonstrate that BELLE significantly outperforms strong baselines on various datasets. Additionally, BELLE's model consumption is more cost-effective than that of single models in more complex multi-hop QA scenarios.
UR - https://www.scopus.com/pages/publications/105021023696
M3 - Conference contribution
AN - SCOPUS:105021023696
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 4184
EP - 4202
BT - Long Papers
A2 - Che, Wanxiang
A2 - Nabende, Joyce
A2 - Shutova, Ekaterina
A2 - Pilehvar, Mohammad Taher
PB - Association for Computational Linguistics (ACL)
Y2 - 27 July 2025 through 1 August 2025
ER -