TY - JOUR
T1 - ACRA
T2 - An adaptive chain retrieval architecture for multi-modal knowledge-Augmented visual question answering
AU - Zhang, Zihao
AU - Yang, Shuwen
AU - Wu, Xingjiao
AU - Zhao, Jiabao
AU - Chen, Qin
AU - Yang, Jing
AU - He, Liang
N1 - Publisher Copyright:
© 2025 Elsevier B.V.
PY - 2026/2/15
Y1 - 2026/2/15
N2 - Visual question answering (VQA) in knowledge-intensive scenarios requires integrating external knowledge to bridge the semantic gap between shallow linguistic queries and complex reasoning requirements. However, existing methods typically rely on single-hop retrieval strategies, which are prone to overlooking intermediate facts essential for accurate reasoning. To address this limitation, we propose the adaptive chain retrieval architecture (ACRA), a novel multi-hop retrieval framework based on large-model-generated evidence chain annotations. ACRA constructs structured reasoning paths by progressively selecting key evidence nodes using an adaptive matching mechanism based on an encoder-only transformer. To improve evidence discrimination, we design a hybrid loss optimization strategy that incorporates dynamically mined hard negatives, combining binary cross-entropy and margin-based ranking loss. Furthermore, we introduce a depth-aware adaptive beam search algorithm that models evidence retrieval as a sequential process, gradually increasing the matching threshold with search depth to suppress irrelevant content while maintaining logical coherence. We evaluate ACRA on the WebQA and MultimodalQA benchmarks. ACRA achieves 55.4% QA accuracy and a 90.2% F1 score on WebQA, and 78.8% EM and 82.4% F1 on MultimodalQA. Experimental results show that ACRA consistently outperforms state-of-the-art baselines in terms of retrieval accuracy and reasoning consistency, demonstrating its effectiveness in mitigating cognitive biases and improving multi-hop reasoning in VQA tasks.
AB - Visual question answering (VQA) in knowledge-intensive scenarios requires integrating external knowledge to bridge the semantic gap between shallow linguistic queries and complex reasoning requirements. However, existing methods typically rely on single-hop retrieval strategies, which are prone to overlooking intermediate facts essential for accurate reasoning. To address this limitation, we propose the adaptive chain retrieval architecture (ACRA), a novel multi-hop retrieval framework based on large-model-generated evidence chain annotations. ACRA constructs structured reasoning paths by progressively selecting key evidence nodes using an adaptive matching mechanism based on an encoder-only transformer. To improve evidence discrimination, we design a hybrid loss optimization strategy that incorporates dynamically mined hard negatives, combining binary cross-entropy and margin-based ranking loss. Furthermore, we introduce a depth-aware adaptive beam search algorithm that models evidence retrieval as a sequential process, gradually increasing the matching threshold with search depth to suppress irrelevant content while maintaining logical coherence. We evaluate ACRA on the WebQA and MultimodalQA benchmarks. ACRA achieves 55.4% QA accuracy and a 90.2% F1 score on WebQA, and 78.8% EM and 82.4% F1 on MultimodalQA. Experimental results show that ACRA consistently outperforms state-of-the-art baselines in terms of retrieval accuracy and reasoning consistency, demonstrating its effectiveness in mitigating cognitive biases and improving multi-hop reasoning in VQA tasks.
KW - Deep learning
KW - Document layout analysis
KW - Dynamic residual feature fusion
KW - Information extraction
KW - Information understanding
UR - https://www.scopus.com/pages/publications/105025193434
U2 - 10.1016/j.knosys.2025.115136
DO - 10.1016/j.knosys.2025.115136
M3 - Article
AN - SCOPUS:105025193434
SN - 0950-7051
VL - 334
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 115136
ER -