ACRA: An adaptive chain retrieval architecture for multi-modal knowledge-Augmented visual question answering

Research output: Contribution to journal › Article › peer-review

Abstract

Visual question answering (VQA) in knowledge-intensive scenarios requires integrating external knowledge to bridge the semantic gap between shallow linguistic queries and complex reasoning requirements. However, existing methods typically rely on single-hop retrieval strategies, which are prone to overlooking intermediate facts essential for accurate reasoning. To address this limitation, we propose the adaptive chain retrieval architecture (ACRA), a novel multi-hop retrieval framework based on large-model-generated evidence chain annotations. ACRA constructs structured reasoning paths by progressively selecting key evidence nodes using an adaptive matching mechanism based on an encoder-only transformer. To improve evidence discrimination, we design a hybrid loss optimization strategy that incorporates dynamically mined hard negatives, combining binary cross-entropy and margin-based ranking loss. Furthermore, we introduce a depth-aware adaptive beam search algorithm that models evidence retrieval as a sequential process, gradually increasing the matching threshold with search depth to suppress irrelevant content while maintaining logical coherence. We evaluate ACRA on the WebQA and MultimodalQA benchmarks, where it achieves 55.4% QA accuracy and 90.2% F1 score on WebQA, and 78.8% EM and 82.4% F1 on MultimodalQA. Experimental results show that ACRA consistently outperforms state-of-the-art baselines in retrieval accuracy and reasoning consistency, demonstrating its effectiveness in mitigating cognitive biases and improving multi-hop reasoning in VQA tasks.
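The abstract's hybrid loss (binary cross-entropy combined with a margin-based ranking term over mined hard negatives) can be illustrated with a minimal sketch. The `margin` and `alpha` weighting values below are illustrative assumptions, not hyperparameters reported by the paper, and the scalar positive/negative scores stand in for the encoder's evidence-matching outputs:

```python
import math


def hybrid_loss(pos_score: float, neg_score: float,
                margin: float = 0.2, alpha: float = 0.5) -> float:
    """Sketch of a BCE + margin-ranking hybrid loss.

    `pos_score` is the raw matching score (logit) of a gold evidence
    node; `neg_score` is the score of a dynamically mined hard negative.
    `margin` and `alpha` are illustrative, not values from the paper.
    """
    def sigmoid(x: float) -> float:
        return 1.0 / (1.0 + math.exp(-x))

    # Binary cross-entropy: the positive node should score toward 1,
    # the hard negative toward 0.
    bce = -(math.log(sigmoid(pos_score))
            + math.log(1.0 - sigmoid(neg_score))) / 2.0

    # Margin ranking: penalize whenever the positive fails to beat the
    # hard negative by at least `margin`.
    rank = max(0.0, margin - (pos_score - neg_score))

    # Weighted combination of the two objectives.
    return alpha * bce + (1.0 - alpha) * rank
```

A well-separated pair (positive scored above the negative by more than the margin) incurs only a small BCE penalty, while a mis-ranked pair is penalized by both terms, which is the intended pressure toward sharper evidence discrimination.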

Original language: English
Article number: 115136
Journal: Knowledge-Based Systems
Volume: 334
DOIs
State: Published - 15 Feb 2026

Keywords

  • Deep learning
  • Document layout analysis
  • Dynamic residual feature fusion
  • Information extraction
  • Information understanding
