Towards Reasoning Ability in Scene Text Visual Question Answering

Qingqing Wang, Liqiang Xiao, Yue Lu, Yaohui Jin*, Hao He

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

15 Scopus citations

Abstract

Work on scene text visual question answering (TextVQA) consistently emphasizes the importance of reasoning over questions and image contents. However, we find that current TextVQA models lack reasoning ability and tend to answer questions by exploiting dataset bias and language priors. Moreover, our observations indicate that recent accuracy improvements in TextVQA come mainly from stronger OCR engines, better pre-training strategies, and more Transformer layers, rather than from newly proposed networks. In this work, towards reasoning ability, we 1) conduct a module-wise contribution analysis to quantitatively investigate how existing works improve accuracy in TextVQA; 2) design a gradient-based explainability method to explore why TextVQA models answer what they answer and to find evidence for their predictions; and 3) perform qualitative experiments to visually analyze models' reasoning ability and explore potential reasons why it is so poor.

Original language: English
Title of host publication: MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 2281-2289
Number of pages: 9
ISBN (Electronic): 9781450386517
DOIs
State: Published - 17 Oct 2021
Event: 29th ACM International Conference on Multimedia, MM 2021 - Virtual, Online, China
Duration: 20 Oct 2021 to 24 Oct 2021

Publication series

Name: MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia

Conference

Conference: 29th ACM International Conference on Multimedia, MM 2021
Country/Territory: China
City: Virtual, Online
Period: 20/10/21 to 24/10/21

Keywords

  • TextVQA
  • explainability method
  • quantitative and qualitative analysis
  • reasoning ability
