Image Alone Are Not Enough: A General Semantic-Augmented Transformer-Based Framework for Image Captioning

Jiawei Liu, Xin Lin*, Liang He

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

1 Scopus citation

Abstract

Image captioning has long been widely regarded as a modal transformation task from visual to linguistic modality. Most current research focuses on the information transformation between single modalities dominated by visual features, while less attention is paid to the interaction between visual features and linguistic features. This rigid single-modal conversion method is prone to information confusion and loss during the conversion process, making it difficult for the model to generate accurate and detailed captions. In this paper, we propose a general Semantic-Augmented Transformer-Based (SAT) framework to facilitate smoother transformation between the two modalities. In the encoding stage, we use the fine-grained description of each region to fuse with the corresponding image features to make the image feature representation closer to the text feature representation. In the decoding stage, the caption's part-of-speech information is used as prior knowledge to constrain the model to pay more attention to the details in the image rather than only to the prominent entities for fine-grained captions. We extensively evaluate our framework on various state-of-the-art transformer-based models. Experiments demonstrate that these models have superior performance on the MS-COCO dataset under our framework.

Original language: English
Title of host publication: IJCNN 2023 - International Joint Conference on Neural Networks, Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781665488679
State: Published - 2023
Event: 2023 International Joint Conference on Neural Networks, IJCNN 2023 - Gold Coast, Australia
Duration: 18 Jun 2023 - 23 Jun 2023

Publication series

Name: Proceedings of the International Joint Conference on Neural Networks
Volume: 2023-June

Conference

Conference: 2023 International Joint Conference on Neural Networks, IJCNN 2023
Country/Territory: Australia
City: Gold Coast
Period: 18/06/23 - 23/06/23

Keywords

  • Part-Of-Speech
  • Transformer
  • image captioning
  • visual-linguistic fusion
