A fine-grained vision and language representation framework with graph-based fashion semantic knowledge

Huiming Ding, Sen Wang, Zhifeng Xie, Mengtian Li, Lizhuang Ma

Research output: Contribution to journal › Article › peer-review

5 Scopus citations

Abstract

Vision and language representation learning has been shown to be an effective means of enhancing multimodal task performance. However, fashion-specific studies have predominantly focused on object-level features, which may fail to capture region-level visual features and to represent the fine-grained correlations between words in fashion descriptions. To address these issues, we propose a novel framework for fine-grained vision and language representation in the fashion domain. Specifically, we construct a knowledge-dependency graph from fashion descriptions and aggregate it with word-level embeddings, which strengthens fashion semantic knowledge and yields fine-grained textual representations. Moreover, we fine-tune a region-aware fashion segmentation network to capture region-level visual features, and introduce local vision and language contrastive learning that pulls the fine-grained textual representations closer to the region-level visual features of the same garment. Extensive experiments on downstream tasks, including cross-modal retrieval, category/subcategory recognition, and text-guided image retrieval, demonstrate the superiority of our method over state-of-the-art methods.
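As a concrete illustration of the two mechanisms named in the abstract, the following is a minimal PyTorch sketch, not the authors' published implementation: a single GCN-style aggregation layer over a word dependency graph, and a symmetric InfoNCE loss that pulls matched region-level visual and fine-grained textual features together. All names (DependencyGraphAggregator, local_contrastive_loss), the mean-over-neighbors aggregation, and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


class DependencyGraphAggregator(torch.nn.Module):
    """One GCN-style message-passing layer over a word dependency graph
    (illustrative stand-in for the paper's graph aggregation step)."""

    def __init__(self, dim):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, word_emb, adj):
        # word_emb: (N, D) word-level embeddings for one description
        # adj:      (N, N) dependency adjacency matrix (1 where words are linked)
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        msg = adj @ word_emb / deg                 # mean over dependency neighbors
        # Residual connection keeps each word's own embedding in the output
        return F.relu(self.proj(msg)) + word_emb


def local_contrastive_loss(region_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE between region-level visual features and
    fine-grained textual features; matched rows (same garment) are
    positives, all other rows in the batch are negatives.

    region_feats: (B, D) pooled region-level visual features
    text_feats:   (B, D) pooled fine-grained textual representations
    """
    # L2-normalize so dot products are cosine similarities
    v = F.normalize(region_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)

    logits = v @ t.T / temperature                 # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)

    # Cross-entropy in both directions: vision->text and text->vision
    loss_v2t = F.cross_entropy(logits, labels)
    loss_t2v = F.cross_entropy(logits.T, labels)
    return 0.5 * (loss_v2t + loss_t2v)
```

In such a setup, the adjacency matrix would come from a dependency parse of the fashion description, and the region features from the fine-tuned segmentation network; the other garments in a batch then serve as in-batch negatives for the contrastive loss.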

Original language: English
Pages (from-to): 216-225
Number of pages: 10
Journal: Computers and Graphics
Volume: 115
State: Published - Oct 2023
Externally published: Yes

Keywords

  • Contrastive learning
  • Fashion semantic knowledge
  • Graph neural network
  • Vision and language representation
