TY - JOUR
T1 - A fine-grained vision and language representation framework with graph-based fashion semantic knowledge
AU - Ding, Huiming
AU - Wang, Sen
AU - Xie, Zhifeng
AU - Li, Mengtian
AU - Ma, Lizhuang
N1 - Publisher Copyright:
© 2023 Elsevier Ltd
PY - 2023/10
Y1 - 2023/10
N2 - Vision and language representation learning has been demonstrated to be an effective means of enhancing multimodal task performance. However, fashion-specific studies have predominantly focused on object-level features, which may fail to capture region-level visual features and the fine-grained correlations between words in fashion descriptions. To address these issues, we propose a novel framework for fine-grained vision and language representation in the fashion domain. Specifically, we construct a knowledge-dependency graph structure from fashion descriptions and aggregate it with word-level embeddings, which strengthens fashion semantic knowledge and yields fine-grained textual representations. Moreover, we fine-tune a region-aware fashion segmentation network to capture region-level visual features, and then introduce local vision and language contrastive learning to pull the fine-grained textual representations closer to the region-level visual features of the same garment. Extensive experiments on downstream tasks, including cross-modal retrieval, category/subcategory recognition, and text-guided image retrieval, demonstrate the superiority of our method over state-of-the-art methods.
AB - Vision and language representation learning has been demonstrated to be an effective means of enhancing multimodal task performance. However, fashion-specific studies have predominantly focused on object-level features, which may fail to capture region-level visual features and the fine-grained correlations between words in fashion descriptions. To address these issues, we propose a novel framework for fine-grained vision and language representation in the fashion domain. Specifically, we construct a knowledge-dependency graph structure from fashion descriptions and aggregate it with word-level embeddings, which strengthens fashion semantic knowledge and yields fine-grained textual representations. Moreover, we fine-tune a region-aware fashion segmentation network to capture region-level visual features, and then introduce local vision and language contrastive learning to pull the fine-grained textual representations closer to the region-level visual features of the same garment. Extensive experiments on downstream tasks, including cross-modal retrieval, category/subcategory recognition, and text-guided image retrieval, demonstrate the superiority of our method over state-of-the-art methods.
KW - Contrastive learning
KW - Fashion semantic knowledge
KW - Graph neural network
KW - Vision and language representation
UR - https://www.scopus.com/pages/publications/85165328284
U2 - 10.1016/j.cag.2023.07.025
DO - 10.1016/j.cag.2023.07.025
M3 - Article
AN - SCOPUS:85165328284
SN - 0097-8493
VL - 115
SP - 216
EP - 225
JO - Computers and Graphics
JF - Computers and Graphics
ER -