TY - JOUR
T1 - Cross-Modal Recipe Retrieval With Fine-Grained Prompting Alignment and Evidential Semantic Consistency
AU - Huang, Xu
AU - Liu, Jin
AU - Zhang, Zhizhong
AU - Xie, Yuan
AU - Tang, Yongqiang
AU - Zhang, Wensheng
AU - Cui, Xiaohui
N1 - Publisher Copyright:
© 1999-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Alignment between food images and their corresponding recipes is an emerging cross-modal representation learning task. In this task, each recipe is composed of three components, i.e., the food title, the ingredient list, and the cooking instructions, which calls for fine-grained alignment between the features of the two modalities. Existing methods usually aggregate the recipe into a global embedding and then align it with the global image embedding. Meanwhile, semantic classification is frequently used in these methods to regularize the embeddings of the two modalities. While these methods are efficient, two problems remain. (1) Forcing alignment between the global image and recipe embeddings may lose component-specific information. (2) The high diversity of food appearance leads to high uncertainty in the semantic classification of food images and recipes. To solve these problems, we propose a Fine-grained Prompting and Alignment (FPA) model to enhance feature extraction and provide more component-specific information for fine-grained alignment. Furthermore, to regularize the semantic information contained in the cross-modal features, we design an Evidential Semantic Consistency (ESC) loss to maintain cross-modal semantic consistency. We conduct comprehensive experiments on the benchmark Recipe1M dataset, and the state-of-the-art results on the cross-modal recipe retrieval task demonstrate the effectiveness of our method.
AB - Alignment between food images and their corresponding recipes is an emerging cross-modal representation learning task. In this task, each recipe is composed of three components, i.e., the food title, the ingredient list, and the cooking instructions, which calls for fine-grained alignment between the features of the two modalities. Existing methods usually aggregate the recipe into a global embedding and then align it with the global image embedding. Meanwhile, semantic classification is frequently used in these methods to regularize the embeddings of the two modalities. While these methods are efficient, two problems remain. (1) Forcing alignment between the global image and recipe embeddings may lose component-specific information. (2) The high diversity of food appearance leads to high uncertainty in the semantic classification of food images and recipes. To solve these problems, we propose a Fine-grained Prompting and Alignment (FPA) model to enhance feature extraction and provide more component-specific information for fine-grained alignment. Furthermore, to regularize the semantic information contained in the cross-modal features, we design an Evidential Semantic Consistency (ESC) loss to maintain cross-modal semantic consistency. We conduct comprehensive experiments on the benchmark Recipe1M dataset, and the state-of-the-art results on the cross-modal recipe retrieval task demonstrate the effectiveness of our method.
KW - Cross-modal recipe retrieval
KW - evidential deep learning
KW - prompt learning
UR - https://www.scopus.com/pages/publications/85189645759
U2 - 10.1109/TMM.2024.3384672
DO - 10.1109/TMM.2024.3384672
M3 - Article
AN - SCOPUS:85189645759
SN - 1520-9210
VL - 27
SP - 2783
EP - 2794
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -