
Improving Cross-Modal Recipe Retrieval with Component-Aware Prompted CLIP Embedding

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

Abstract

Cross-modal recipe retrieval is an emerging visual-textual retrieval task that aims to match food images with their corresponding recipes. Although large-scale Vision-Language Pre-training (VLP) models have achieved impressive performance on a wide range of downstream tasks, they still perform unsatisfactorily on this cross-modal retrieval task because of two problems: (1) Features from food images and recipes need to be aligned, but simply fine-tuning the pre-trained VLP model's image encoder does not explicitly help with this goal. (2) The text content in a recipe is more structured than the text captions in the VLP model's pre-training corpus, which prevents the VLP model from adapting to the recipe retrieval task. In this paper, we propose a Component-aware Instance-specific Prompt learning (CIP) model that fully exploits the ability of large-scale VLP models. CIP enables us to learn the structured recipe information and therefore allows visual-textual representations to be aligned without fine-tuning. Furthermore, we construct a recipe encoder termed Adaptive Recipe Merger (ARM) based on hierarchical Transformers, encouraging the model to learn more effective recipe representations. Extensive experiments on the public Recipe1M dataset demonstrate the superiority of our proposed method, which outperforms state-of-the-art methods on the cross-modal recipe retrieval task.

Original language: English
Title of host publication: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 529-537
Number of pages: 9
ISBN (electronic): 9798400701085
DOI
Publication status: Published - 27 Oct 2023
Event: 31st ACM International Conference on Multimedia, MM 2023 - Ottawa, Canada
Duration: 29 Oct 2023 to 3 Nov 2023

Publication series

Name: MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia

Conference

Conference: 31st ACM International Conference on Multimedia, MM 2023
Country/Territory: Canada
City: Ottawa
Period: 29/10/23 to 3/11/23
