TY - GEN
T1 - Fine-Grained Scene Image Classification with Modality-Agnostic Adapter
AU - Wang, Yiqun
AU - Zhou, Zhao
AU - Du, Xiangcheng
AU - Wu, Xingjiao
AU - Zheng, Yingbin
AU - Jin, Cheng
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - In fine-grained scene image classification, most previous works place heavy emphasis on global visual features when performing multi-modal feature fusion. In other words, models are deliberately designed around prior intuitions about the importance of different modalities. In this paper, we present a new multi-modal feature fusion approach named MAA (Modality-Agnostic Adapter), which aims to let the model adaptively learn the importance of different modalities in different cases, without fixing a prior in the model architecture. More specifically, we eliminate the modal differences in distribution and then use a modality-agnostic Transformer encoder for semantic-level feature fusion. Our experiments demonstrate that MAA achieves state-of-the-art results on benchmarks using the same modalities as previous methods. Moreover, new modalities can be easily added when using MAA to further boost performance.
KW - Fine-grained scene image classification
KW - Transformer encoder
KW - multimodal feature fusion
UR - https://www.scopus.com/pages/publications/85206577176
U2 - 10.1109/ICME57554.2024.10687875
DO - 10.1109/ICME57554.2024.10687875
M3 - Conference contribution
AN - SCOPUS:85206577176
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
PB - IEEE Computer Society
T2 - 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
Y2 - 15 July 2024 through 19 July 2024
ER -