TY - GEN
T1 - Overcoming Feature Contamination by Unidirectional Information Modeling for Vision-Language Tracking
AU - Wang, Jingchao
AU - Wu, Zhijian
AU - Zhang, Wenlong
AU - Liu, Wenhui
AU - Zhang, Jianwei
AU - Huang, Dingjiang
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Benefiting from the advantages of multi-modal learning, Vision-Language Tracking shows greater potential than Visual Tracking. Existing work utilizes one-stream structures to fuse vision and language features, resulting in noise propagation from the search region into the language features. This contamination weakens the guidance of language information, consequently limiting the robustness of the tracking model. To solve this problem, we propose UITracker, a Unidirectional Information modeling approach that explicitly fuses language and visual features for Vision-Language Tracking. Specifically, we introduce a plug-and-play lightweight modal adapter to unidirectionally inject language guidance into the visual template and search region across all layers. This allows the tracker to make full use of rich semantic information while overcoming language feature contamination in the feature interaction process. Extensive ablation studies demonstrate the superiority and effectiveness of our UITracker. Code and raw results are available at https://github.com/jcwang0602/UITrack.
AB - Benefiting from the advantages of multi-modal learning, Vision-Language Tracking shows greater potential than Visual Tracking. Existing work utilizes one-stream structures to fuse vision and language features, resulting in noise propagation from the search region into the language features. This contamination weakens the guidance of language information, consequently limiting the robustness of the tracking model. To solve this problem, we propose UITracker, a Unidirectional Information modeling approach that explicitly fuses language and visual features for Vision-Language Tracking. Specifically, we introduce a plug-and-play lightweight modal adapter to unidirectionally inject language guidance into the visual template and search region across all layers. This allows the tracker to make full use of rich semantic information while overcoming language feature contamination in the feature interaction process. Extensive ablation studies demonstrate the superiority and effectiveness of our UITracker. Code and raw results are available at https://github.com/jcwang0602/UITrack.
KW - object tracking
KW - vision-language tracking
UR - https://www.scopus.com/pages/publications/105022641246
U2 - 10.1109/ICME59968.2025.11209477
DO - 10.1109/ICME59968.2025.11209477
M3 - Conference contribution
AN - SCOPUS:105022641246
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - 2025 IEEE International Conference on Multimedia and Expo
PB - IEEE Computer Society
T2 - 2025 IEEE International Conference on Multimedia and Expo, ICME 2025
Y2 - 30 June 2025 through 4 July 2025
ER -