Overcoming Feature Contamination by Unidirectional Information Modeling for Vision-Language Tracking

Jingchao Wang, Zhijian Wu, Wenlong Zhang, Wenhui Liu, Jianwei Zhang, Dingjiang Huang*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Benefiting from the advantages of multi-modal learning, Vision-Language Tracking shows greater potential than Visual Tracking. Existing work uses one-stream structures to fuse vision and language features, which lets noise from the search region propagate into the language features. This contamination weakens the guidance provided by the language information and consequently limits the robustness of the tracking model. To solve this problem, we propose Unidirectional Information modeling (UITracker), which explicitly fuses the language and visual features for Vision-Language Tracking. Specifically, we introduce a plug-and-play lightweight modal adapter that unidirectionally injects language guidance into the visual template and search region across all layers. This allows the tracker to make full use of rich semantic information while avoiding contamination of the language features during feature interaction. Extensive ablation studies demonstrate the superiority and effectiveness of UITracker. Code and raw results are available at https://github.com/jcwang0602/UITrack.
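
The unidirectional injection described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the adapter's internals, so the bottleneck cross-attention design, the class name `UnidirectionalAdapter`, and all dimensions below are assumptions chosen for illustration. The only property taken from the abstract is the direction of information flow: visual tokens (template and search region) read from the language tokens at every layer, while the language tokens are never updated from visual features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class UnidirectionalAdapter:
    """Hypothetical lightweight adapter (an assumption, not the paper's design).

    Visual tokens act as queries; language tokens act as keys and values.
    Only the visual tokens are updated, so no visual noise can flow back
    into the language features.
    """

    def __init__(self, dim, bottleneck, rng):
        scale = 1.0 / np.sqrt(dim)
        self.w_q = rng.normal(0.0, scale, (dim, bottleneck))
        self.w_k = rng.normal(0.0, scale, (dim, bottleneck))
        self.w_v = rng.normal(0.0, scale, (dim, bottleneck))
        self.w_up = rng.normal(0.0, 0.02, (bottleneck, dim))

    def __call__(self, visual, language):
        q = visual @ self.w_q        # (n_visual, bottleneck)
        k = language @ self.w_k      # (n_lang, bottleneck)
        v = language @ self.w_v      # (n_lang, bottleneck)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_visual, n_lang)
        # Residual update of the visual tokens only.
        return visual + (attn @ v) @ self.w_up


rng = np.random.default_rng(0)
dim, bottleneck, num_layers = 32, 8, 3   # illustrative sizes, not from the paper

language = rng.normal(size=(6, dim))     # language tokens (kept fixed)
template = rng.normal(size=(16, dim))    # visual template tokens
search = rng.normal(size=(64, dim))      # search-region tokens
language_before = language.copy()

adapters = [UnidirectionalAdapter(dim, bottleneck, rng) for _ in range(num_layers)]
for adapter in adapters:
    # Inject language guidance into both visual streams at every layer.
    template = adapter(template, language)
    search = adapter(search, language)
    # A real tracker would apply its visual backbone layer here (omitted).

# The language features are untouched by the visual streams.
print(np.array_equal(language, language_before))
```

In a one-stream design, by contrast, language and visual tokens would be concatenated and passed through joint self-attention, so the language tokens would also be rewritten as a function of the search region; keeping `language` read-only in the loop above is what makes the flow unidirectional in this sketch.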

Original language: English
Title of host publication: 2025 IEEE International Conference on Multimedia and Expo
Subtitle of host publication: Journey to the Center of Machine Imagination, ICME 2025 - Conference Proceedings
Publisher: IEEE Computer Society
ISBN (Electronic): 9798331594954
DOIs
State: Published - 2025
Event: 2025 IEEE International Conference on Multimedia and Expo, ICME 2025 - Nantes, France
Duration: 30 Jun 2025 to 4 Jul 2025

Publication series

Name: Proceedings - IEEE International Conference on Multimedia and Expo
ISSN (Print): 1945-7871
ISSN (Electronic): 1945-788X

Conference

Conference: 2025 IEEE International Conference on Multimedia and Expo, ICME 2025
Country/Territory: France
City: Nantes
Period: 30/06/25 to 4/07/25

Keywords

  • object tracking
  • vision-language tracking
