Visible-modal object tracking gives rise to a series of downstream
multi-modal tracking branches. To inherit the powerful representations of
the foundation model, a natural approach for multi-modal tracking is full
fine-tuning of the RGB-based parameters. Although effective, this approach is
not optimal due to the scarcity of downstream data, poor transferability, and
other issues.
In this paper, inspired by the recent success of prompt learning in
language models, we develop Visual Prompt multi-modal Tracking (ViPT), which
learns modal-relevant prompts to adapt the frozen pre-trained foundation
model to various downstream multi-modal tracking tasks. ViPT finds a better way
to stimulate the knowledge of the RGB-based model that is pre-trained at scale,
while introducing only a few trainable parameters (less than 1% of the model
parameters). ViPT outperforms the full fine-tuning paradigm on multiple
downstream tracking tasks, including RGB+Depth, RGB+Thermal, and RGB+Event
tracking. Extensive experiments show the potential of visual prompt learning
for multi-modal tracking, and ViPT achieves state-of-the-art performance
while maintaining parameter efficiency. Code and models are available at
https://github.com/jiawen-zhu/ViPT.
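
To make the parameter-efficiency claim concrete, the following is a minimal sketch of the general prompt-tuning recipe the abstract describes: freeze a pre-trained backbone and train only a small set of prompt parameters. It is not the actual ViPT architecture (which fuses an auxiliary modality through dedicated prompt modules); the class and parameter names (PromptTuner, prompt_len, the stand-in transformer backbone) are hypothetical and for illustration only.

```python
import torch
import torch.nn as nn

class PromptTuner(nn.Module):
    """Illustrative prompt tuning: frozen backbone + a few learnable prompt tokens.

    This is a generic sketch, not the ViPT model; all names are hypothetical.
    """

    def __init__(self, backbone: nn.Module, embed_dim: int = 768, prompt_len: int = 8):
        super().__init__()
        self.backbone = backbone
        # Freeze the pre-trained foundation model: none of its weights are updated.
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Learnable prompt tokens are the only new parameters (a tiny fraction
        # of the backbone's size, mirroring the "<1% trainable parameters" idea).
        self.prompts = nn.Parameter(torch.zeros(1, prompt_len, embed_dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)

    def forward(self, rgb_tokens: torch.Tensor) -> torch.Tensor:
        # Prepend the prompt tokens to the input token sequence of the frozen backbone.
        b = rgb_tokens.size(0)
        x = torch.cat([self.prompts.expand(b, -1, -1), rgb_tokens], dim=1)
        return self.backbone(x)


if __name__ == "__main__":
    # Stand-in "foundation model": a small transformer encoder over token sequences.
    backbone = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
        num_layers=2,
    )
    model = PromptTuner(backbone)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable} / {total} ({100 * trainable / total:.3f}%)")
```

Running the sketch prints the fraction of trainable parameters, which stays well below 1% of the total, since only the prompt tokens receive gradients while the frozen backbone is reused as-is.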