Siamese Natural Language Tracker: Tracking by Natural Language Descriptions with Siamese Trackers
We propose a novel Siamese Natural Language Tracker (SNLT), which brings the
advancements in visual tracking to the tracking by natural language (NL)
descriptions task. The proposed SNLT is applicable to a wide range of Siamese
trackers, providing a new class of baselines for the tracking by NL task and
promising future improvements from the advancements of Siamese trackers. The
carefully designed architecture of the Siamese Natural Language Region Proposal
Network (SNL-RPN), together with the Dynamic Aggregation of vision and language
modalities, is introduced to perform the tracking by NL task. Empirical results
over tracking benchmarks with NL annotations show that the proposed SNLT
improves Siamese trackers by 3 to 7 percentage points at a slight cost in
speed. The proposed SNLT outperforms all NL trackers to date and is competitive
among state-of-the-art real-time trackers on LaSOT benchmarks while running at
50 frames per second on a single GPU.
Comment: CVPR 202
Towards More Flexible and Accurate Object Tracking with Natural Language: Algorithms and Benchmark
Tracking by natural language specification is an emerging research topic
that aims at locating the target object in the video sequence based on its
language description. Compared with traditional bounding box (BBox) based
tracking, this setting guides object tracking with high-level semantic
information, addresses the ambiguity of BBox, and links local and global search
organically together. These benefits may bring more flexible, robust and
accurate tracking performance in practical scenarios. However, existing natural
language initialized trackers are developed and compared on benchmark datasets
proposed for tracking-by-BBox, which cannot reflect the true power of
tracking-by-language. In this work, we propose a new benchmark specifically
dedicated to tracking-by-language, including a large-scale dataset and strong,
diverse baseline methods. Specifically, we collect 2k video sequences
(containing a total of 1,244,340 frames and 663 words) and split them 1300/700
for training/testing, respectively. We densely annotate one sentence in English and
corresponding bounding boxes of the target object for each video. We also
introduce two new challenges into TNL2K for the object tracking task, i.e.,
adversarial samples and modality switch. A strong baseline method based on an
adaptive local-global-search scheme is proposed for future work to compare
against. We believe this benchmark will greatly boost related research on natural
language guided tracking.
Comment: Accepted by CVPR 202