While deep learning-based tracking methods have achieved substantial
progress, they require large-scale, high-quality annotated data for
sufficient training. To eliminate expensive and exhaustive annotation, we study
self-supervised learning for visual tracking. In this work, we develop the
Crop-Transform-Paste operation, which synthesizes sufficient training
data by simulating the appearance variations that occur during tracking,
including changes in object appearance and background interference. Since the target
state is known in all synthesized data, existing deep trackers can be trained
in routine ways using the synthesized data without human annotation. The
proposed target-aware data-synthesis method adapts existing tracking approaches
to a self-supervised learning framework without requiring algorithmic changes,
so it can be seamlessly integrated into existing tracking frameworks for
training. Extensive experiments
show that our method 1) achieves favorable performance against supervised
learning schemes when annotations are limited; 2) helps handle
various tracking challenges such as object deformation, occlusion, and
background clutter owing to its manipulability; 3) performs favorably against
state-of-the-art unsupervised tracking methods; and 4) boosts the performance of
various state-of-the-art supervised learning frameworks, including SiamRPN++,
DiMP, and TransT (based on Transformer).

Comment: 11 pages, 7 figures
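
For illustration, a minimal sketch of how a Crop-Transform-Paste style data synthesizer might look is given below. This is not the authors' implementation; the helper names, parameter ranges, and the use of a rough initial target box are assumptions made only for the sketch.

    # Illustrative sketch of a Crop-Transform-Paste style synthesizer (assumed helpers,
    # not the paper's code): crop the target, perturb its appearance, paste it back,
    # and return the synthesized frame together with its exactly known target box.
    import cv2
    import numpy as np

    def crop_target(frame, box):
        """Crop the target patch given a box (x, y, w, h)."""
        x, y, w, h = box
        return frame[y:y + h, x:x + w].copy()

    def transform_patch(patch, rng):
        """Simulate appearance variations: random scale, rotation, and blur."""
        scale = rng.uniform(0.7, 1.3)
        angle = rng.uniform(-20, 20)
        h, w = patch.shape[:2]
        patch = cv2.resize(patch, (max(1, int(w * scale)), max(1, int(h * scale))))
        h, w = patch.shape[:2]
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        patch = cv2.warpAffine(patch, M, (w, h), borderMode=cv2.BORDER_REFLECT)
        if rng.random() < 0.5:  # occasional blur to mimic motion blur
            patch = cv2.GaussianBlur(patch, (5, 5), 0)
        return patch

    def paste_patch(background, patch, rng):
        """Paste the transformed target at a random location; the new box is known exactly."""
        bh, bw = background.shape[:2]
        ph, pw = patch.shape[:2]
        x = int(rng.integers(0, max(1, bw - pw)))
        y = int(rng.integers(0, max(1, bh - ph)))
        out = background.copy()
        out[y:y + ph, x:x + pw] = patch
        return out, (x, y, pw, ph)  # synthesized frame + ground-truth box

    # Usage: synthesize one labeled training sample from a single unlabeled frame.
    rng = np.random.default_rng(0)
    frame = cv2.imread("frame.jpg")        # hypothetical input frame
    rough_box = (50, 40, 80, 120)          # assumed rough target box for the sketch
    target = transform_patch(crop_target(frame, rough_box), rng)
    train_img, train_box = paste_patch(frame, target, rng)

Because the pasted box is known by construction, the synthesized pair (train_img, train_box) can be fed to any existing tracker's training pipeline in the usual supervised manner, which is the property the abstract relies on.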