As an important area in computer vision, object tracking has formed two
separate communities that respectively study Single Object Tracking (SOT) and
Multiple Object Tracking (MOT). However, methods developed for one tracking
scenario are not easily adapted to the other, because the two tasks differ in
their training datasets and tracked objects. Although UniTrack
\cite{wang2021different} demonstrates that a shared appearance model with
multiple heads can be used to tackle individual tracking tasks, it fails to
exploit the large-scale tracking datasets for training and performs poorly on
single object tracking. In this work, we present the Unified Transformer
Tracker (UTT) to address tracking problems in different scenarios with one
paradigm. A track transformer is developed in our UTT to track the target in
both SOT and MOT. The correlation between the target and tracking frame
features is exploited to localize the target. We demonstrate that both SOT and
MOT tasks can be solved within this framework. The model can be trained
end-to-end on both tasks by alternately optimizing the SOT and MOT objectives
on the datasets of the individual tasks. Extensive experiments are conducted on
several benchmarks with a unified model trained on SOT and MOT datasets. Code
will be available at https://github.com/Flowerfan/Trackron.

Comment: CVPR 202
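
The alternating training scheme mentioned in the abstract can be illustrated with a toy sketch. This is not the UTT implementation: a single scalar `w` stands in for the shared model, and two quadratic losses with hypothetical targets stand in for the SOT and MOT objectives. The function names, targets, learning rate, and step count are all assumptions chosen for illustration.

```python
# Toy illustration only: one shared parameter stands in for the model, and
# two quadratic losses stand in for the SOT and MOT objectives (assumptions).

def grad_sot(w, target=1.0):
    """Gradient of the hypothetical SOT loss (w - target) ** 2."""
    return 2.0 * (w - target)


def grad_mot(w, target=3.0):
    """Gradient of the hypothetical MOT loss (w - target) ** 2."""
    return 2.0 * (w - target)


def train_alternating(steps=200, lr=0.1, w=0.0):
    """Alternate SOT and MOT gradient steps on the shared parameter w."""
    for step in range(steps):
        # Even steps use the SOT objective, odd steps use the MOT objective.
        grad = grad_sot(w) if step % 2 == 0 else grad_mot(w)
        w -= lr * grad
    return w


if __name__ == "__main__":
    # The shared parameter settles between the two task-specific optima.
    print(train_alternating())
```

In a real setting, the scalar would be replaced by the network parameters and each step would draw a batch from the corresponding task's dataset; the sketch only shows the alternation pattern over a shared set of parameters.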