Despite recent progress in Multiple Object Tracking (MOT), several obstacles
such as occlusions, similar-looking objects, and complex scenes remain open
challenges. Meanwhile, a systematic study of the cost-performance trade-off for
the popular tracking-by-detection paradigm is still lacking. This paper
introduces SMILEtrack, an innovative object tracker that effectively addresses
these challenges by integrating an efficient object detector with a Siamese
network-based Similarity Learning Module (SLM). The technical contributions of
SMILEtrack are twofold. First, we propose an SLM that calculates the appearance
similarity between two objects, overcoming the limitations of feature
descriptors in Separate Detection and Embedding (SDE) models. The SLM
incorporates a Patch Self-Attention (PSA) block inspired by the Vision
Transformer, which generates reliable features for accurate similarity
matching. Second, we develop a Similarity Matching Cascade (SMC) module with a
novel GATE function for robust object matching across consecutive video frames,
further enhancing MOT performance. Together, these innovations help SMILEtrack
achieve an improved trade-off between cost (e.g., running speed) and
performance (e.g., tracking accuracy) over several existing state-of-the-art
trackers, including the popular BYTETrack method. SMILEtrack outperforms
BYTETrack by 0.4-0.8 MOTA and 2.1-2.2 HOTA points on the MOT17 and MOT20 datasets.
Code is available at https://github.com/pingyang1117/SMILEtrack_Officia