Recent progress in multiple object tracking (MOT) has shown that a robust
similarity score is key to the success of trackers. A good similarity score is
expected to reflect multiple cues, e.g. appearance, location, and topology,
over a long period of time. However, these cues are heterogeneous, which makes
them hard to combine in a unified network. As a result, existing methods usually
encode them in separate networks or require a complex training approach. In
this paper, we present a unified framework for similarity measurement that
simultaneously encodes various cues and performs reasoning across both
spatial and temporal domains. We also study the feature representation of a
tracklet-object pair in depth, showing that a proper design of the pair features
can considerably strengthen trackers. The resulting approach is named spatial-temporal
relation networks (STRN). It runs in a feed-forward way and can be trained in
an end-to-end manner. It achieves state-of-the-art accuracy on all of the
MOT15-17 benchmarks under the public-detection and online settings.