Unsupervised learning is a challenging task due to the lack of labels.
Multiple Object Tracking (MOT), which inevitably suffers from mutual object
interference, occlusion, etc., is even more difficult without label
supervision. In this paper, we explore the latent consistency of sample
features across video frames and propose an Unsupervised Contrastive Similarity
Learning method, named UCSL, which comprises three contrast modules: self-contrast,
cross-contrast, and ambiguity contrast. Specifically, i) self-contrast uses
intra-frame direct and inter-frame indirect contrast to obtain discriminative
representations by maximizing self-similarity; ii) cross-contrast aligns cross-
and continuous-frame matching results, mitigating the persistent negative
effect caused by object occlusion; and iii) ambiguity contrast matches
ambiguous objects with each other to further increase the certainty of
subsequent object association in an implicit manner. On existing
benchmarks, our method outperforms existing unsupervised methods while using
only limited help from a ReID head, and even achieves higher accuracy than many
fully supervised methods.
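
As a rough illustration of the self-contrast idea described above, the
following Python sketch scores intra-frame (direct) and inter-frame (indirect)
self-similarity. It is not the UCSL implementation: the softmax temperature,
the cycle-style product of the two cross-frame similarity matrices, the
negative-log-diagonal loss, and all function names are assumptions chosen for
illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def row_softmax(rows, temperature):
    """Apply a temperature-scaled softmax to each row of a matrix."""
    out = []
    for row in rows:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp((x - m) / temperature) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def similarity(a, b, temperature=0.1):
    """Row-stochastic cosine similarity between two sets of embeddings."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    cos = [[sum(x * y for x, y in zip(u, v)) for v in b] for u in a]
    return row_softmax(cos, temperature)


def matmul(p, q):
    """Plain matrix product of two list-of-lists matrices."""
    return [
        [sum(p[i][k] * q[k][j] for k in range(len(q))) for j in range(len(q[0]))]
        for i in range(len(p))
    ]


def self_similarity_loss(s):
    """Mean negative log of the diagonal: low when each object matches itself."""
    return -sum(math.log(max(s[i][i], 1e-12)) for i in range(len(s))) / len(s)


def self_contrast_loss(frame_a, frame_b, temperature=0.1):
    """Assumed objective: direct (A to A) plus indirect (A to B to A) terms."""
    direct = similarity(frame_a, frame_a, temperature)
    indirect = matmul(
        similarity(frame_a, frame_b, temperature),
        similarity(frame_b, frame_a, temperature),
    )
    return self_similarity_loss(direct) + self_similarity_loss(indirect)


if __name__ == "__main__":
    # Hypothetical 2-D embeddings of two objects in two adjacent frames.
    frame_a = [[1.0, 0.0], [0.0, 1.0]]
    frame_b = [[0.9, 0.1], [0.1, 0.9]]
    print(self_contrast_loss(frame_a, frame_b))
```

With well-separated embeddings the loss is close to zero, while near-identical
embeddings for different objects push it up, which is the behavior a
self-similarity objective is meant to reward.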