Self-supervised learning (SSL) methods targeting scene images have grown
rapidly in recent years, and most rely on either a dedicated dense
matching mechanism or a costly unsupervised object discovery module. This paper
shows that, instead of relying on these demanding operations, quality image
representations can be learned by treating scene/multi-label image SSL simply
as a multi-label classification problem, which greatly simplifies the learning
framework. Specifically, multiple binary pseudo-labels are assigned to each
input image by comparing its embeddings with those in two dictionaries, and the
network is optimized using the binary cross entropy loss. The proposed method
is named Multi-Label Self-supervised learning (MLS). Visualizations
qualitatively show that the pseudo-labels produced by MLS automatically find
semantically similar pseudo-positive pairs across different images to
facilitate contrastive learning. MLS learns high-quality representations on
MS-COCO and achieves state-of-the-art results on classification, detection and
segmentation benchmarks. At the same time, MLS is much simpler than existing
methods, making it easier to deploy and to explore further.

Comment: ICCV202
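
The pseudo-labeling step described in the abstract (compare an image's embeddings with dictionary entries, assign binary pseudo-labels, train with binary cross entropy) can be illustrated with a minimal, dependency-free sketch. Everything beyond that one-sentence description is an assumption made for illustration and is not taken from the paper: the top-k rule for choosing pseudo-positives, the use of one view to produce labels and another to produce logits, cosine similarity, the `temperature` value, averaging over the dictionaries, and all function names.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def cosine_scores(embedding, dictionary):
    """Cosine similarity between one embedding and every dictionary entry."""
    e = normalize(embedding)
    return [sum(a * b for a, b in zip(e, normalize(key))) for key in dictionary]


def pseudo_labels(embedding, dictionary, top_k):
    """Assumed rule: the top_k most similar entries become pseudo-positives."""
    scores = cosine_scores(embedding, dictionary)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    positives = set(ranked[:top_k])
    return [1.0 if i in positives else 0.0 for i in range(len(scores))]


def bce_with_logits(logits, targets):
    """Numerically stable binary cross entropy, averaged over all labels."""
    total = 0.0
    for x, t in zip(logits, targets):
        total += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def multi_label_loss(view_a, view_b, dictionaries, top_k=1, temperature=0.2):
    """Hypothetical loss: labels come from view_a, logits from view_b.

    The loss is averaged over the given dictionaries (two in the abstract).
    """
    losses = []
    for dictionary in dictionaries:
        targets = pseudo_labels(view_a, dictionary, top_k)
        logits = [s / temperature for s in cosine_scores(view_b, dictionary)]
        losses.append(bce_with_logits(logits, targets))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    dict_one = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    dict_two = [[0.8, 0.2], [0.1, 0.9], [-0.7, -0.3]]
    print(pseudo_labels([0.9, 0.1], dict_one, top_k=1))
    print(multi_label_loss([1.0, 0.0], [1.0, 0.0], [dict_one, dict_two]))
```

Under these assumptions, each dictionary entry acts as one binary class, so an image can carry several positive labels at once. This is what distinguishes a multi-label formulation from a single-positive contrastive objective.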