Despite recent progress in self-supervised representation learning with
residual networks, such methods still underperform supervised learning on
the ImageNet classification benchmark, limiting their applicability in
performance-critical settings. Building on prior theoretical insights from
ReLIC [Mitrovic et al., 2021], we incorporate additional inductive biases into
self-supervised learning. We propose a new self-supervised representation
learning method, ReLICv2, which combines an explicit invariance loss with a
contrastive objective over a varied set of appropriately constructed data views
to avoid learning spurious correlations and obtain more informative
representations. ReLICv2 achieves 77.1% top-1 accuracy on ImageNet under
linear evaluation on a ResNet50, an absolute improvement of +1.5% over the
previous state-of-the-art; on larger ResNet models, ReLICv2 achieves up to
80.6% top-1 accuracy, outperforming previous self-supervised approaches by
margins of up to +2.3%.
Most notably, ReLICv2 is the first unsupervised representation learning method
to consistently outperform the supervised baseline in a like-for-like
comparison over a range of ResNet architectures. Using ReLICv2, we also learn
more robust and transferable representations that generalize better
out-of-distribution than previous methods, on both image classification and
semantic segmentation. Finally, we show that despite using ResNet encoders,
ReLICv2 is comparable to state-of-the-art self-supervised vision transformers.
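
To make the combined objective concrete, below is a minimal sketch of a loss that pairs a contrastive (InfoNCE-style) term with an explicit invariance regularizer, in the spirit of the ReLIC formulation this work builds on. It is an illustration under stated assumptions, not the authors' implementation: the function name relic_style_loss, the alpha weight, and the choice of a KL divergence between the two views' similarity distributions as the invariance term are all hypothetical.

```python
import torch
import torch.nn.functional as F


def relic_style_loss(z1, z2, temperature=0.1, alpha=1.0):
    """Contrastive loss plus an explicit invariance regularizer (sketch).

    z1, z2: [N, D] embeddings of two augmented views of the same N images.
    alpha:  weight on the invariance term (hypothetical hyperparameter).
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)

    # Similarity logits: row i compares view-1 image i against all view-2 images.
    logits12 = z1 @ z2.t() / temperature
    logits21 = z2 @ z1.t() / temperature
    targets = torch.arange(z1.size(0), device=z1.device)

    # Contrastive (InfoNCE-style) term: the matching pair sits on the diagonal.
    contrastive = F.cross_entropy(logits12, targets) + F.cross_entropy(logits21, targets)

    # Invariance term: KL divergence between the similarity distributions
    # induced by the two views, pushing them to agree explicitly.
    logp12 = F.log_softmax(logits12, dim=1)
    logp21 = F.log_softmax(logits21, dim=1)
    invariance = F.kl_div(logp12, logp21, log_target=True, reduction="batchmean")

    return contrastive + alpha * invariance
```

In use, z1 and z2 would come from encoding two differently augmented views of the same batch; the abstract's "varied set of appropriately constructed data views" suggests the full method draws on more than two views and augmentation schemes than this two-view sketch shows.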