Combining clustering and representation learning is one of the most promising
approaches for unsupervised learning of deep neural networks. However, doing so
naively leads to ill-posed learning problems with degenerate solutions. In this
paper, we propose a novel and principled learning formulation that addresses
these issues. The method is obtained by maximizing the information between
labels and input data indices. We show that this criterion extends standard
cross-entropy minimization to an optimal transport problem, which we solve
efficiently for millions of input images and thousands of labels using a fast
variant of the Sinkhorn-Knopp algorithm. The resulting method is able to
self-label visual data so as to train highly competitive image representations
without manual labels. Our method achieves state-of-the-art representation
learning performance for AlexNet and ResNet-50 on SVHN, CIFAR-10, CIFAR-100 and
ImageNet and yields the first self-supervised AlexNet that outperforms the
supervised Pascal VOC detection baseline. Code and models are available.

Comment: Accepted paper at the International Conference on Learning Representations (ICLR) 2020.
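To make the self-labelling step concrete, the following is a minimal NumPy sketch of how Sinkhorn-Knopp iterations can turn model predictions into balanced pseudo-labels, in the spirit of the optimal transport formulation described above. It is an illustrative assumption, not the authors' released implementation: the function name sinkhorn_labels, the regularization strength lambda_, and the iteration count are placeholders chosen here for clarity.

    import numpy as np

    def sinkhorn_labels(log_probs, lambda_=25.0, n_iters=100):
        """Turn model log-probabilities into balanced pseudo-labels.

        log_probs: (N, K) array of log-softmax outputs for N samples and K classes.
        Returns one hard label per sample, with classes of roughly equal size.
        """
        N, K = log_probs.shape
        # Exponentiate scaled log-probabilities to get the transport kernel.
        # Subtracting the per-row max only rescales rows, which the Sinkhorn
        # row scaling absorbs, so it does not change the solution but avoids underflow.
        P = np.exp(lambda_ * (log_probs - log_probs.max(axis=1, keepdims=True)))
        # Target marginals: each sample carries mass 1/N, each class receives 1/K.
        r = np.full(N, 1.0 / N)
        c = np.full(K, 1.0 / K)
        u = np.ones(N)
        for _ in range(n_iters):
            # Alternately rescale columns and rows to match the target marginals.
            v = c / (P.T @ u)
            u = r / (P @ v)
        Q = (u[:, None] * P) * v[None, :]  # doubly-scaled transport plan
        return Q.argmax(axis=1)            # hard pseudo-labels

    # Usage example: 6 samples, 3 classes, random predictions.
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(6, 3))
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    print(sinkhorn_labels(log_p))

The equipartition constraint (equal column marginals) is what rules out the degenerate solution in which all samples collapse onto a single label, while the Sinkhorn iterations keep the assignment step cheap enough to scale to millions of images.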