Recent self-supervised computer vision methods have demonstrated performance
equal or superior to that of supervised methods, opening the door for AI
systems to learn visual representations from practically unlimited data.
However, these methods
are classification-based and thus ineffective for learning dense feature maps
required for unsupervised semantic segmentation. This work presents a method to
effectively learn dense semantically rich visual concept embeddings applicable
to high-resolution images. We introduce superpixelization as a means to
decompose images into a small set of visually coherent regions, allowing
efficient learning of dense semantics by swapped prediction. The expressiveness
of our dense embeddings is demonstrated by substantially advancing the state
of the art on representation quality benchmarks on COCO (+16.27 mIoU) and
Cityscapes (+19.24 mIoU), for both low- and high-resolution images.
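The core idea above (pool dense features over superpixels, then train by swapped prediction of cluster assignments, in the spirit of SwAV) can be sketched as follows. This is a minimal NumPy toy, not the paper's implementation: the feature maps, the random "superpixel" label map, the prototype count, and the softmax-based code computation are all illustrative assumptions (a real pipeline would use a CNN/ViT backbone, an actual superpixel algorithm such as SLIC, and an optimal-transport code assignment).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dense feature maps for two augmented views of one image
# (hypothetical sizes; not the paper's architecture).
H, W, D, K = 8, 8, 16, 4          # spatial size, feature dim, num prototypes
feats_a = rng.normal(size=(H, W, D))
feats_b = rng.normal(size=(H, W, D))

# Stand-in "superpixelization": a label map assigning each pixel to one of
# S visually coherent regions. A real method would compute this from the
# image (e.g. SLIC); here it is random for illustration only.
S = 6
segments = rng.integers(0, S, size=(H, W))

def pool_superpixels(feats, segments, num_regions):
    """Average-pool dense features over each superpixel region,
    then L2-normalize the resulting region embeddings."""
    pooled = np.stack([feats[segments == s].mean(axis=0)
                       for s in range(num_regions)])
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Learnable prototypes (here frozen random vectors for the sketch).
prototypes = rng.normal(size=(K, D))
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)

def soft_codes(z, prototypes, temp=0.1):
    """Softmax over prototype similarities: soft cluster-assignment codes."""
    logits = z @ prototypes.T / temp
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

za = pool_superpixels(feats_a, segments, S)   # (S, D) regions, view A
zb = pool_superpixels(feats_b, segments, S)   # (S, D) regions, view B
qa, qb = soft_codes(za, prototypes), soft_codes(zb, prototypes)

# Swapped prediction: each view's region embedding must predict the
# OTHER view's code for the same region (symmetric cross-entropy).
loss = -np.mean(np.sum(qa * np.log(qb + 1e-9), axis=1)
                + np.sum(qb * np.log(qa + 1e-9), axis=1))
print(f"swapped-prediction loss: {loss:.4f}")
```

Because pooling reduces thousands of pixels to a handful of region embeddings, the swapped-prediction loss is computed over S regions rather than H*W pixels, which is what makes dense learning tractable at high resolution.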