We propose ADCLR: Accurate and Dense Contrastive Representation Learning, a
novel self-supervised framework for learning accurate and dense vision
representations. To extract spatial-sensitive information, ADCLR introduces
query patches for contrasting in addition to global contrasting. Compared
with previous dense contrasting methods, ADCLR has three main merits: i) it
achieves both global-discriminative and spatial-sensitive representations, ii)
it is model-efficient (no extra parameters beyond the global contrasting
baseline), and iii) it is correspondence-free and thus simpler to implement. Our
approach achieves new state-of-the-art performance for contrastive methods. On
classification tasks, for ViT-S, ADCLR achieves 77.5% top-1 accuracy on
ImageNet with linear probing, outperforming our baseline (DINO), which lacks
our plug-in techniques, by 0.5%. For ViT-B, ADCLR achieves 79.8% and 84.0%
accuracy on ImageNet with linear probing and fine-tuning, respectively,
outperforming iBOT by 0.3% and 0.2%. For dense tasks, on MS-COCO, ADCLR
achieves 44.3% AP on object detection and 39.7% AP on instance segmentation,
outperforming the previous SOTA method SelfPatch by 2.2% and 1.2%,
respectively. On ADE20K, ADCLR outperforms SelfPatch by 1.0% mIoU and 1.2% mAcc on
the segm