Distilling Localization for Self-Supervised Representation Learning
Recent progress in contrastive learning has revolutionized unsupervised
representation learning. Concretely, multiple views (augmentations) of the
same image are encouraged to map to similar embeddings, while views from
different images are pulled apart. In this paper, by visualizing and
diagnosing classification errors, we observe that current contrastive models
are ineffective at localizing the foreground object, which limits their
ability to extract discriminative high-level features. This is because the
view generation process treats all pixels in an image uniformly. To address
this problem, we propose a data-driven approach for learning invariance to
backgrounds: it first estimates foreground saliency in images and then
creates augmentations by copying and pasting the foreground onto a variety
of backgrounds. Learning still follows the instance discrimination pretext
task, so the representation is trained to disregard background content and
focus on the foreground. We study a variety of saliency estimation methods
and find that most of them improve contrastive learning. With this approach
(DiLo), we achieve significant gains in self-supervised learning on
ImageNet classification, as well as in object detection on PASCAL VOC and
MSCOCO.

Comment: Accepted by AAAI 2021
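For a concrete picture of the copy-and-paste augmentation described in the
abstract, here is a minimal sketch, assuming a saliency map from any
off-the-shelf estimator; the function name `composite_foreground` and the
0.5 threshold are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a DiLo-style background augmentation: paste the
# salient foreground of one image onto a different background, then feed
# the result to a standard instance-discrimination pipeline.
import numpy as np

def composite_foreground(image, saliency, background, threshold=0.5):
    """Copy-and-paste augmentation (illustrative, not the authors' code).

    image:      HxWx3 float array in [0, 1]
    saliency:   HxW float array in [0, 1] from any saliency estimator
                (the paper reports that most estimation methods help)
    background: HxWx3 float array in [0, 1], a different scene
    """
    mask = (saliency > threshold).astype(np.float32)[..., None]  # HxWx1
    # A hard mask is used for brevity; soft blending at the mask edge
    # would reduce paste artifacts.
    return mask * image + (1.0 - mask) * background

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((224, 224, 3)).astype(np.float32)
    sal = rng.random((224, 224)).astype(np.float32)
    bg = rng.random((224, 224, 3)).astype(np.float32)
    print(composite_foreground(img, sal, bg).shape)  # (224, 224, 3)
```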
Contrastive Transformation for Self-supervised Correspondence Learning
In this paper, we focus on the self-supervised learning of visual
correspondence using unlabeled videos in the wild. Our method simultaneously
considers intra- and inter-video representation associations for reliable
correspondence estimation. Intra-video learning transforms image contents
across frames within a single video via frame pair-wise affinities. To
obtain a discriminative representation for instance-level separation, we go
beyond intra-video analysis and construct an inter-video affinity to
facilitate contrastive transformation across different videos. By enforcing
transformation consistency between the intra- and inter-video levels,
fine-grained correspondence associations are well preserved and
instance-level feature discrimination is effectively reinforced. Our simple
framework outperforms recent self-supervised correspondence methods on a
range of visual tasks, including video object tracking (VOT), video object
segmentation (VOS), and pose keypoint tracking. Notably, our method also
surpasses a fully-supervised affinity representation (e.g., ResNet) and
performs competitively against recent fully-supervised algorithms designed
for the specific tasks (e.g., VOT and VOS).

Comment: To appear in AAAI 2021
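As a hedged illustration of the frame pair-wise affinity that drives the
intra-video transformation, the sketch below reconstructs one frame's
content from another through a normalized affinity matrix; the helper names
and the temperature value are assumptions, not the paper's implementation.

```python
# Illustrative sketch of affinity-based transformation across frames.
import torch
import torch.nn.functional as F

def affinity(feat_a, feat_b, temperature=0.07):
    """Pair-wise affinity between two feature maps.

    feat_a, feat_b: (C, N) features flattened over space, L2-normalized
    per column. Returns an (N, N) matrix whose columns are distributions
    over frame A's locations, one per location of frame B.
    """
    logits = feat_a.t() @ feat_b / temperature  # (N, N) similarities
    return F.softmax(logits, dim=0)             # column-stochastic

def transform(content_a, feat_a, feat_b):
    """Warp frame A's content (C', N) into frame B's layout: each target
    location becomes a convex combination of source locations."""
    return content_a @ affinity(feat_a, feat_b)  # (C', N)

if __name__ == "__main__":
    C, N = 64, 32 * 32
    fa = F.normalize(torch.randn(C, N), dim=0)
    fb = F.normalize(torch.randn(C, N), dim=0)
    color_a = torch.randn(3, N)  # e.g., color channels of frame A
    print(transform(color_a, fa, fb).shape)  # torch.Size([3, 1024])
```

The same affinity construction can, in principle, be chained across
different videos to form the inter-video contrastive term, with consistency
enforced between the two levels.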