MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
This paper presents a simple yet effective framework MaskCLIP, which
incorporates a newly proposed masked self-distillation into contrastive
language-image pretraining. The core idea of masked self-distillation is to
distill representation from a full image to the representation predicted from a
masked image. Such incorporation enjoys two vital benefits. First, masked
self-distillation targets local patch representation learning, which is
complementary to vision-language contrastive learning's focus on text-related
representation. Second, masked self-distillation is also consistent with
vision-language contrastive learning from the perspective of the training
objective, as both utilize the visual encoder for feature alignment, and thus
the model is able to learn local semantics with indirect supervision from the
language. We provide
specially designed experiments with a comprehensive analysis to validate the
two benefits. Symmetrically, we also introduce the local semantic supervision
into the text branch, which further improves the pretraining performance. With
extensive experiments, we show that MaskCLIP, when applied to various
challenging downstream tasks, achieves superior results in linear probing,
finetuning, and zero-shot performance with the guidance of the language
encoder. Code will be released at \url{https://github.com/LightDXY/MaskCLIP}.
Comment: CVPR 2023, code is available at https://github.com/LightDXY/MaskCLIP
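The masked self-distillation idea in this abstract — distilling the representation of a full image into the representation predicted from a masked image — can be sketched as a toy NumPy example. The shapes, masking ratio, and plain cosine loss here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_self_distillation_loss(full_repr, masked_pred, masked_idx):
    """Toy masked self-distillation objective (hypothetical shapes):
    the full-image patch representations act as teacher targets for
    the representations predicted from a masked image, and the loss
    is computed only at the masked patch positions."""
    # Normalize so the loss depends on direction, not magnitude.
    t = full_repr / np.linalg.norm(full_repr, axis=-1, keepdims=True)
    s = masked_pred / np.linalg.norm(masked_pred, axis=-1, keepdims=True)
    # Mean (1 - cosine similarity) over the masked patches only.
    cos = np.sum(t[masked_idx] * s[masked_idx], axis=-1)
    return float(np.mean(1.0 - cos))

# 16 patches, 8-dim features; 6 of them are masked.
full = rng.normal(size=(16, 8))
pred = full + 0.1 * rng.normal(size=(16, 8))  # student roughly matches teacher
masked = np.array([1, 3, 5, 7, 9, 11])
loss = masked_self_distillation_loss(full, pred, masked)
```

A perfect student prediction drives the loss to zero, which is the distillation target the abstract describes.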
Unsupervised Contrastive Representation Learning for Knowledge Distillation and Clustering
Unsupervised contrastive learning has emerged as an important training strategy that learns representations by pulling positive samples closer and pushing negative samples apart in a low-dimensional latent space. Usually, positive samples are augmented versions of the same input and negative samples come from different inputs. Once the low-dimensional representations are learned, further analysis, such as clustering and classification, can be performed on them. Currently, there are two challenges in this framework. First, empirical studies reveal that even though contrastive learning methods show great progress in representation learning with large models, they do not work well for small models. Second, this framework has achieved excellent clustering results on small datasets but has limitations on datasets with a large number of clusters, such as ImageNet. In this dissertation, our research goal is to develop new unsupervised contrastive representation learning methods and apply them to knowledge distillation and clustering.
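The pull-positives/push-negatives objective described above is commonly formalized as an InfoNCE-style loss. The following NumPy sketch is a generic illustration of that objective, not the dissertation's exact formulation; the temperature and dimensions are made-up values:

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Toy InfoNCE-style contrastive loss: pull the positive pair
    together and push negatives apart in the latent space."""
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    a, p, n = norm(anchor), norm(positive), norm(negatives)
    pos = np.dot(a, p) / temperature   # similarity to the positive
    neg = n @ a / temperature          # similarities to the negatives
    logits = np.concatenate([[pos], neg])
    m = logits.max()
    # Cross-entropy with the positive treated as the correct "class".
    return float(np.log(np.sum(np.exp(logits - m))) - (pos - m))

rng = np.random.default_rng(1)
a = rng.normal(size=8)
p = a + 0.05 * rng.normal(size=8)   # augmented view: close to the anchor
n = rng.normal(size=(10, 8))        # negatives from different inputs
loss = info_nce(a, p, n)
```

The loss shrinks as the positive pair gets closer and grows as negatives crowd in, which is exactly the geometry the abstract describes.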
Knowledge distillation transfers knowledge from high-capacity teachers to small student models, thereby improving the students' performance. Representational knowledge distillation methods distill the knowledge of representations from teachers to students. Current representational knowledge distillation methods undesirably push apart representations of samples from the same class in their correlation objectives, leading to inferior distillation results. Here, we introduce Dual-level Knowledge Distillation (DLKD), which explicitly combines knowledge alignment and knowledge correlation instead of using a single contrastive objective. We show that both knowledge alignment and knowledge correlation are necessary to improve distillation performance. The proposed DLKD is task-agnostic and model-agnostic, and enables effective knowledge transfer from supervised or self-supervised teachers to students. Experiments demonstrate that DLKD outperforms other state-of-the-art methods across a large number of experimental settings, including different (a) pretraining strategies, (b) network architectures, (c) datasets, and (d) tasks.
Currently, a two-stage framework is widely used in deep learning-based clustering: representations are learned first, and then clustering algorithms, such as K-means, are performed on the representations to obtain cluster assignments. However, the learned representations may not be optimal for clustering in this two-stage framework. Here, we propose Contrastive Learning-based Clustering (CLC), which uses contrastive learning to directly learn cluster assignments. We decompose the representation into two parts: one encodes the categorical information under an equipartition constraint, and the other captures the instance-wise factors. We theoretically analyze the proposed contrastive loss and reveal that CLC sets different weights for the negative samples while learning cluster assignments. Therefore, the proposed loss has high expressiveness, which enables us to efficiently learn cluster assignments. Experimental evaluation shows that CLC achieves overall state-of-the-art or highly competitive clustering performance on multiple benchmark datasets. In particular, we achieve 53.4% accuracy on the full ImageNet dataset and outperform existing methods by a large margin (+10.2%).
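The two-part decomposition described in this abstract — a categorical (cluster-assignment) component plus an instance-wise component — can be sketched with two linear heads. The head weights and dimensions below are random stand-ins, not CLC's trained parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decompose(features, w_cat, w_inst):
    """Hypothetical sketch of a two-part representation: one head
    produces categorical (cluster-assignment) probabilities, the
    other a normalized instance-wise embedding."""
    assign = softmax(features @ w_cat)   # categorical part
    inst = features @ w_inst             # instance-wise part
    inst = inst / np.linalg.norm(inst, axis=-1, keepdims=True)
    return assign, inst

rng = np.random.default_rng(2)
feats = rng.normal(size=(32, 16))  # 32 samples, 16-dim features
assign, inst = decompose(feats,
                         rng.normal(size=(16, 10)),  # 10 clusters
                         rng.normal(size=(16, 8)))   # 8-dim instance code
# Each row of `assign` is a distribution over the 10 clusters.
```

In the actual method, the equipartition constraint would additionally push the column means of `assign` toward uniform so that clusters stay balanced.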
MobileVOS: Real-Time Video Object Segmentation: Contrastive Learning meets Knowledge Distillation
This paper tackles the problem of semi-supervised video object segmentation
on resource-constrained devices, such as mobile phones. We formulate this
problem as a distillation task, whereby we demonstrate that small
space-time-memory networks with finite memory can achieve competitive results
with the state of the art, but at a fraction of the computational cost (32
milliseconds per frame on a Samsung Galaxy S22). Specifically, we provide a
theoretically grounded framework that unifies knowledge distillation with
supervised contrastive representation learning. These models are able to
jointly benefit from both pixel-wise contrastive learning and distillation from
a pre-trained teacher. We validate this loss by achieving J&F competitive with
the state of the art on both the standard DAVIS and YouTube benchmarks, despite
running up to 5x faster and with 32x fewer parameters.
Comment: CVPR 202
EnSiam: Self-Supervised Learning With Ensemble Representations
Recently, contrastive self-supervised learning, where the proximity of
representations is determined based on the identities of samples, has made
remarkable progress in unsupervised representation learning. SimSiam is a
well-known example in this area, known for its simplicity yet powerful
performance. However, it is known to be sensitive to changes in training
configurations, such as hyperparameters and augmentation settings, due to its
structural characteristics. To address this issue, we focus on the similarity
between contrastive learning and the teacher-student framework in knowledge
distillation. Inspired by the ensemble-based knowledge distillation approach,
the proposed method, EnSiam, aims to improve the contrastive learning procedure
using ensemble representations. This provides stable pseudo labels, leading to
better performance. Experiments demonstrate that EnSiam outperforms previous
state-of-the-art methods in most cases, including experiments on ImageNet,
which shows that EnSiam is capable of learning high-quality representations.
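The ensemble representations that EnSiam uses as stable pseudo labels can be illustrated by averaging several view representations into one target. This is a deterministic toy in NumPy showing why averaging stabilizes the target, not EnSiam's exact formulation:

```python
import numpy as np

def ensemble_target(reprs):
    """Toy sketch of an ensembled target: average several view
    representations into one normalized, more stable pseudo label
    (illustrative of the idea, not EnSiam's actual method)."""
    t = np.mean(reprs, axis=0)
    return t / np.linalg.norm(t)

clean = np.array([1.0, 2.0, -1.0, 0.5])
noise = np.array([0.3, -0.2, 0.1, -0.4])
# Two perturbed views whose noise cancels exactly in the average,
# so the ensembled target recovers the underlying direction.
views = np.stack([clean + noise, clean - noise])
target = ensemble_target(views)
```

With independent noise across views, the averaged target's deviation from the underlying signal shrinks as more views are added, which is the stability argument behind ensemble pseudo labels.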
Adaptive Similarity Bootstrapping for Self-Distillation
Most self-supervised methods for representation learning leverage a
cross-view consistency objective, i.e., they maximize the representation
similarity of a given image's augmented views. The recent work NNCLR goes beyond
the cross-view paradigm and uses positive pairs from different images obtained
via nearest neighbor bootstrapping in a contrastive setting. We empirically
show that as opposed to the contrastive learning setting which relies on
negative samples, incorporating nearest neighbor bootstrapping in a
self-distillation scheme can lead to a performance drop or even collapse. We
scrutinize the reason for this unexpected behavior and provide a solution. We
propose to adaptively bootstrap neighbors based on the estimated quality of the
latent space. We report consistent improvements compared to the naive
bootstrapping approach and the original baselines. Our approach leads to
performance improvements for various self-distillation method/backbone
combinations and standard downstream tasks. Our code will be released upon
acceptance.
Comment: * denotes equal contribution
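The adaptive bootstrapping idea above — using a nearest neighbor as the positive only when the latent space is good enough — can be sketched with a similarity threshold standing in for the paper's quality estimate. The threshold and queue here are hypothetical illustrations:

```python
import numpy as np

def nn_bootstrap(anchor, queue, threshold=0.5):
    """Hypothetical sketch of adaptive nearest-neighbor bootstrapping:
    replace the positive with its nearest neighbor from a support queue
    only when that neighbor is similar enough (a crude stand-in for an
    estimate of latent-space quality)."""
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    a, q = norm(anchor), norm(queue)
    sims = q @ a                      # cosine similarity to each queue entry
    best = int(np.argmax(sims))
    # Fall back to the original anchor if no neighbor is close enough,
    # avoiding the collapse that naive bootstrapping can cause.
    return q[best] if sims[best] >= threshold else a

anchor = np.array([1.0, 0.0, 0.0])
queue = np.array([[0.9, 0.1, 0.0],   # near-duplicate of the anchor
                  [0.0, 1.0, 0.0]])  # unrelated sample
pos = nn_bootstrap(anchor, queue, threshold=0.5)
```

With a permissive threshold the near-duplicate is adopted as the positive; with a very strict one the method falls back to the anchor itself, mimicking the adaptive behavior the abstract describes.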