Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
Unsupervised image representations have significantly reduced the gap with
supervised pretraining, notably with the recent achievements of contrastive
learning methods. These contrastive methods typically work online and rely on a
large number of explicit pairwise feature comparisons, which is computationally
challenging. In this paper, we propose an online algorithm, SwAV, that takes
advantage of contrastive methods without requiring explicit pairwise
comparisons. Specifically, our method simultaneously clusters the data while
enforcing consistency between cluster assignments produced for different
augmentations (or views) of the same image, instead of comparing features
directly as in contrastive learning. Simply put, we use a swapped prediction
mechanism where we predict the cluster assignment of a view from the
representation of another view. Our method can be trained with large and small
batches and can scale to unlimited amounts of data. Compared to previous
contrastive methods, our method is more memory efficient since it does not
require a large memory bank or a special momentum network. In addition, we
propose a new data augmentation strategy, multi-crop, that uses a mix of views
with different resolutions in place of two full-resolution views, without
increasing the memory or compute requirements much. We validate our findings by
achieving 75.3% top-1 accuracy on ImageNet with ResNet-50, as well as
surpassing supervised pretraining on all the considered transfer tasks.
Comment: NeurIPS 2020
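To make the swapped-prediction mechanism concrete, below is a minimal PyTorch sketch under assumed shapes and hyperparameters: each view's features are scored against a set of prototypes, Sinkhorn-Knopp normalization turns one view's scores into balanced soft assignments (codes), and the other view is trained to predict them. Names and constants are illustrative, not the paper's exact implementation.

```python
# Minimal sketch of SwAV-style swapped prediction (hypothetical names;
# see github.com/facebookresearch/swav for the real code).
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    """Turn prototype scores (B, K) into balanced soft assignments (B, K)."""
    Q = torch.exp(scores / eps).t()  # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True); Q /= K  # normalize over prototypes
        Q /= Q.sum(dim=0, keepdim=True); Q /= B  # normalize over samples
    return (Q * B).t()

def swav_loss(z1, z2, prototypes, temp=0.1):
    """Each view predicts the cluster assignment (code) of the other view."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    s1, s2 = z1 @ prototypes.t(), z2 @ prototypes.t()  # prototype scores
    q1, q2 = sinkhorn(s1), sinkhorn(s2)                # target codes (no grad)
    p1 = F.log_softmax(s1 / temp, dim=1)
    p2 = F.log_softmax(s2 / temp, dim=1)
    return -0.5 * ((q2 * p1).sum(1) + (q1 * p2).sum(1)).mean()

# toy usage: 32 samples, 128-d features, 300 prototypes
protos = F.normalize(torch.randn(300, 128), dim=1)
loss = swav_loss(torch.randn(32, 128), torch.randn(32, 128), protos)
```

Because the Sinkhorn targets are computed per batch and detached, no memory bank or momentum encoder is needed, which is the memory advantage the abstract claims.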
Deep Multiview Clustering by Contrasting Cluster Assignments
Multiview clustering (MVC) aims to reveal the underlying structure of
multiview data by categorizing data samples into clusters. Deep learning-based
methods exhibit strong feature learning capabilities on large-scale datasets.
For most existing deep MVC methods, learning representations that are invariant
across multiple views remains a challenging problem. In this paper, we propose a
cross-view contrastive learning (CVCL) method that learns view-invariant
representations and produces clustering results by contrasting the cluster
assignments among multiple views. Specifically, we first employ deep
autoencoders to extract view-dependent features in the pretraining stage. Then,
a cluster-level CVCL strategy is presented to explore consistent semantic label
information among the multiple views in the fine-tuning stage. Thus, the
proposed CVCL method is able to produce more discriminative cluster assignments
by virtue of this learning strategy. Moreover, we provide a theoretical
analysis of soft cluster assignment alignment. Extensive experimental results
obtained on several datasets demonstrate that the proposed CVCL method
outperforms several state-of-the-art approaches.
Comment: 10 pages, 7 figures
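One plausible reading of the cluster-level contrastive strategy is sketched below: treating the k-th assignment column of each view as a cluster representation, matching columns across views form positive pairs while the remaining columns serve as negatives. All names and temperatures here are assumptions, not the authors' code.

```python
# Illustrative cluster-level contrastive loss across two views.
import torch
import torch.nn.functional as F

def cluster_contrastive_loss(p_a, p_b, temp=0.5):
    """p_a, p_b: (N, K) soft cluster assignments for two views of N samples."""
    c_a = F.normalize(p_a.t(), dim=1)  # (K, N): each row represents a cluster
    c_b = F.normalize(p_b.t(), dim=1)
    K = c_a.shape[0]
    reps = torch.cat([c_a, c_b], dim=0)       # (2K, N)
    sim = reps @ reps.t() / temp              # (2K, 2K) cluster similarities
    sim.fill_diagonal_(float('-inf'))         # exclude self-similarity
    # cluster k in view a should match cluster k in view b, and vice versa
    targets = torch.cat([torch.arange(K, 2 * K), torch.arange(0, K)])
    return F.cross_entropy(sim, targets)

# toy usage: soft assignments from two view-specific clustering heads
p_a = torch.softmax(torch.randn(256, 10), dim=1)
p_b = torch.softmax(torch.randn(256, 10), dim=1)
loss = cluster_contrastive_loss(p_a, p_b)
```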
Self-Supervised Visual Representation Learning with Semantic Grouping
In this paper, we tackle the problem of learning visual representations from
unlabeled scene-centric data. Existing works have demonstrated the potential of
utilizing the underlying complex structure within scene-centric data; still,
they commonly rely on hand-crafted objectness priors or specialized pretext
tasks to build a learning framework, which may harm generalizability. Instead,
we propose contrastive learning from data-driven semantic slots, namely
SlotCon, for joint semantic grouping and representation learning. The semantic
grouping is performed by assigning pixels to a set of learnable prototypes,
which adapt to each sample via attentive pooling over the features to form
new slots. Based on the learned data-dependent slots, a contrastive objective
is employed for representation learning, which enhances the discriminability of
features and, in turn, facilitates grouping semantically coherent pixels
together. Compared with previous efforts, by simultaneously optimizing the two
coupled objectives of semantic grouping and contrastive learning, our approach
bypasses the disadvantages of hand-crafted priors and is able to learn
object/group-level representations from scene-centric images. Experiments show
our approach effectively decomposes complex scenes into semantic groups for
feature learning and significantly benefits downstream tasks, including object
detection, instance segmentation, and semantic segmentation. Code is available
at: https://github.com/CVMI-Lab/SlotCon
Comment: Accepted at NeurIPS 2022
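The grouping step can be sketched as follows, with assumed shapes and hypothetical names: dense features are softly assigned to learnable prototypes, and attentive pooling over the assigned pixels produces sample-adapted slots.

```python
# Hedged sketch of semantic grouping via learnable prototypes.
import torch
import torch.nn.functional as F

def group_into_slots(feats, prototypes, temp=0.07):
    """feats: (B, C, H, W) dense features; prototypes: (K, C) learnable.
    Returns slots (B, K, C) and pixel-to-prototype assignments (B, HW, K)."""
    x = F.normalize(feats.flatten(2).transpose(1, 2), dim=-1)  # (B, HW, C)
    protos = F.normalize(prototypes, dim=-1)                   # (K, C)
    attn = (x @ protos.t() / temp).softmax(dim=-1)             # (B, HW, K)
    # attentive pooling: each prototype aggregates its assigned pixels
    weights = attn / attn.sum(dim=1, keepdim=True).clamp_min(1e-6)
    slots = weights.transpose(1, 2) @ x                        # (B, K, C)
    return slots, attn

# toy usage with a 7x7 feature map
slots, attn = group_into_slots(torch.randn(2, 256, 7, 7),
                               torch.randn(64, 256))
```

A contrastive objective over such slots (rather than over whole-image features) is what lets the method learn object/group-level representations from scene-centric images.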
Generalized Category Discovery with Clustering Assignment Consistency
Generalized category discovery (GCD) is a recently proposed open-world task.
Given a set of images consisting of labeled and unlabeled instances, the goal
of GCD is to automatically cluster the unlabeled samples using information
transferred from the labeled dataset. The unlabeled dataset comprises both
known and novel classes. The main challenge is that unlabeled novel class
samples and unlabeled known class samples are mixed together in the unlabeled
dataset. To address GCD without knowing the number of classes in the unlabeled
dataset, we propose a co-training-based framework that encourages clustering
consistency. Specifically, we first introduce weak and strong augmentation
transformations to generate two sufficiently different views for the same
sample. Then, based on the co-training assumption, we propose a consistency
representation learning strategy, which encourages consistency between
feature-prototype similarity and clustering assignment. Finally, we use the
discriminative embeddings learned from the semi-supervised representation
learning process to construct a sparse network over the samples, and we apply a
community detection method to obtain the clustering results and the number of categories
simultaneously. Extensive experiments show that our method achieves
state-of-the-art performance on three generic benchmarks and three fine-grained
visual recognition datasets. In particular, on the ImageNet-100 dataset, our
method significantly exceeds the best baseline by 15.5\% and 7.0\% on the
\texttt{Novel} and \texttt{All} classes, respectively.
Comment: ICONIP 2023. This paper has been nominated for the ICONIP 2023 Best Paper Award.
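The consistency between feature-prototype similarity and clustering assignment might look roughly like the sketch below, where the strong view's prototype prediction is trained to match the weak view's assignment. This is a simplified stand-in with assumed names and temperatures, not the paper's exact formulation.

```python
# Simplified weak/strong consistency sketch in PyTorch.
import torch
import torch.nn.functional as F

def consistency_loss(z_weak, z_strong, prototypes, temp=0.1):
    """z_weak, z_strong: (N, D) embeddings of weak/strong augmented views."""
    protos = F.normalize(prototypes, dim=1)
    sim_w = F.normalize(z_weak, dim=1) @ protos.t()   # feature-prototype sim
    sim_s = F.normalize(z_strong, dim=1) @ protos.t()
    with torch.no_grad():
        target = F.softmax(sim_w / temp, dim=1)       # weak-view assignment
    log_pred = F.log_softmax(sim_s / temp, dim=1)     # strong-view prediction
    return -(target * log_pred).sum(dim=1).mean()

# toy usage: 128 samples, 64-d embeddings, 20 prototypes
loss = consistency_loss(torch.randn(128, 64), torch.randn(128, 64),
                        torch.randn(20, 64))
```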
Self-Supervised Classification Network
We present Self-Classifier -- a novel self-supervised end-to-end
classification learning approach. Self-Classifier learns labels and
representations simultaneously in a single-stage end-to-end manner by
optimizing for same-class prediction of two augmented views of the same sample.
To rule out degenerate solutions (i.e., solutions where all samples are
assigned to the same class), we propose a mathematically motivated variant of
the cross-entropy loss that has a uniform prior asserted on the predicted
labels. In our theoretical analysis we prove that degenerate solutions are not
in the set of optimal solutions of our approach. Self-Classifier is simple to
implement and scalable. Unlike other popular unsupervised classification and
contrastive representation learning approaches, it does not require any form of
pre-training, expectation maximization, pseudo-labelling, external clustering,
a second network, stop-gradient operation or negative pairs. Despite its
simplicity, our approach sets a new state of the art for unsupervised
classification of ImageNet, and even achieves results comparable to the state
of the art for unsupervised representation learning. Code:
https://github.com/elad-amrani/self-classifier
Comment: Update method and add experiments
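A rough sketch of a cross-entropy with a uniform prior on the predicted labels follows: normalizing the target over the batch (columns) spreads probability mass across classes and blocks the degenerate single-class solution. This is a simplified reading with assumed names and temperatures; the exact normalization is derived in the paper and implemented in the linked repository.

```python
# Simplified same-class prediction loss with a uniform label prior.
import torch
import torch.nn.functional as F

def same_class_loss(logits1, logits2, row_temp=0.1, col_temp=0.05):
    """logits1, logits2: (N, K) class logits for two views of N samples."""
    # softmax over the batch (dim=0) asserts a uniform prior on labels,
    # preventing all samples from collapsing onto a single class
    target = F.softmax(logits2 / col_temp, dim=0)
    target = target / target.sum(dim=1, keepdim=True).clamp_min(1e-9)
    log_pred = F.log_softmax(logits1 / row_temp, dim=1)
    return -(target.detach() * log_pred).sum(dim=1).mean()

def symmetric_loss(l1, l2):
    return 0.5 * (same_class_loss(l1, l2) + same_class_loss(l2, l1))

# toy usage: 64 samples, 10 classes
loss = symmetric_loss(torch.randn(64, 10), torch.randn(64, 10))
```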
Measuring the Interpretability of Unsupervised Representations via Quantized Reverse Probing
Self-supervised visual representation learning has recently attracted
significant research interest. While a common way to evaluate self-supervised
representations is through transfer to various downstream tasks, we instead
investigate the problem of measuring their interpretability, i.e. understanding
the semantics encoded in raw representations. We formulate the latter as
estimating the mutual information between the representation and a space of
manually labelled concepts. To quantify this, we introduce a decoding
bottleneck: information must be captured by simple predictors, mapping concepts
to clusters in representation space. This approach, which we call reverse
linear probing, provides a single number sensitive to the semanticity of the
representation. This measure is also able to detect when the representation
contains combinations of concepts (e.g., "red apple") instead of just
individual attributes ("red" and "apple" independently). Finally, we propose to
use supervised classifiers to automatically label large datasets in order to
enrich the space of concepts used for probing. We use our method to evaluate a
large number of self-supervised representations, ranking them by
interpretability, highlighting the differences that emerge relative to the
standard evaluation with linear probes, and discussing several qualitative
insights. Code at: https://github.com/iro-cp/ssl-qrp
Comment: Published at ICLR 2022. Appendix included, 26 pages
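At its core, the measurement quantizes a representation into clusters and estimates the mutual information between cluster ids and concept labels. Below is a hedged sketch using scikit-learn stand-ins (a k-means quantizer and a plug-in MI estimate), not the paper's exact pipeline.

```python
# Hedged sketch: MI between quantized representations and concept labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import mutual_info_score

def quantized_probe_mi(reps, concepts, n_clusters=100, seed=0):
    """reps: (N, D) frozen representations; concepts: (N,) concept labels.
    Returns the mutual information (nats) between cluster ids and concepts."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit_predict(reps)
    return mutual_info_score(concepts, clusters)

# toy usage (real use: frozen SSL features plus labelled concepts)
reps = np.random.randn(1000, 64)
concepts = np.random.randint(0, 20, size=1000)
score = quantized_probe_mi(reps, concepts)
```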