7 research outputs found
EMP-SSL: Towards Self-Supervised Learning in One Training Epoch
Recently, self-supervised learning (SSL) has achieved tremendous success in
learning image representation. Despite the empirical success, most
self-supervised learning methods are rather "inefficient" learners, typically
taking hundreds of training epochs to fully converge. In this work, we show
that the key towards efficient self-supervised learning is to increase the
number of crops from each image instance. Leveraging one of the
state-of-the-art SSL methods, we introduce a simplistic form of self-supervised
learning method called Extreme-Multi-Patch Self-Supervised-Learning (EMP-SSL)
that does not rely on many heuristic techniques for SSL such as weight sharing
between the branches, feature-wise normalization, output quantization, and stop
gradient, etc, and reduces the training epochs by two orders of magnitude. We
show that the proposed method is able to converge to 85.1% on CIFAR-10, 58.5%
on CIFAR-100, 38.1% on Tiny ImageNet and 58.5% on ImageNet-100 in just one
epoch. Furthermore, the proposed method achieves 91.5% on CIFAR-10, 70.1% on
CIFAR-100, 51.5% on Tiny ImageNet and 78.9% on ImageNet-100 with linear probing
in less than ten training epochs. In addition, we show that EMP-SSL shows
significantly better transferability to out-of-domain datasets compared to
baseline SSL methods. We will release the code at
https://github.com/tsb0601/EMP-SSL
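The central idea in the abstract above, drawing many crops from each image instance, can be illustrated with a minimal NumPy sketch. This is not the released EMP-SSL code: the function name `random_patches`, the fixed square patch size, and the absence of any augmentation are simplifying assumptions made only for illustration.

```python
import numpy as np


def random_patches(image, n_patches, patch_size, rng):
    """Cut n_patches random square crops of side patch_size from one image.

    image is an (H, W, C) array. Hypothetical helper for illustration only;
    a real SSL pipeline would also rescale and augment each crop.
    """
    h, w = image.shape[:2]
    if patch_size > min(h, w):
        raise ValueError("patch_size must not exceed the image height or width")
    # Top-left corners are sampled uniformly so every crop stays inside the image.
    tops = rng.integers(0, h - patch_size + 1, size=n_patches)
    lefts = rng.integers(0, w - patch_size + 1, size=n_patches)
    return np.stack(
        [image[t:t + patch_size, l:l + patch_size] for t, l in zip(tops, lefts)]
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((32, 32, 3))  # stand-in for a CIFAR-sized image
    patches = random_patches(image, n_patches=20, patch_size=16, rng=rng)
    print(patches.shape)
```

The patch count of 20 and the patch size of 16 are arbitrary choices for the demonstration, not values taken from the abstract.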
Image Clustering via the Principle of Rate Reduction in the Age of Pretrained Models
The advent of large pre-trained models has brought about a paradigm shift in
both visual representation learning and natural language processing. However,
clustering unlabeled images, as a fundamental and classic machine learning
problem, still lacks an effective solution, particularly for large-scale datasets.
In this paper, we propose a novel image clustering pipeline that leverages the
powerful feature representation of large pre-trained models such as CLIP and
clusters images effectively and efficiently at scale. We show that the
pre-trained features become significantly more structured after further optimizing
the rate reduction objective. The resulting features may significantly improve
the clustering accuracy, e.g., from 57% to 66% on ImageNet-1k. Furthermore,
by leveraging CLIP's image-text binding, we show how the new clustering method
leads to a simple yet effective self-labeling algorithm that successfully works
on unlabeled large datasets such as MS-COCO and LAION-Aesthetics. We will
release the code at https://github.com/LeslieTrue/CPP. Comment: 21 pages, 13 figures
Unsupervised Manifold Linearizing and Clustering
We consider the problem of simultaneously clustering and learning a linear
representation of data lying close to a union of low-dimensional manifolds, a
fundamental task in machine learning and computer vision. When the manifolds
are assumed to be linear subspaces, this reduces to the classical problem of
subspace clustering, which has been studied extensively over the past two
decades. Unfortunately, many real-world datasets such as natural images cannot
be well approximated by linear subspaces. On the other hand, numerous works
have attempted to learn an appropriate transformation of the data, such that
data is mapped from a union of general non-linear manifolds to a union of
linear subspaces (with points from the same manifold being mapped to the same
subspace). However, many existing works have limitations such as assuming
knowledge of the membership of samples to clusters, requiring high sampling
density, or being shown theoretically to learn trivial representations. In this
paper, we propose to optimize the Maximal Coding Rate Reduction metric with
respect to both the data representation and a novel doubly stochastic cluster
membership, inspired by state-of-the-art subspace clustering results. We give a
parameterization of such a representation and membership, allowing efficient
mini-batching and one-shot initialization. Experiments on CIFAR-10, -20, -100,
and TinyImageNet-200 datasets show that the proposed method is much more
accurate and scalable than state-of-the-art deep clustering methods, and
further learns a latent linear representation of the data.
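The rate reduction objective that recurs in these abstracts can be sketched numerically. The abstracts do not state the formula; the sketch below assumes the standard form from the rate reduction literature, where the coding rate of features Z (d dimensions, n samples) is R(Z) = 1/2 logdet(I + d/(n eps^2) Z Z^T) and the rate reduction is R(Z) minus the membership-weighted rates of the individual groups. The function names and the default `eps` are illustrative choices, not taken from any of the papers' code.

```python
import numpy as np


def coding_rate(Z, eps=0.5):
    """Coding rate R(Z) = 1/2 * logdet(I + d / (n * eps^2) * Z Z^T).

    Z has shape (d, n): one column per sample.
    """
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * (Z @ Z.T))
    return 0.5 * logdet


def rate_reduction(Z, Pi, eps=0.5):
    """Rate of the whole set minus the membership-weighted rates of the groups.

    Pi has shape (n, k); Pi[i, j] is the (soft) membership of sample i in
    group j. Assumed form of the objective, written for illustration.
    """
    d, n = Z.shape
    compressed = 0.0
    for j in range(Pi.shape[1]):
        w = Pi[:, j]
        mass = w.sum()
        if mass <= 1e-12:
            continue  # an empty group contributes nothing
        cov = (Z * w) @ Z.T  # equals Z diag(w) Z^T
        _, logdet = np.linalg.slogdet(np.eye(d) + (d / (mass * eps**2)) * cov)
        compressed += (mass / (2 * n)) * logdet
    return coding_rate(Z, eps) - compressed


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two groups of 20 samples lying in orthogonal 2-D subspaces of R^4.
    Z = np.zeros((4, 40))
    Z[:2, :20] = rng.standard_normal((2, 20))
    Z[2:, 20:] = rng.standard_normal((2, 20))
    Pi = np.zeros((40, 2))
    Pi[:20, 0] = 1.0
    Pi[20:, 1] = 1.0
    print(rate_reduction(Z, Pi))
```

With a single group containing every sample the two terms coincide and the rate reduction is zero; it becomes positive when the membership separates samples that lie in different subspaces.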
White-Box Transformers via Sparse Rate Reduction
In this paper, we contend that the objective of representation learning is to
compress and transform the distribution of the data, say sets of tokens,
towards a mixture of low-dimensional Gaussian distributions supported on
incoherent subspaces. The quality of the final representation can be measured
by a unified objective function called sparse rate reduction. From this
perspective, popular deep networks such as transformers can be naturally viewed
as realizing iterative schemes to optimize this objective incrementally.
Particularly, we show that the standard transformer block can be derived from
alternating optimization on complementary parts of this objective: the
multi-head self-attention operator can be viewed as a gradient descent step to
compress the token sets by minimizing their lossy coding rate, and the
subsequent multi-layer perceptron can be viewed as attempting to sparsify the
representation of the tokens. This leads to a family of white-box
transformer-like deep network architectures which are mathematically fully
interpretable. Despite their simplicity, experiments show that these networks
indeed learn to optimize the designed objective: they compress and sparsify
representations of large-scale real-world vision datasets such as ImageNet, and
achieve performance very close to thoroughly engineered transformers such as
ViT. Code is at https://github.com/Ma-Lab-Berkeley/CRATE. Comment: 33 pages, 11 figures
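The abstract above describes the multi-layer perceptron as attempting to sparsify the token representation. One common way to take such a sparsifying step, assumed here purely for illustration, is a single proximal gradient (ISTA) iteration on a non-negative lasso problem. The dictionary `D`, the penalty `lam`, the step size, and the function names are hypothetical and are not taken from the CRATE repository.

```python
import numpy as np


def lasso_objective(a, x, D, lam):
    """0.5 * ||x - D a||^2 + lam * ||a||_1 for a code a of the signal x."""
    return 0.5 * np.sum((x - D @ a) ** 2) + lam * np.sum(np.abs(a))


def ista_step(a, x, D, lam, step):
    """One proximal gradient step for the non-negative lasso.

    A gradient step on the reconstruction error is followed by a shifted
    ReLU, which is the proximal operator of the L1 penalty restricted to
    non-negative codes.
    """
    return np.maximum(a + step * (D.T @ (x - D @ a)) - step * lam, 0.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D = rng.standard_normal((8, 16))  # hypothetical dictionary
    x = rng.standard_normal(8)        # hypothetical input token
    a = np.zeros(16)
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / Lipschitz constant of the gradient
    for _ in range(50):
        a = ista_step(a, x, D, lam=0.1, step=step)
    print(np.count_nonzero(a), lasso_objective(a, x, D, 0.1))
```

Choosing the step size as the inverse of the squared spectral norm of `D` keeps each iteration from increasing the objective.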
CTRL: Closed-Loop Transcription to an LDR via Minimaxing Rate Reduction
This work proposes a new computational framework for learning a structured generative model for real-world datasets. In particular, we propose to learn a Closed-loop Transcription between a multi-class, multi-dimensional data distribution and a Linear discriminative representation (CTRL) in the feature space that consists of multiple independent multi-dimensional linear subspaces. Specifically, we argue that the optimal encoding and decoding mappings sought can be formulated as a two-player minimax game between the encoder and decoder for the learned representation. A natural utility function for this game is the so-called rate reduction, a simple information-theoretic measure for distances between mixtures of subspace-like Gaussians in the feature space. Our formulation draws inspiration from closed-loop error feedback in control systems and avoids the expensive evaluation and minimization of approximate distances between arbitrary distributions in either the data space or the feature space. To a large extent, this new formulation unifies the concepts and benefits of Auto-Encoding and GAN and naturally extends them to the setting of learning a representation that is both discriminative and generative for multi-class and multi-dimensional real-world data. Our extensive experiments on many benchmark imagery datasets demonstrate the tremendous potential of this new closed-loop formulation: under fair comparison, the visual quality of the learned decoder and the classification performance of the encoder are competitive with, and arguably better than, those of existing methods based on GAN, VAE, or a combination of both. Unlike existing generative models, the so-learned features of the multiple classes are structured instead of hidden: different classes are explicitly mapped onto corresponding independent principal subspaces in the feature space, and diverse visual attributes within each class are modeled by the independent principal components within each subspace.