EMP-SSL: Towards Self-Supervised Learning in One Training Epoch
Recently, self-supervised learning (SSL) has achieved tremendous success in
learning image representations. Despite this empirical success, most
self-supervised learning methods are rather "inefficient" learners, typically
taking hundreds of training epochs to fully converge. In this work, we show
that the key to efficient self-supervised learning is to increase the
number of crops taken from each image instance. Leveraging one of the
state-of-the-art SSL methods, we introduce a simple form of self-supervised
learning called Extreme-Multi-Patch Self-Supervised-Learning (EMP-SSL)
that does not rely on common SSL heuristics such as weight sharing
between branches, feature-wise normalization, output quantization, and stop
gradient, and that reduces the required training epochs by two orders of
magnitude. We show that the proposed method converges to 85.1% on CIFAR-10,
58.5% on CIFAR-100, 38.1% on Tiny ImageNet, and 58.5% on ImageNet-100 in just
one epoch. Furthermore, it achieves 91.5% on CIFAR-10, 70.1% on CIFAR-100,
51.5% on Tiny ImageNet, and 78.9% on ImageNet-100 with linear probing in fewer
than ten training epochs. In addition, EMP-SSL exhibits significantly better
transferability to out-of-domain datasets than baseline SSL methods. The code
will be released at https://github.com/tsb0601/EMP-SSL
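
To make the multi-patch idea concrete, here is a minimal sketch of the invariance term only: many crops per image are embedded by a shared encoder, and each patch embedding is pulled toward its image-level mean. The paper's full objective also includes a regularization term omitted here; function and variable names below are illustrative, not the authors' code.

# Minimal sketch of the invariance term (illustrative names, not the authors' exact objective).
import torch
import torch.nn.functional as F

def multi_patch_invariance_loss(z: torch.Tensor) -> torch.Tensor:
    # z: (batch, n_patches, dim) L2-normalized patch embeddings from a shared encoder.
    mean = F.normalize(z.mean(dim=1, keepdim=True), dim=-1)  # per-image mean embedding
    sim = (z * mean).sum(dim=-1)                             # cosine similarity, (batch, n_patches)
    return -sim.mean()                                       # pull every patch toward its image mean

# Toy usage: n_patches is the knob the abstract argues drives one-epoch convergence.
z = F.normalize(torch.randn(8, 20, 128, requires_grad=True), dim=-1)
loss = multi_patch_invariance_loss(z)
loss.backward()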
Variance-Covariance Regularization Improves Representation Learning
Transfer learning has emerged as a key approach in the machine learning
domain, enabling the application of knowledge derived from one domain to
improve performance on subsequent tasks. Given the often limited information
about these subsequent tasks, a strong transfer learning approach calls for the
model to capture a diverse range of features during the initial pretraining
stage. However, recent research suggests that, without sufficient
regularization, the network tends to concentrate on features that primarily
reduce the pretraining loss function. This tendency can result in inadequate
feature learning and impaired generalization capability for target tasks. To
address this issue, we propose Variance-Covariance Regularization (VCR), a
regularization technique aimed at fostering diversity in the learned network
features. Drawing inspiration from recent advances in self-supervised
learning, VCR promotes learned representations with high variance and minimal
covariance, preventing the network from focusing solely on loss-reducing
features.
We empirically validate the efficacy of our method through comprehensive
experiments coupled with in-depth analytical studies on the learned
representations. In addition, we develop an efficient implementation strategy
that ensures the method adds minimal computational overhead. Our results
indicate that VCR is a powerful and efficient method for enhancing transfer
learning performance in both supervised and self-supervised learning, opening
new possibilities for future research in this domain.
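
The variance and covariance terms can be sketched concretely. The snippet below follows the VICReg-style formulation the abstract alludes to: a hinge keeps each feature dimension's standard deviation above a margin, and off-diagonal covariance entries are penalized. The exact coefficients and normalization in VCR may differ, so treat this as an assumption-laden illustration.

import torch

def variance_covariance_penalty(z: torch.Tensor, margin: float = 1.0, eps: float = 1e-4):
    # z: (batch, dim) representations from one branch.
    z = z - z.mean(dim=0)                        # center each feature dimension
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = torch.relu(margin - std).mean()   # hinge: keep per-dimension std above the margin
    n, d = z.shape
    cov = (z.T @ z) / (n - 1)                    # (dim, dim) sample covariance
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = off_diag.pow(2).sum() / d         # decorrelate feature dimensions
    return var_loss, cov_loss

# Usage: add weighted var_loss and cov_loss to the pretraining loss; the weights are assumptions.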
Bag of Image Patch Embedding Behind the Success of Self-Supervised Learning
Self-supervised learning (SSL) has recently achieved tremendous empirical
advances in learning image representations. However, our understanding of
the principles behind learning such representations is still limited. This work
shows that joint-embedding SSL approaches primarily learn a representation of
image patches, which reflects their co-occurrence. Such a connection to
co-occurrence modeling can be established formally, and it supplements the
prevailing invariance perspective. We empirically show that learning a
representation for fixed-scale patches and aggregating local patch
representations as the image representation achieves similar or even better
results than the baseline methods. We denote this process as BagSSL. Even with
32×32 patch representations, BagSSL achieves 62% top-1 linear-probing accuracy
on ImageNet. Moreover, with a multi-scale pretrained model, we show
that the whole image embedding is approximately the average of local patch
embeddings. While the SSL representation is relatively invariant at the global
scale, we show that locality is preserved when we zoom into local patch-level
representation. Further, we show that patch representation aggregation can
improve various SOTA baseline methods by a large margin. The patch
representation is considerably easier to understand, and this work takes a
step toward demystifying self-supervised representation learning.
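
The aggregation step is simple to sketch: embed fixed-scale patches with a pretrained encoder and average the results. Whether to L2-normalize before averaging, and the stride, are assumptions here; the encoder and helper names are stand-ins.

import torch
import torch.nn.functional as F

def bag_of_patches_embedding(image: torch.Tensor, encoder, patch: int = 32, stride: int = 16) -> torch.Tensor:
    # image: (c, h, w); encoder maps (n, c, patch, patch) -> (n, dim).
    c, _, _ = image.shape
    tiles = image.unfold(1, patch, stride).unfold(2, patch, stride)   # (c, nh, nw, p, p)
    tiles = tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, patch, patch)
    with torch.no_grad():
        z = F.normalize(encoder(tiles), dim=-1)                       # embed each patch
    return z.mean(dim=0)                                              # image embedding = mean of patch embeddings

# Toy usage with a stand-in linear encoder:
enc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
img_embedding = bag_of_patches_embedding(torch.randn(3, 64, 64), enc)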
Minimalistic Unsupervised Learning with the Sparse Manifold Transform
We describe a minimalistic and interpretable method for unsupervised
learning that achieves performance close to SOTA SSL methods without
resorting to data augmentation, hyperparameter tuning, or other engineering
designs. Our approach leverages the sparse manifold transform, which unifies
sparse coding, manifold learning, and slow feature analysis. With a one-layer
deterministic sparse manifold transform, one can achieve 99.3% KNN top-1
accuracy on MNIST, 81.1% KNN top-1 accuracy on CIFAR-10, and 53.2% on CIFAR-100.
With a simple grayscale augmentation, the model reaches 83.2% KNN top-1 accuracy
on CIFAR-10 and 57% on CIFAR-100. These results significantly narrow the gap
between simplistic "white-box" methods and the SOTA methods. Additionally, we
provide visualization to explain how an unsupervised representation transform
is formed. The proposed method is closely connected to latent-embedding
self-supervised methods and can be treated as the simplest form of VICReg.
Though a small performance gap remains between our simple constructive
model and SOTA methods, the evidence points to this as a promising direction
for achieving a principled, white-box approach to unsupervised learning.
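
At a schematic level, the transform can be viewed as a generalized eigenproblem: find a linear map of the sparse codes whose outputs vary slowly across neighboring samples, subject to a whitening constraint. The numpy sketch below illustrates that view only; it is not the authors' implementation, and the neighbor pairs, regularizer, and output dimensionality are all assumptions.

import numpy as np
from scipy.linalg import eigh

def sparse_manifold_transform(A: np.ndarray, pairs: list, out_dim: int) -> np.ndarray:
    # A: (dict_size, n_samples) sparse codes; pairs: (i, j) neighbor index pairs.
    i_idx = [i for i, _ in pairs]
    j_idx = [j for _, j in pairs]
    D = A[:, i_idx] - A[:, j_idx]
    slowness = D @ D.T / len(pairs)            # variation of codes across neighboring samples
    cov = A @ A.T / A.shape[1]                 # whitening constraint: P cov P^T = I
    # Smallest generalized eigenvectors give the slowest-varying directions.
    _, vecs = eigh(slowness, cov + 1e-6 * np.eye(A.shape[0]))
    return vecs[:, :out_dim].T                 # P: (out_dim, dict_size)

# Usage: embeddings = P @ A yields slow, whitened features of the sparse codes.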