REALM: Robust Entropy Adaptive Loss Minimization for Improved Single-Sample Test-Time Adaptation
Fully-test-time adaptation (F-TTA) can mitigate performance loss due to
distribution shifts between train and test data (1) without access to the
training data, and (2) without knowledge of the model training procedure. In
online F-TTA, a pre-trained model is adapted using a stream of test samples by
minimizing a self-supervised objective, such as entropy minimization. However,
models adapted online with entropy minimization are unstable, especially in
single-sample settings, leading to degenerate solutions and limiting the
adoption of TTA inference strategies. Prior works identify noisy, or
unreliable, samples as a cause of failure in online F-TTA. One solution is to
ignore these samples, which can lead to bias in the update procedure, slow
adaptation, and poor generalization. In this work, we present a general
framework for improving robustness of F-TTA to these noisy samples, inspired by
self-paced learning and robust loss functions. Our proposed approach, Robust
Entropy Adaptive Loss Minimization (REALM), achieves better adaptation accuracy
than previous approaches throughout the adaptation process on corruptions of
CIFAR-10 and ImageNet-1K, demonstrating its effectiveness.
Comment: Accepted at WACV 2024; 17 pages, 7 figures, 11 tables.
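As a rough illustration of the setting (not the REALM objective itself), the sketch below performs one online adaptation step by minimizing prediction entropy on a single test sample and smoothly down-weights high-entropy samples instead of discarding them; the weighting scheme, the margin e_margin, and the choice of updating all parameters are assumptions made for brevity.

import torch
import torch.nn.functional as F

def prediction_entropy(logits):
    # Shannon entropy of the softmax prediction, per sample.
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)

def adapt_step(model, optimizer, x, e_margin=2.0):
    # One online fully-test-time adaptation step on a single test sample x of
    # shape (1, C, H, W). Rather than hard-filtering unreliable (high-entropy)
    # samples, apply a smooth down-weight so every sample still contributes to
    # the update; this weighting is illustrative, not the REALM formulation.
    logits = model(x)
    ent = prediction_entropy(logits)
    weight = torch.exp(-(ent.detach() - e_margin).clamp(min=0.0))
    loss = (weight * ent).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return logits.detach()

In practice, TTA methods typically restrict the update to a small parameter subset (e.g., normalization-layer affine parameters); that detail is omitted here.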
Stabilizing Transformer Training by Preventing Attention Entropy Collapse
Training stability is of great importance to Transformers. In this work, we
investigate the training dynamics of Transformers by examining the evolution of
the attention layers. In particular, we track the attention entropy for each
attention head during the course of training, which is a proxy for model
sharpness. We identify a common pattern across different architectures and
tasks, where low attention entropy is accompanied by high training instability,
which can take the form of oscillating loss or divergence. We denote the
pathologically low attention entropy, corresponding to highly concentrated
attention scores, as entropy collapse. As a remedy, we propose
σReparam, a simple and efficient solution where we reparametrize all
linear layers with spectral normalization and an additional learned scalar. We
demonstrate that the proposed reparameterization successfully prevents entropy
collapse in the attention layers, promoting more stable training. Additionally,
we prove a tight lower bound of the attention entropy, which decreases
exponentially fast with the spectral norm of the attention logits, providing
additional motivation for our approach. We conduct experiments with
σReparam on image classification, image self-supervised learning,
machine translation, automatic speech recognition, and language modeling tasks,
across Transformer architectures. We show that σReparam provides
stability and robustness with respect to the choice of hyperparameters, going
so far as enabling the training of (a) a Vision Transformer to competitive
performance without warmup, weight decay, layer normalization, or adaptive
optimizers; (b) deep architectures in machine translation; and (c) speech
recognition models to competitive performance without warmup and adaptive
optimizers.
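A minimal sketch of the kind of reparameterization described above, assuming a PyTorch-style linear layer: the effective weight is the raw weight divided by a power-iteration estimate of its spectral norm and scaled by a learned scalar. The single power-iteration step and the initialization choices are assumptions, not the paper's exact recipe.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SigmaReparamLinear(nn.Module):
    # Linear layer whose effective weight is gamma * W / sigma(W), where
    # sigma(W) is a power-iteration estimate of the spectral norm of W.
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
        self.gamma = nn.Parameter(torch.ones(1))  # learned scalar
        self.register_buffer("u", F.normalize(torch.randn(out_features), dim=0))

    @torch.no_grad()
    def _power_iteration(self):
        # One power-iteration step to refresh the singular-vector estimates.
        v = F.normalize(self.weight.t() @ self.u, dim=0)
        u = F.normalize(self.weight @ v, dim=0)
        self.u.copy_(u)
        return u, v

    def forward(self, x):
        u, v = self._power_iteration()
        sigma = torch.dot(u, self.weight @ v)  # spectral norm estimate
        return F.linear(x, self.gamma * self.weight / sigma, self.bias)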
The Role of Entropy and Reconstruction in Multi-View Self-Supervised Learning
The mechanisms behind the success of multi-view self-supervised learning
(MVSSL) are not yet fully understood. Contrastive MVSSL methods have been
studied through the lens of InfoNCE, a lower bound of the Mutual Information
(MI). However, the relation between other MVSSL methods and MI remains unclear.
We consider a different lower bound on the MI consisting of an entropy and a
reconstruction term (ER), and analyze the main MVSSL families through its lens.
Through this ER bound, we show that clustering-based methods such as
DeepCluster and SwAV maximize the MI. We also re-interpret the mechanisms of
distillation-based approaches such as BYOL and DINO, showing that they
explicitly maximize the reconstruction term and implicitly encourage a stable
entropy, and we confirm this empirically. We show that replacing the objectives
of common MVSSL methods with this ER bound achieves competitive performance,
while making them stable when training with smaller batch sizes or smaller
exponential moving average (EMA) coefficients.
GitHub repo: https://github.com/apple/ml-entropy-reconstruction
Comment: 18 pages (9 of main text, 2 of references, and 7 of supplementary
material); appears in the proceedings of ICML 2023.
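One way to read the ER bound in code, as a hedged sketch rather than the paper's estimators: maximize an entropy estimate of one view's embeddings plus the log-likelihood of reconstructing them from the other view. Below, the entropy term is a leave-one-out Gaussian kernel-density estimate over the batch and the reconstruction term is a unit-variance Gaussian log-likelihood (negative squared error); both estimator choices and the bandwidth sigma are illustrative assumptions.

import math
import torch

def er_objective(z1, z2, sigma=0.5):
    # z1, z2: (N, D) embeddings of two views of the same images (assumed
    # L2-normalized). Returns an Entropy + Reconstruction style objective
    # to be *maximized* (negate it to use as a loss).
    n = z1.shape[0]
    # Leave-one-out kernel-density estimate of the entropy of z1.
    d2 = torch.cdist(z1, z1).pow(2)                      # (N, N) squared distances
    log_kernel = -d2 / (2 * sigma ** 2)
    diag = torch.eye(n, dtype=torch.bool, device=z1.device)
    log_kernel = log_kernel.masked_fill(diag, float("-inf"))
    log_density = torch.logsumexp(log_kernel, dim=1) - math.log(n - 1)
    entropy = -log_density.mean()
    # Cross-view reconstruction: log q(z1 | z2) under a unit-variance Gaussian.
    reconstruction = -(z1 - z2).pow(2).sum(dim=1).mean()
    return entropy + reconstruction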
DUET: 2D Structured and Approximately Equivariant Representations
Multiview Self-Supervised Learning (MSSL) is based on learning invariances
with respect to a set of input transformations. However, invariance partially
or totally removes transformation-related information from the representations,
which might harm performance for specific downstream tasks that require such
information. We propose 2D strUctured and EquivarianT representations (coined
DUET), which are 2D representations organized in a matrix structure, and
equivariant with respect to transformations acting on the input data. DUET
representations maintain information about an input transformation, while
remaining semantically expressive. Compared to SimCLR (Chen et al., 2020)
(unstructured and invariant) and ESSL (Dangovski et al., 2022) (unstructured
and equivariant), the structured and equivariant nature of DUET representations
enables controlled generation with lower reconstruction error, while
controllability is not possible with SimCLR or ESSL. DUET also achieves higher
accuracy for several discriminative tasks and improves transfer learning.
Comment: Accepted at ICML 2023.
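The following toy sketch conveys the flavor of a structured, equivariant representation rather than DUET's actual construction: the encoder is assumed to output a matrix-shaped representation, the transformation group is assumed to be the four 90-degree rotations, and equivariance is encouraged by asking the representation of a rotated input to match a row-shifted version of the original's representation. All of these choices are illustrative assumptions.

import torch
import torch.nn.functional as F

def structured_equivariance_loss(encoder, x, k):
    # x: (B, C, H, W) images; encoder(x) is assumed to return a 2D-structured
    # representation of shape (B, 4, D), one row per 90-degree rotation.
    # Rotating the input by k steps should cyclically shift the rows by k;
    # this cyclic-shift group action is an assumption, not DUET's design.
    z = encoder(x)                                  # (B, 4, D)
    x_rot = torch.rot90(x, k, dims=(2, 3))          # rotate by k * 90 degrees
    z_rot = encoder(x_rot)                          # (B, 4, D)
    target = torch.roll(z, shifts=k, dims=1)        # row-shifted representation
    return F.mse_loss(z_rot, target)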
Position Prediction as an Effective Pretraining Strategy
Transformers have gained increasing popularity in a wide range of
applications, including Natural Language Processing (NLP), Computer Vision and
Speech Recognition, because of their powerful representational capacity.
However, harnessing this representational capacity effectively requires a large
amount of data, strong regularization, or both, to mitigate overfitting.
Recently, the power of the Transformer has been unlocked by self-supervised
pretraining strategies based on masked autoencoders, which rely on
reconstructing masked inputs, directly or contrastively, from unmasked content.
This pretraining strategy, which has been used in BERT models in NLP, Wav2Vec
models in Speech, and, recently, in MAE models in Vision, forces the model to
learn about relationships between the content in different parts of the input
using autoencoding-related objectives. In this paper, we propose a novel but
surprisingly simple alternative to content reconstruction: predicting
locations from content, without providing positional information for it. Doing
so requires the Transformer to understand the positional relationships between
different parts of the input, from their content alone. This amounts to an
efficient implementation where the pretext task is a classification problem
among all possible positions for each input token. We experiment on both Vision
and Speech benchmarks, where our approach brings improvements over strong
supervised training baselines and is comparable to modern
unsupervised/self-supervised pretraining methods. Our method also enables
Transformers trained without position embeddings to outperform ones trained
with full position information.
Comment: Accepted to ICML 2022.
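As a hedged sketch of the pretext task described above (illustrative shapes, hyperparameters, and architecture, not the paper's implementation): token embeddings are fed to a Transformer encoder with no positional information, and a linear head classifies, for every token, which of the possible positions it came from.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionPredictionPretraining(nn.Module):
    # Pretext task: predict each token's original position from content alone.
    # The encoder sees content embeddings without positional embeddings and a
    # linear head scores all candidate positions per token. The hyperparameters
    # below are illustrative assumptions.
    def __init__(self, dim=256, num_positions=196, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.position_head = nn.Linear(dim, num_positions)

    def forward(self, tokens):
        # tokens: (B, N, dim) content embeddings (e.g., linear patch projections),
        # with N equal to num_positions.
        b, n, _ = tokens.shape
        # Shuffle tokens so their order carries no positional hint.
        perm = torch.stack([torch.randperm(n, device=tokens.device) for _ in range(b)])
        shuffled = torch.gather(tokens, 1, perm.unsqueeze(-1).expand_as(tokens))
        logits = self.position_head(self.encoder(shuffled))   # (B, N, num_positions)
        # Each shuffled token must recover its original position index.
        return F.cross_entropy(logits.reshape(b * n, -1), perm.reshape(-1))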