Lifelong Generative Modeling
Lifelong learning is the problem of learning multiple consecutive tasks in a
sequential manner, where knowledge gained from previous tasks is retained and
used to aid future learning over the lifetime of the learner. It is essential
to the development of intelligent machines that can adapt to their
surroundings. In this work we focus on a lifelong learning approach to
unsupervised generative modeling, where we continuously incorporate newly
observed distributions into a learned model. We do so through a student-teacher
Variational Autoencoder architecture which allows us to learn and preserve all
the distributions seen so far, without needing to retain either the past data or the
past models. Through the introduction of a novel cross-model regularizer,
inspired by a Bayesian update rule, the student model leverages the information
learned by the teacher, which acts as a probabilistic knowledge store. The
regularizer reduces the effect of catastrophic interference that appears when
we learn over sequences of distributions. We validate our model's performance
on sequential variants of MNIST, FashionMNIST, PermutedMNIST, SVHN and Celeb-A
and demonstrate that our model mitigates the effects of catastrophic
interference faced by neural networks in sequential learning scenarios.
Comment: 32 pages.
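To make the mechanism concrete, the following is a minimal PyTorch sketch of one student-teacher VAE training step with a cross-model consistency regularizer. The module interfaces (student.encode, teacher.encode, student.decode) and the exact form of the regularizer are illustrative assumptions, not the paper's precise Bayesian-update rule.

    import torch
    import torch.nn.functional as F
    from torch.distributions import Normal, kl_divergence

    def student_step(student, teacher, x_new, x_replay):
        # x_new: batch from the newly observed distribution.
        # x_replay: samples generated by the frozen teacher, standing in for past data.
        mu, std = student.encode(x_new)
        q_new = Normal(mu, std)
        z = q_new.rsample()
        recon = student.decode(z)                      # assumed to end in a sigmoid
        neg_elbo = (F.binary_cross_entropy(recon, x_new, reduction="sum")
                    + kl_divergence(q_new, Normal(0.0, 1.0)).sum())

        # Cross-model regularizer (sketch): keep the student's posterior on
        # teacher-generated samples close to the frozen teacher's posterior,
        # so distributions learned earlier are preserved without storing data.
        with torch.no_grad():
            mu_t, std_t = teacher.encode(x_replay)
        mu_s, std_s = student.encode(x_replay)
        consistency = kl_divergence(Normal(mu_s, std_s), Normal(mu_t, std_t)).sum()
        return neg_elbo + consistency
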
Continual Classification Learning Using Generative Models
Continual learning is the ability to learn sequentially over time,
accommodating new knowledge while retaining previously learned experiences. Neural
networks can learn multiple tasks when trained on them jointly, but cannot
maintain performance on previously learned tasks when tasks are presented one
at a time. This problem is called catastrophic forgetting. In this work, we
propose a classification model that learns continuously from sequentially
observed tasks, while preventing catastrophic forgetting. We build on the
lifelong generative capabilities of [10] and extend them to the classification
setting by deriving a new variational bound on the joint log likelihood, ln p(x, y).
Comment: 5 pages, 4 figures, under review at the Continual Learning Workshop, NIPS 2018.
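For reference, the standard joint evidence lower bound that such a derivation typically starts from, assuming a single shared latent z with p(x, y, z) = p(z) p(x|z) p(y|z), is the following (a generic bound, not necessarily the paper's exact one):

    \ln p(x, y) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[ \ln p_\theta(x \mid z) + \ln p_\theta(y \mid z) \big] \;-\; \mathrm{KL}\big( q_\phi(z \mid x) \,\|\, p(z) \big)
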
Kanerva++: Extending the Kanerva Machine with Differentiable, Locally Block Allocated Latent Memory
Episodic and semantic memory are critical components of the human memory
model. The theory of complementary learning systems (McClelland et al., 1995)
suggests that the compressed representation produced by a serial event
(episodic memory) is later restructured to build a more generalized form of
reusable knowledge (semantic memory). In this work we develop a new principled
Bayesian memory allocation scheme that bridges the gap between episodic and
semantic memory via a hierarchical latent variable model. We take inspiration
from traditional heap allocation and extend the idea of locally contiguous
memory to the Kanerva Machine, enabling a novel differentiable block allocated
latent memory. In contrast to the Kanerva Machine, we simplify the process of
memory writing by treating it as a fully feed-forward, deterministic process,
relying on the stochasticity of the read key distribution to disperse
information within the memory. We demonstrate that this allocation scheme
improves performance in memory-conditional image generation, resulting in new
state-of-the-art conditional likelihood values on binarized MNIST (<= 41.58
nats/image) and binarized Omniglot (<= 66.24 nats/image), as well as
competitive performance on CIFAR10, DMLab Mazes, Celeb-A and ImageNet32x32.
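One way to picture a differentiable, locally block-allocated write is sketched below in PyTorch: block contents are produced feed-forward and deterministically, a softmax over candidate start offsets places the block softly into contiguous slots, and reads draw a stochastic key. All names, shapes and the soft-placement trick are illustrative simplifications, not the paper's actual model.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BlockAllocatedMemory(nn.Module):
        def __init__(self, slots=64, width=32, block=8):
            super().__init__()
            self.slots, self.block = slots, block
            self.to_block = nn.Linear(width, block * width)       # episode -> block contents
            self.to_start = nn.Linear(width, slots - block + 1)   # episode -> start-offset logits

        def write(self, episode):                                  # episode: (B, width)
            B = episode.shape[0]
            contents = self.to_block(episode).view(B, self.block, -1)
            start = torch.softmax(self.to_start(episode), dim=-1)  # soft choice of block start
            memory = 0.0
            for s in range(start.shape[-1]):
                # Place the block at offset s, weighted by how likely that offset is.
                placed = F.pad(contents, (0, 0, s, self.slots - self.block - s))
                memory = memory + start[:, s, None, None] * placed
            return memory                                          # (B, slots, width)

        def read(self, memory, key_mu, key_std):
            # Stochastic read key (reparameterized) spreads reads over the memory.
            key = key_mu + key_std * torch.randn_like(key_std)
            attn = torch.softmax(torch.einsum("bsw,bw->bs", memory, key), dim=-1)
            return torch.einsum("bs,bsw->bw", attn, memory)
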
Variational Saccading: Efficient Inference for Large Resolution Images
Image classification with deep neural networks is typically restricted to
images of small dimensionality, such as 224 x 224 in ResNet models [24]. This
limitation excludes the 4000 x 3000 pixel images taken by modern
smartphone cameras and smart devices. In this work, we aim to mitigate the
prohibitive inferential and memory costs of operating in such large dimensional
spaces. To sample from the high-resolution original input distribution, we
propose using a smaller proxy distribution to learn the co-ordinates that
correspond to regions of interest in the high-dimensional space. We introduce a
new principled variational lower bound that captures the relationship between the
proxy distribution's posterior and the original image's co-ordinate space in a
way that maximizes the conditional classification likelihood. We empirically
demonstrate on one synthetic benchmark and one real world large resolution DSLR
camera image dataset that our method produces comparable results with ~10x
faster inference and lower memory consumption than a model that utilizes the
entire original input distribution. Finally, we experiment with a more complex
setting using mini-maps from Starcraft II [56] to infer the number of
characters in a complex 3d-rendered scene. Even in such complicated scenes our
model provides strong localization, a feature missing from traditional
classification models.
Comment: Published at BMVC 2019 and the NIPS 2018 Bayesian Deep Learning Workshop.
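A rough PyTorch sketch of the idea of classifying a large image from a small, learned glimpse is given below; it uses a plain spatial-transformer-style crop and omits the variational bound, so the network names, the fixed crop size and the objective are illustrative assumptions rather than the paper's model.

    import torch
    import torch.nn.functional as F

    def glimpse_classify(policy_net, classifier, x_full, crop=64, proxy=32):
        # x_full: (B, C, H, W) high-resolution images; H == W assumed for brevity.
        B, C, H, W = x_full.shape
        # Cheap low-resolution proxy used only to decide where to look.
        x_proxy = F.interpolate(x_full, size=(proxy, proxy), mode="bilinear",
                                align_corners=False)
        center = torch.tanh(policy_net(x_proxy))          # (B, 2), crop centre in [-1, 1]
        scale = crop / H                                   # fixed-size glimpse
        theta = torch.zeros(B, 2, 3, device=x_full.device)
        theta[:, 0, 0] = scale
        theta[:, 1, 1] = scale
        theta[:, :, 2] = center
        # Differentiable crop of the original image around the predicted centre.
        grid = F.affine_grid(theta, size=(B, C, crop, crop), align_corners=False)
        x_crop = F.grid_sample(x_full, grid, align_corners=False)
        return classifier(x_crop)                          # class logits from the glimpse only
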
The Role of Entropy and Reconstruction in Multi-View Self-Supervised Learning
The mechanisms behind the success of multi-view self-supervised learning
(MVSSL) are not yet fully understood. Contrastive MVSSL methods have been
studied through the lens of InfoNCE, a lower bound of the Mutual Information
(MI). However, the relation between other MVSSL methods and MI remains unclear.
We consider a different lower bound on the MI consisting of an entropy and a
reconstruction term (ER), and analyze the main MVSSL families through its lens.
Through this ER bound, we show that clustering-based methods such as
DeepCluster and SwAV maximize the MI. We also re-interpret the mechanisms of
distillation-based approaches such as BYOL and DINO, showing that they
explicitly maximize the reconstruction term and implicitly encourage a stable
entropy, and we confirm this empirically. We show that replacing the objectives
of common MVSSL methods with this ER bound achieves competitive performance,
while making them stable when training with smaller batch sizes or smaller
exponential moving average (EMA) coefficients.
GitHub repo: https://github.com/apple/ml-entropy-reconstruction
Comment: 18 pages (9 of main text, 2 of references, and 7 of supplementary
material). Appears in the proceedings of ICML 2023.
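The kind of entropy-plus-reconstruction lower bound referred to here can be stated compactly for two view representations Z1 and Z2 and any variational decoder q (a standard variational bound; the paper's exact formulation may differ):

    I(Z_1; Z_2) \;=\; H(Z_2) - H(Z_2 \mid Z_1) \;\ge\; \underbrace{H(Z_2)}_{\text{entropy}} \;+\; \underbrace{\mathbb{E}_{p(z_1, z_2)}\big[\ln q(z_2 \mid z_1)\big]}_{\text{reconstruction}}

This holds because the cross-entropy -E[ln q(z_2 | z_1)] upper-bounds the conditional entropy H(Z_2 | Z_1).
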
DUET: 2D Structured and Approximately Equivariant Representations
Multiview Self-Supervised Learning (MSSL) is based on learning invariances
with respect to a set of input transformations. However, invariance partially
or totally removes transformation-related information from the representations,
which might harm performance for specific downstream tasks that require such
information. We propose 2D strUctured and EquivarianT representations (coined
DUET), which are 2D representations organized in a matrix structure and
equivariant with respect to transformations acting on the input data. DUET
representations maintain information about an input transformation, while
remaining semantically expressive. Compared to SimCLR (Chen et al., 2020)
(unstructured and invariant) and ESSL (Dangovski et al., 2022) (unstructured
and equivariant), the structured and equivariant nature of DUET representations
enables controlled generation with lower reconstruction error, while
controllability is not possible with SimCLR or ESSL. DUET also achieves higher
accuracy for several discriminative tasks, and improves transfer learning.
Comment: Accepted at ICML 2023.
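A toy PyTorch sketch of what a 2D-structured, transformation-aware head could look like is given below: one axis of the matrix-shaped representation is pushed to be invariant (InfoNCE across views) while the other must identify which transformation was applied. The head layout and loss are illustrative assumptions, not DUET's actual objective.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class StructuredHead(nn.Module):
        # Rows ~ semantic features, columns ~ a discretized transformation
        # parameter (e.g., a rotation bin). Illustrative only.
        def __init__(self, feat_dim=512, sem=64, tbins=4):
            super().__init__()
            self.proj = nn.Linear(feat_dim, sem * tbins)
            self.sem, self.tbins = sem, tbins

        def forward(self, h):                           # h: (B, feat_dim)
            R = self.proj(h).view(-1, self.sem, self.tbins)
            z_sem = R.mean(dim=2)                       # pooled over the transform axis
            t_logits = R.mean(dim=1)                    # pooled over the semantic axis
            return z_sem, t_logits

    def duet_style_loss(head, h1, h2, t_labels, temperature=0.1):
        # h1, h2: backbone features of two views; t_labels: index of the
        # transformation applied to view 2 (e.g., which rotation).
        z1, _ = head(h1)
        z2, t_logits = head(h2)
        # Equivariance: the transform axis must identify the applied transform.
        equiv = F.cross_entropy(t_logits, t_labels)
        # Invariance on the semantic axis: simple InfoNCE between the two views.
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        logits = z1 @ z2.t() / temperature
        targets = torch.arange(z1.shape[0], device=z1.device)
        inv = F.cross_entropy(logits, targets)
        return inv + equiv
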
Stabilizing Transformer Training by Preventing Attention Entropy Collapse
Training stability is of great importance to Transformers. In this work, we
investigate the training dynamics of Transformers by examining the evolution of
the attention layers. In particular, we track the attention entropy for each
attention head during the course of training, which is a proxy for model
sharpness. We identify a common pattern across different architectures and
tasks, where low attention entropy is accompanied by high training instability,
which can take the form of oscillating loss or divergence. We denote the
pathologically low attention entropy, corresponding to highly concentrated
attention scores, as entropy collapse. As a remedy, we propose
σReparam, a simple and efficient solution where we reparametrize all
linear layers with spectral normalization and an additional learned scalar. We
demonstrate that the proposed reparameterization successfully prevents entropy
collapse in the attention layers, promoting more stable training. Additionally,
we prove a tight lower bound on the attention entropy, which decreases
exponentially fast with the spectral norm of the attention logits, providing
additional motivation for our approach. We conduct experiments with
σReparam on image classification, image self-supervised learning,
machine translation, automatic speech recognition, and language modeling tasks,
across Transformer architectures. We show that σReparam provides
stability and robustness with respect to the choice of hyperparameters, going
so far as enabling the training of (a) a Vision Transformer to competitive
performance without warmup, weight decay, layer normalization or adaptive
optimizers; (b) deep architectures in machine translation; and (c) speech
recognition to competitive performance without warmup and adaptive optimizers.
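A minimal sketch of the reparameterization described above: the effective weight of every linear layer is (gamma / spectral_norm(W)) * W, with gamma a learned scalar and the spectral norm tracked by one power-iteration step per forward pass. Details such as initialization follow the general recipe and may differ from the official implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SigmaReparamLinear(nn.Module):
        def __init__(self, d_in, d_out):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(d_out, d_in) / d_in ** 0.5)
            self.bias = nn.Parameter(torch.zeros(d_out))
            self.gamma = nn.Parameter(torch.ones(()))      # learned scalar
            self.register_buffer("u", torch.randn(d_out))  # power-iteration vector

        def forward(self, x):
            with torch.no_grad():                           # estimate the spectral norm
                v = F.normalize(self.weight.t() @ self.u, dim=0)
                self.u = F.normalize(self.weight @ v, dim=0)
            sigma = torch.einsum("i,ij,j->", self.u, self.weight, v)
            w_hat = (self.gamma / sigma) * self.weight      # spectrally normalized weight
            return x @ w_hat.t() + self.bias
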
Position Prediction as an Effective Pretraining Strategy
Transformers have gained increasing popularity in a wide range of
applications, including Natural Language Processing (NLP), Computer Vision and
Speech Recognition, because of their powerful representational capacity.
However, harnessing this representational capacity effectively requires a large
amount of data, strong regularization, or both, to mitigate overfitting.
Recently, the power of the Transformer has been unlocked by self-supervised
pretraining strategies based on masked autoencoders, which rely on
reconstructing masked inputs, either directly or contrastively, from unmasked
content. This pretraining strategy, which has been used in BERT models in NLP,
Wav2Vec models in Speech and, recently, in MAE models in Vision, forces the model to
learn about relationships between the content in different parts of the input
using autoencoding-related objectives. In this paper, we propose a novel but
surprisingly simple alternative to content reconstruction: predicting
locations from content, without providing positional information for it. Doing
so requires the Transformer to understand the positional relationships between
different parts of the input, from their content alone. This amounts to an
efficient implementation where the pretext task is a classification problem
among all possible positions for each input token. We experiment on both Vision
and Speech benchmarks, where our approach brings improvements over strong
supervised training baselines and is comparable to modern
unsupervised/self-supervised pretraining methods. Our method also enables
Transformers trained without position embeddings to outperform ones trained
with full position information.
Comment: Accepted to ICML 2022.
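A compact PyTorch sketch of the pretext task described here: token embeddings are fed without positional information (and shuffled, which a permutation-equivariant Transformer cannot distinguish anyway), and a head must classify each token's true position among all N candidates. The names (encoder, pos_head) and the absence of masking are illustrative simplifications.

    import torch
    import torch.nn.functional as F

    def position_prediction_loss(encoder, pos_head, tokens):
        # tokens: (B, N, D) content embeddings (image patches or speech frames),
        # deliberately given to the model with NO positional embeddings.
        B, N, D = tokens.shape
        perm = torch.stack([torch.randperm(N, device=tokens.device) for _ in range(B)])
        shuffled = torch.gather(tokens, 1, perm.unsqueeze(-1).expand(-1, -1, D))
        h = encoder(shuffled)                         # (B, N, D), position-agnostic features
        logits = pos_head(h)                          # (B, N, N): a score for every position
        # Each output token must recover the position its content came from.
        return F.cross_entropy(logits.reshape(B * N, N), perm.reshape(B * N))
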