On the relationship between class selectivity, dimensionality, and robustness
While the relative trade-offs between sparse and distributed representations
in deep neural networks (DNNs) are well-studied, less is known about how these
trade-offs apply to representations of semantically-meaningful information.
Class selectivity, the variability of a unit's responses across data classes or
dimensions, is one way of quantifying the sparsity of semantic representations.
Given recent evidence showing that class selectivity can impair generalization,
we sought to investigate whether it also confers robustness (or vulnerability)
to perturbations of input data. We found that mean class selectivity predicts
vulnerability to naturalistic corruptions; networks regularized to have lower
levels of class selectivity are more robust to corruption, while networks with
higher class selectivity are more vulnerable to corruption, as measured using
Tiny ImageNetC and CIFAR10C. In contrast, we found that class selectivity
increases robustness to multiple types of gradient-based adversarial attacks.
To examine this difference, we studied the dimensionality of the change in the
representation due to perturbation, finding that decreasing class selectivity
increases the dimensionality of this change for both corruption types, but with
a notably larger increase for adversarial attacks. These results demonstrate
the causal relationship between selectivity and robustness and provide new
insights into the mechanisms of this relationship.
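For concreteness, one common form of the class selectivity index contrasts a
unit's mean response to its most-activating class with its mean response to the
remaining classes. The NumPy sketch below illustrates the idea; it is a minimal
version of such an index, not necessarily the exact definition used in the
paper, and the eps constant is an implementation detail added here.

    import numpy as np

    def class_selectivity(activations, labels, eps=1e-7):
        """Class selectivity index of a single unit.

        activations: 1-D array of the unit's responses, one per input.
        labels: 1-D integer array of class labels, same length.
        Returns (mu_max - mu_rest) / (mu_max + mu_rest): ~0 means no class
        preference, ~1 means the unit responds to a single class only
        (for non-negative activations)."""
        classes = np.unique(labels)
        class_means = np.array([activations[labels == c].mean()
                                for c in classes])
        mu_max = class_means.max()
        mu_rest = np.delete(class_means, class_means.argmax()).mean()
        return (mu_max - mu_rest) / (mu_max + mu_rest + eps)

    # Example: a unit that fires mostly for class 0 is highly selective.
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 10, size=1000)
    acts = rng.random(1000) * 0.1 + (labels == 0) * 1.0
    print(class_selectivity(acts, labels))  # close to 1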
Insights on representational similarity in neural networks with canonical correlation
Comparing different neural network representations and determining how
representations evolve over time remain challenging open questions in our
understanding of the function of neural networks. Comparing representations in
neural networks is fundamentally difficult as the structure of representations
varies greatly, even across groups of networks trained on identical tasks, and
over the course of training. Here, we develop projection weighted CCA
(Canonical Correlation Analysis) as a tool for understanding neural networks,
building off of SVCCA, a recently proposed method (Raghu et al., 2017). We
first improve the core method, showing how to differentiate between signal and
noise, and then apply this technique to compare across a group of CNNs,
demonstrating that networks which generalize converge to more similar
representations than networks which memorize, that wider networks converge to
more similar solutions than narrow networks, and that trained networks with
identical topology but different learning rates converge to distinct clusters
with diverse representations. We also investigate the representational dynamics
of RNNs, across both training and sequential timesteps, finding that RNNs
converge in a bottom-up pattern over the course of training and that the hidden
state is highly variable over the course of a sequence, even when accounting
for linear transforms. Together, these results provide new insights into the
function of CNNs and RNNs, and demonstrate the utility of using CCA to
understand representations.
Comment: NIPS 2018
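As a rough illustration, projection weighted CCA compares two activation
matrices by computing canonical correlations between them and weighting each
correlation by how much of the first network's activations the corresponding
canonical direction accounts for. The NumPy sketch below captures that
weighting but omits the SVD-based signal/noise separation the paper develops,
so it is a simplified approximation rather than the authors' implementation.

    import numpy as np

    def pwcca(X, Y):
        """Projection weighted CCA similarity between activation matrices.

        X: (n_examples, n_neurons_x), Y: (n_examples, n_neurons_y).
        Simplified: no SVD pre-step to strip low-variance noise directions."""
        X = X - X.mean(axis=0)
        Y = Y - Y.mean(axis=0)
        qx, _ = np.linalg.qr(X)              # orthonormal basis of X's subspace
        qy, _ = np.linalg.qr(Y)
        u, rho, _ = np.linalg.svd(qx.T @ qy, full_matrices=False)
        dirs = qx @ u                        # one canonical direction per rho
        alpha = np.abs(dirs.T @ X).sum(axis=1)  # how much of X each explains
        alpha /= alpha.sum()
        return float((alpha * rho).sum())    # weighted mean correlation, [0, 1]

    rng = np.random.default_rng(0)
    A = rng.normal(size=(500, 64))
    print(pwcca(A, A @ rng.normal(size=(64, 32))))  # ~1: Y is a linear map of X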
Plan2Vec: Unsupervised Representation Learning by Latent Plans
In this paper we introduce plan2vec, an unsupervised representation learning
approach that is inspired by reinforcement learning. Plan2vec constructs a
weighted graph on an image dataset using near-neighbor distances, and then
extrapolates this local metric to a global embedding by distilling a
path integral over planned paths. When applied to control, plan2vec offers a way
to learn goal-conditioned value estimates that are accurate over long horizons
and both compute- and sample-efficient. We demonstrate the effectiveness of
plan2vec on one simulated and two challenging real-world image datasets.
Experimental results show that plan2vec successfully amortizes the planning
cost, enabling reactive planning that is linear in memory and computation
complexity rather than exhaustive search over the entire state space.
Comment: code available at https://geyang.github.io/plan2vec
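The graph construction and path-integration steps can be sketched with standard
tools. In the sketch below, Euclidean distances between random feature vectors
stand in for plan2vec's learned local metric, and SciPy's shortest-path routine
plays the role of the planner; it shows the structure of the method, not the
paper's actual pipeline.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import shortest_path

    rng = np.random.default_rng(0)
    feats = rng.normal(size=(200, 16))   # stand-in for image embeddings

    # 1. Local metric: a k-nearest-neighbor graph weighted by distance.
    d = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
    k = 5
    nn = np.argsort(d, axis=1)[:, 1:k + 1]            # skip self at index 0
    rows = np.repeat(np.arange(len(feats)), k)
    graph = csr_matrix((d[rows, nn.ravel()], (rows, nn.ravel())), shape=d.shape)

    # 2. Global metric: path-integrate local distances with a planner.
    global_dist = shortest_path(graph, method='D', directed=False)

    # plan2vec would now distill global_dist into an embedding so distances
    # can be read off directly, instead of re-planning over the whole graph.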
Grounding inductive biases in natural images: invariance stems from variations in data
To perform well on unseen and potentially out-of-distribution samples, it is
desirable for machine learning models to have a predictable response with
respect to transformations affecting the factors of variation of the input.
Here, we study the relative importance of several types of inductive biases
towards such predictable behavior: the choice of data, their augmentations, and
model architectures. Invariance is commonly achieved through hand-engineered
data augmentation, but do standard data augmentations address transformations
that explain variations in real data? While prior work has focused on synthetic
data, we attempt here to characterize the factors of variation in a real
dataset, ImageNet, and study the invariance of both standard residual networks
and the recently proposed vision transformer with respect to changes in these
factors. We show standard augmentation relies on a precise combination of
translation and scale, with translation recapturing most of the performance
improvement -- despite the (approximate) translation invariance built into
convolutional architectures, such as residual networks. In fact, we found that
scale and translation invariance was similar across residual networks and
vision transformer models despite their markedly different architectural
inductive biases. We show the training data itself is the main source of
invariance, and that data augmentation only further increases the learned
invariances. Notably, the invariances learned during training align with the
ImageNet factors of variation we found. Finally, we find that the main factors
of variation in ImageNet mostly relate to appearance and are specific to each
class.
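A simple way to probe such invariance empirically is to measure how often a
model's prediction survives a transformation of the input. The PyTorch sketch
below checks translation only and uses torchvision's functional transforms; the
paper's analysis is considerably more fine-grained than this single-number
probe.

    import torch
    import torchvision.transforms.functional as TF

    def invariance_to_shift(model, images, max_shift=8):
        """Fraction of predictions unchanged under a random translation.

        images: (N, C, H, W) batch; model: any image classifier."""
        model.eval()
        with torch.no_grad():
            base = model(images).argmax(1)
            dx, dy = torch.randint(-max_shift, max_shift + 1, (2,)).tolist()
            shifted = TF.affine(images, angle=0.0, translate=[dx, dy],
                                scale=1.0, shear=0.0)
            return (model(shifted).argmax(1) == base).float().mean().item()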
Learning to Make Analogies by Contrasting Abstract Relational Structure
Analogical reasoning has been a principal focus of various waves of AI
research. Analogy is particularly challenging for machines because it requires
relational structures to be represented such that they can be flexibly applied
across diverse domains of experience. Here, we study how analogical reasoning
can be induced in neural networks that learn to perceive and reason about raw
visual data. We find that the critical factor for inducing such a capacity is
not an elaborate architecture, but rather, careful attention to the choice of
data and the manner in which it is presented to the model. The most robust
capacity for analogical reasoning is induced when networks learn analogies by
contrasting abstract relational structures in their input domains, a training
method that uses only the input data to force models to learn about important
abstract features. Using this technique we demonstrate capacities for complex,
visual and symbolic analogy making and generalisation in even the simplest
neural network architectures.
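A minimal embedding-level sketch of this contrast-based setup is given below.
It is hypothetical: the paper works from raw visual input, and AnalogyScorer
and its dimensions are invented for illustration. The model scores each
candidate completion of an analogy and is trained with cross-entropy against
the correct one, where the incorrect candidates are constructed to instantiate
the same abstract relation with different surface attributes.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AnalogyScorer(nn.Module):
        """Scores each candidate completion against the analogy context."""
        def __init__(self, dim=64):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(2 * dim, 128), nn.ReLU(),
                                     nn.Linear(128, 1))

        def forward(self, context, candidates):
            # context: (B, dim); candidates: (B, K, dim)
            ctx = context.unsqueeze(1).expand_as(candidates)
            return self.mlp(torch.cat([ctx, candidates], -1)).squeeze(-1)

    scorer = AnalogyScorer()
    context, candidates = torch.randn(8, 64), torch.randn(8, 4, 64)
    target = torch.zeros(8, dtype=torch.long)   # index of the correct answer
    loss = F.cross_entropy(scorer(context, candidates), target)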
Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs
A wide variety of deep learning techniques from style transfer to multitask
learning rely on training affine transformations of features. Most prominent
among these is the popular feature normalization technique BatchNorm, which
normalizes activations and then subsequently applies a learned affine
transform. In this paper, we aim to understand the role and expressive power of
affine parameters used to transform features in this way. To isolate the
contribution of these parameters from that of the learned features they
transform, we investigate the performance achieved when training only these
parameters in BatchNorm and freezing all weights at their random
initializations. Doing so leads to surprisingly high performance considering
the significant limitations that this style of training imposes. For example,
sufficiently deep ResNets reach 82% (CIFAR-10) and 32% (ImageNet, top-5)
accuracy in this configuration, far higher than when training an equivalent
number of randomly chosen parameters elsewhere in the network. BatchNorm
achieves this performance in part by naturally learning to disable around a
third of the random features. Not only do these results highlight the
expressive power of affine parameters in deep learning, but - in a broader
sense - they characterize the expressive power of neural networks constructed
simply by shifting and rescaling random features.
Comment: Published in ICLR 2021
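Reproducing the basic experimental setup is straightforward in PyTorch: freeze
every parameter, then re-enable gradients only for the BatchNorm affine
parameters (gamma and beta). A sketch assuming a torchvision ResNet-18 stand-in
(the paper uses its own ResNet variants and training recipes):

    import torch
    import torchvision

    model = torchvision.models.resnet18(num_classes=10)

    # Freeze everything at its random initialization...
    for p in model.parameters():
        p.requires_grad = False
    # ...then re-enable only the BatchNorm affine parameters.
    for m in model.modules():
        if isinstance(m, torch.nn.BatchNorm2d):
            m.weight.requires_grad = True
            m.bias.requires_grad = True

    trainable = [p for p in model.parameters() if p.requires_grad]
    print(sum(p.numel() for p in trainable))  # a tiny fraction of all params
    optimizer = torch.optim.SGD(trainable, lr=0.1, momentum=0.9)
    # Training then proceeds as usual; conv and fc weights never move.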
The Early Phase of Neural Network Training
Recent studies have shown that many important aspects of neural network
learning take place within the very earliest iterations or epochs of training.
For example, sparse, trainable sub-networks emerge (Frankle et al., 2019),
gradient descent moves into a small subspace (Gur-Ari et al., 2018), and the
network undergoes a critical period (Achille et al., 2019). Here, we examine
the changes that deep neural networks undergo during this early phase of
training. We perform extensive measurements of the network state during these
early iterations of training and leverage the framework of Frankle et al.
(2019) to quantitatively probe the weight distribution and its reliance on
various aspects of the dataset. We find that, within this framework, deep
networks are not robust to reinitializing with random weights while maintaining
signs, and that weight distributions are highly non-independent even after only
a few hundred iterations. Despite this behavior, pre-training with blurred
inputs or an auxiliary self-supervised task can approximate the changes in
supervised networks, suggesting that these changes are not inherently
label-dependent, though labels significantly accelerate this process. Together,
these results help to elucidate the network changes occurring during this
pivotal initial period of learning.
Comment: ICLR 2020 Camera Ready. Available on OpenReview at
https://openreview.net/forum?id=Hkl1iRNFwS
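The sign-robustness probe mentioned above can be sketched as follows: draw
fresh random magnitudes but keep the signs the network has already settled on.
This is an illustrative approximation (with magnitudes rescaled to each
tensor's current standard deviation); the paper performs this test inside the
rewinding framework of Frankle et al. (2019).

    import torch

    def reinit_keep_signs(model, seed=0):
        """Replace each weight with a fresh random magnitude while keeping
        its current sign. A sketch of the probe, not the exact procedure."""
        g = torch.Generator().manual_seed(seed)
        with torch.no_grad():
            for p in model.parameters():
                fresh = torch.randn(p.shape, generator=g) * p.std()
                p.copy_(p.sign() * fresh.abs())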
One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers
The success of lottery ticket initializations (Frankle and Carbin, 2019)
suggests that small, sparsified networks can be trained so long as the network
is initialized appropriately. Unfortunately, finding these "winning ticket"
initializations is computationally expensive. One potential solution is to
reuse the same winning tickets across a variety of datasets and optimizers.
However, the generality of winning ticket initializations remains unclear.
Here, we attempt to answer this question by generating winning tickets for one
training configuration (optimizer and dataset) and evaluating their performance
on another configuration. Perhaps surprisingly, we found that, within the
natural images domain, winning ticket initializations generalized across a
variety of datasets, including Fashion MNIST, SVHN, CIFAR-10/100, ImageNet, and
Places365, often achieving performance close to that of winning tickets
generated on the same dataset. Moreover, winning tickets generated using larger
datasets consistently transferred better than those generated using smaller
datasets. We also found that winning ticket initializations generalize across
optimizers while maintaining high performance. These results suggest that winning ticket
initializations generated by sufficiently large datasets contain inductive
biases generic to neural networks more broadly which improve training across
many settings and provide hope for the development of better initialization
methods.
Comment: NeurIPS 2019
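The transfer experiment has a simple skeleton: derive a sparsity mask on one
configuration, rewind the surviving weights to their original initialization,
and train the masked network on another configuration. The sketch below uses
one-shot global magnitude pruning as a simplified stand-in for the iterative
magnitude pruning actually used to find winning tickets.

    import torch

    def magnitude_mask(model, sparsity=0.8):
        """One-shot global magnitude pruning mask (simplified stand-in)."""
        scores = torch.cat([p.detach().abs().flatten()
                            for p in model.parameters()])
        threshold = scores.kthvalue(int(sparsity * scores.numel())).values
        return {n: (p.detach().abs() > threshold).float()
                for n, p in model.named_parameters()}

    def apply_ticket(model, init_state, mask):
        """Rewind to the saved initialization and zero out pruned weights."""
        model.load_state_dict(init_state)
        with torch.no_grad():
            for n, p in model.named_parameters():
                p.mul_(mask[n])

    # Usage sketch: save init_state = copy.deepcopy(model.state_dict()) before
    # training; after training, mask = magnitude_mask(model), then
    # apply_ticket(model, init_state, mask) and retrain on the new dataset,
    # re-applying the mask after every optimizer step to keep pruned weights
    # at zero.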
Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP
The lottery ticket hypothesis proposes that over-parameterization of deep
neural networks (DNNs) aids training by increasing the probability of a "lucky"
sub-network initialization being present rather than by helping the
optimization process (Frankle & Carbin, 2019). Intriguingly, this phenomenon
suggests that initialization strategies for DNNs can be improved substantially,
but the lottery ticket hypothesis has only previously been tested in the
context of supervised learning for natural image tasks. Here, we evaluate
whether "winning ticket" initializations exist in two different domains:
natural language processing (NLP) and reinforcement learning (RL). For NLP, we
examined both recurrent LSTM models and large-scale Transformer models (Vaswani
et al., 2017). For RL, we analyzed a number of discrete-action space tasks,
including both classic control and pixel control. Consistent with work in
supervised image classification, we confirm that winning ticket initializations
generally outperform parameter-matched random initializations, even at extreme
pruning rates for both NLP and RL. Notably, we are able to find winning ticket
initializations for Transformers which enable models one-third the size to
achieve nearly equivalent performance. Together, these results suggest that the
lottery ticket hypothesis is not restricted to supervised learning of natural
images, but rather represents a broader phenomenon in DNNs.
Comment: ICLR 2020
Pooling is neither necessary nor sufficient for appropriate deformation stability in CNNs
Many of our core assumptions about how neural networks operate remain
empirically untested. One common assumption is that convolutional neural
networks need to be stable to small translations and deformations to solve
image recognition tasks. For many years, this stability was baked into CNN
architectures by incorporating interleaved pooling layers. Recently, however,
interleaved pooling has largely been abandoned. This raises a number of
questions: Are our intuitions about deformation stability right at all? Is it
important? Is pooling necessary for deformation invariance? If not, how is
deformation invariance achieved in its absence? In this work, we rigorously
test these questions, and find that deformation stability in convolutional
networks is more nuanced than it first appears: (1) Deformation invariance is
not a binary property; rather, different tasks require different degrees of
deformation stability at different layers. (2) Deformation stability
is not a fixed property of a network and is heavily adjusted over the course of
training, largely through the smoothness of the convolutional filters. (3)
Interleaved pooling layers are neither necessary nor sufficient for achieving
the optimal form of deformation stability for natural image classification. (4)
Pooling confers too much deformation stability for image classification at
initialization, and during training, networks have to learn to counteract this
inductive bias. Together, these findings provide new insights into the role of
interleaved pooling and deformation invariance in CNNs, and demonstrate the
importance of rigorous empirical testing of even our most basic assumptions
about the workings of neural networks.
Comment: NIPS 2018 submission
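One concrete way to quantify deformation stability, in the spirit of these
experiments, is to warp inputs with a small, smooth random displacement field
and measure how much the network's output representation moves. The PyTorch
sketch below is one such probe; the flow construction and cosine measure are
illustrative choices, not the paper's exact definitions.

    import torch
    import torch.nn.functional as F

    def deformation_sensitivity(model, images, strength=0.02, smooth=7):
        """How much a model's output moves under a small smooth random warp.

        images: (N, C, H, W) batch. An illustrative probe only."""
        n, _, h, w = images.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing='ij')
        grid = torch.stack([xs, ys], dim=-1).expand(n, h, w, 2)
        # Smooth displacement field: white noise blurred by average pooling.
        flow = F.avg_pool2d(torch.randn(n, 2, h, w), smooth, stride=1,
                            padding=smooth // 2)
        warped = F.grid_sample(images,
                               grid + strength * flow.permute(0, 2, 3, 1),
                               align_corners=True)
        with torch.no_grad():
            a = model(images).flatten(1)
            b = model(warped).flatten(1)
            return 1 - F.cosine_similarity(a, b).mean().item()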