On the reversed bias-variance tradeoff in deep ensembles
Deep ensembles aggregate predictions of diverse neural networks to improve generalisation and quantify uncertainty. Here, we investigate their behavior when increasing the ensemble members' parameter size, a practice typically associated with better performance for single models. We show that under practical assumptions in the overparametrized regime far into the double descent curve, not only does the ensemble test loss degrade, but common out-of-distribution detection and calibration metrics suffer as well. Reminiscent of deep double descent, we observe this phenomenon not only when increasing the single member's capacity but also as we increase the training budget, suggesting deep ensembles can benefit from early stopping. This sheds light on the success and failure modes of deep ensembles and suggests that averaging finite-width models performs better than the neural tangent kernel limit for these metrics.
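As a minimal sketch of the object under study, the snippet below (an illustrative PyTorch toy, not the paper's setup) builds a small deep ensemble, averages the members' softmax outputs, and derives a predictive-entropy uncertainty score; member width and training budget are the knobs whose scaling is discussed above.

    # Minimal deep-ensemble sketch (illustrative only): average the softmax
    # outputs of independently trained members of width `hidden`.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def make_member(in_dim=32, hidden=256, out_dim=10):
        # One ensemble member; `hidden` is the capacity knob discussed above.
        return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                             nn.Linear(hidden, out_dim))

    members = [make_member() for _ in range(5)]   # independently initialised members
    x = torch.randn(8, 32)                        # dummy test batch

    with torch.no_grad():
        probs = torch.stack([F.softmax(m(x), dim=-1) for m in members])
        ensemble_probs = probs.mean(dim=0)        # the ensemble's predictive distribution

    # Predictive entropy, a common uncertainty / OOD score for deep ensembles.
    entropy = -(ensemble_probs * ensemble_probs.clamp_min(1e-12).log()).sum(dim=-1)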
A contrastive rule for meta-learning
Meta-learning algorithms leverage regularities that are present across a set of tasks to speed up and improve the performance of a subsidiary learning process. Recent work on deep neural networks has shown that prior gradient-based learning of meta-parameters can greatly improve the efficiency of subsequent learning. Here, we present a biologically plausible meta-learning algorithm based on equilibrium propagation. Instead of explicitly differentiating the learning process, our contrastive meta-learning rule estimates meta-parameter gradients by executing the subsidiary process more than once. This avoids reversing the learning dynamics in time and computing second-order derivatives. In spite of this, and unlike previous first-order methods, our rule recovers an arbitrarily accurate meta-parameter update given enough compute. We establish theoretical bounds on its performance and present experiments on a set of standard benchmarks and neural network architectures.
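The following toy sketch illustrates the contrastive idea on an assumed quadratic problem in which the meta-parameter is an L2 regularisation strength: the inner learning process is run twice, once free and once nudged towards the outer (validation) loss, and a meta-gradient estimate is read off a finite difference. All names, dimensions and the choice of meta-parameter are illustrative assumptions, not the paper's benchmarks.

    # Contrastive (equilibrium-propagation-style) meta-gradient on a toy problem.
    import torch

    x_tr, y_tr = torch.randn(20, 5), torch.randn(20, 1)    # dummy meta-training split
    x_val, y_val = torch.randn(20, 5), torch.randn(20, 1)  # dummy meta-validation split
    lam = 0.1     # meta-parameter: L2 regularisation strength (illustrative choice)
    beta = 1e-2   # nudging strength; smaller values give a more accurate estimate

    def inner_loss(phi, lam):
        return ((x_tr @ phi - y_tr) ** 2).mean() + lam * (phi ** 2).sum()

    def outer_loss(phi):
        return ((x_val @ phi - y_val) ** 2).mean()

    def solve_inner(beta, steps=500, lr=0.05):
        # Run the subsidiary learning process on the (optionally nudged) objective.
        phi = torch.zeros(5, 1, requires_grad=True)
        for _ in range(steps):
            loss = inner_loss(phi, lam) + beta * outer_loss(phi)
            grad, = torch.autograd.grad(loss, phi)
            with torch.no_grad():
                phi -= lr * grad
        return phi.detach()

    phi_free = solve_inner(beta=0.0)     # first phase: plain learning
    phi_nudged = solve_inner(beta=beta)  # second phase: learning nudged towards the outer loss

    # Contrastive meta-gradient estimate: finite difference, across the two phases,
    # of the partial derivative of the inner loss w.r.t. the meta-parameter
    # (here d/d_lam of lam * ||phi||^2 = ||phi||^2).
    meta_grad = ((phi_nudged ** 2).sum() - (phi_free ** 2).sum()) / beta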
Neural networks with late-phase weights
The largely successful method of training neural networks is to learn their weights using some variant of stochastic gradient descent (SGD). Here, we show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning. At the end of learning, we obtain back a single model by taking a spatial average in weight space. To avoid incurring increased computational costs, we investigate a family of low-dimensional late-phase weight models which interact multiplicatively with the remaining parameters. Our results show that augmenting standard models with late-phase weights improves generalization in established benchmarks such as CIFAR-10/100, ImageNet and enwik8. These findings are complemented with a theoretical analysis of a noisy quadratic problem which provides a simplified picture of the late phases of neural network learning.
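A minimal sketch of the idea, assuming rank-1 multiplicative late-phase components on a single linear layer (the paper's exact parameterisation, initialisation and training schedule may differ): several low-dimensional components share a base weight matrix during late training and are collapsed back into a single model by a weight-space average.

    # Sketch of late-phase weights on one linear layer (illustrative only).
    import torch
    import torch.nn as nn

    class LatePhaseLinear(nn.Module):
        # One linear layer with K low-dimensional, multiplicative late-phase weights.
        def __init__(self, in_dim, out_dim, k=4):
            super().__init__()
            self.base = nn.Linear(in_dim, out_dim)
            self.r = nn.Parameter(torch.ones(k, out_dim))  # rank-1 factors, init near identity
            self.s = nn.Parameter(torch.ones(k, in_dim))

        def member_weight(self, i):
            # Effective weight matrix of late-phase member i.
            return self.base.weight * torch.outer(self.r[i], self.s[i])

        def forward(self, x, i):
            return nn.functional.linear(x, self.member_weight(i), self.base.bias)

        def collapse(self):
            # At the end of learning, average the members in weight space
            # to obtain back a single model.
            with torch.no_grad():
                k = self.r.shape[0]
                avg = torch.stack([self.member_weight(i) for i in range(k)]).mean(0)
                self.base.weight.copy_(avg)
                self.r.fill_(1.0)
                self.s.fill_(1.0)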
Continual Learning in Recurrent Neural Networks with Hypernetworks
The last decade has seen a surge of interest in continual learning (CL), and a variety of methods have been developed to alleviate catastrophic forgetting. However, most prior work has focused on tasks with static data, while CL on sequential data has remained largely unexplored. Here we address this gap in two ways. First, we evaluate the performance of established CL methods when applied to recurrent neural networks (RNNs). We primarily focus on elastic weight consolidation, which is limited by a stability-plasticity trade-off, and explore the particularities of this trade-off when using sequential data. We show that high working memory requirements, but not necessarily sequence length, lead to an increased need for stability at the cost of decreased performance on subsequent tasks. Second, to overcome this limitation we employ a recent method based on hypernetworks and apply it to RNNs to address catastrophic forgetting on sequential data. By generating the weights of a main RNN in a task-dependent manner, our approach disentangles stability and plasticity, and outperforms alternative methods in a range of experiments. Overall, our work provides several key insights on the differences between CL in feedforward networks and in RNNs, while offering a novel solution to effectively tackle CL on sequential data.
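Below is a hedged sketch of the core mechanism, a task-conditioned hypernetwork that generates the weights of a main vanilla RNN; the dimensions, the plain tanh cell and the regulariser mentioned in the final comment are illustrative assumptions rather than the paper's exact architecture.

    # Task-conditioned hypernetwork generating the weights of a main RNN (sketch).
    import torch
    import torch.nn as nn

    class RNNHypernetwork(nn.Module):
        # Maps a learned task embedding to the weights of a main vanilla RNN.
        def __init__(self, n_tasks, emb_dim=32, in_dim=10, hidden=64):
            super().__init__()
            self.in_dim, self.hidden = in_dim, hidden
            self.task_emb = nn.Embedding(n_tasks, emb_dim)
            n_out = hidden * hidden + hidden * in_dim + hidden   # W_hh, W_xh, b
            self.hnet = nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU(),
                                      nn.Linear(256, n_out))

        def forward(self, task_id):
            flat = self.hnet(self.task_emb(task_id))
            h2 = self.hidden * self.hidden
            W_hh = flat[:h2].view(self.hidden, self.hidden)
            W_xh = flat[h2:h2 + self.hidden * self.in_dim].view(self.hidden, self.in_dim)
            b = flat[h2 + self.hidden * self.in_dim:]
            return W_hh, W_xh, b

    def rnn_step(h, x, W_hh, W_xh, b):
        # One step of the main RNN with task-specific, generated weights.
        return torch.tanh(h @ W_hh.T + x @ W_xh.T + b)

    hnet = RNNHypernetwork(n_tasks=5)
    W_hh, W_xh, b = hnet(torch.tensor(0))    # weights generated for task 0
    h = rnn_step(torch.zeros(1, 64), torch.randn(1, 10), W_hh, W_xh, b)

    # Forgetting is then addressed by penalising changes of the *generated*
    # weights for previously seen task embeddings, not of the main RNN directly.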
Learning where to learn: Gradient sparsity in meta and continual learning
Finding neural network weights that generalize well from small datasets is difficult. A promising approach is to learn a weight initialization such that a small number of weight changes results in low generalization error. We show that this form of meta-learning can be improved by letting the learning algorithm decide which weights to change, i.e., by learning where to learn. We find that patterned sparsity emerges from this process, with the pattern of sparsity varying on a problem-by-problem basis. This selective sparsity results in better generalization and less interference in a range of few-shot and continual learning problems. Moreover, we find that sparse learning also emerges in a more expressive model where learning rates are meta-learned. Our results shed light on an ongoing debate on whether meta-learning can discover adaptable features and suggest that learning by sparse gradient descent is a powerful inductive bias for meta-learning systems.
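A minimal sketch of the idea on a toy linear task (the sigmoid mask parameterisation and the single inner step are assumptions): a meta-learned per-parameter mask gates the inner-loop gradient update, so the outer loop can decide where to learn.

    # "Learning where to learn": a meta-learned mask gates inner-loop updates.
    import torch

    # Toy linear model and a per-parameter mask, both meta-learned in practice.
    w = torch.randn(5, 1, requires_grad=True)            # weight initialisation (meta-learned)
    mask_logits = torch.zeros(5, 1, requires_grad=True)  # "where to learn" (meta-learned)

    x, y = torch.randn(16, 5), torch.randn(16, 1)        # one task's support set

    def task_loss(w):
        return ((x @ w - y) ** 2).mean()

    # Inner loop: the gradient is gated by sigmoid(mask_logits); entries near zero
    # effectively freeze their weights, yielding the sparse adaptation pattern.
    g, = torch.autograd.grad(task_loss(w), w, create_graph=True)
    w_adapted = w - 0.1 * torch.sigmoid(mask_logits) * g

    # The outer (meta) loss on the query set would be backpropagated through this
    # update into both `w` and `mask_logits`.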
Gated recurrent neural networks discover attention
Recent architectural developments have enabled recurrent neural networks (RNNs) to reach and even surpass the performance of Transformers on certain sequence modeling tasks. These modern RNNs feature a prominent design pattern: linear recurrent layers interconnected by feedforward paths with multiplicative gating. Here, we show how RNNs equipped with these two design elements can exactly implement (linear) self-attention, the main building block of Transformers. By reverse-engineering a set of trained RNNs, we find that gradient descent in practice discovers our construction. In particular, we examine RNNs trained to solve simple in-context learning tasks on which Transformers are known to excel and find that gradient descent instills in our RNNs the same attention-based in-context learning algorithm used by Transformers. Our findings highlight the importance of multiplicative interactions in neural networks and suggest that certain RNNs might be unexpectedly implementing attention under the hood.
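The construction can be illustrated with a short numerical check, shown below under assumed random projections: causal linear (softmax-free) self-attention computed in parallel matches a recurrence whose matrix-valued state is updated with outer products and read out multiplicatively with the query, which is the kind of computation a gated linear RNN can express.

    # Linear self-attention written as a recurrence (illustrative check).
    import torch

    T, d = 6, 4
    x = torch.randn(T, d)
    Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv

    # Parallel (attention-style) form: y_t = sum_{s<=t} (q_t . k_s) v_s
    attn = torch.tril(q @ k.T)            # causal mask, no softmax (linear attention)
    y_parallel = attn @ v

    # Recurrent form: S_t = S_{t-1} + v_t k_t^T (an outer-product "fast weight"),
    # read out multiplicatively with the query, y_t = S_t q_t.
    S = torch.zeros(d, d)
    y_recurrent = []
    for t in range(T):
        S = S + torch.outer(v[t], k[t])
        y_recurrent.append(S @ q[t])
    y_recurrent = torch.stack(y_recurrent)

    assert torch.allclose(y_parallel, y_recurrent, atol=1e-4)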
Random initialisations performing above chance and how to find them
Neural networks trained with stochastic gradient descent (SGD) starting from different random initialisations typically find functionally very similar solutions, raising the question of whether there are meaningful differences between different SGD solutions. Entezari et al. recently conjectured that despite different initialisations, the solutions found by SGD lie in the same loss valley after taking into account the permutation invariance of neural networks. Concretely, they hypothesise that any two solutions found by SGD can be permuted such that the linear interpolation between their parameters forms a path without significant increases in loss. Here, we use a simple but powerful algorithm to find such permutations that allows us to obtain direct empirical evidence that the hypothesis is true in fully connected networks. Strikingly, we find that two networks already live in the same loss valley at the time of initialisation, and averaging their random, but suitably permuted, initialisations performs significantly above chance. In contrast, for convolutional architectures, our evidence suggests that the hypothesis does not hold. Especially in a large learning rate regime, SGD seems to discover diverse modes.
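The permutation argument can be made concrete with a small sketch on a two-layer fully connected network (illustrative only; the algorithm used to find the permutation is omitted and the permutation below is a random placeholder): re-ordering the hidden units of one network leaves its function unchanged, and the hypothesis concerns the loss along the linear path between one network and the suitably permuted other.

    # Hidden-unit permutation and weight-space interpolation of two MLPs (sketch).
    import torch

    d_in, h, d_out = 8, 16, 4
    W1_a, b1_a, W2_a = torch.randn(h, d_in), torch.randn(h), torch.randn(d_out, h)
    W1_b, b1_b, W2_b = torch.randn(h, d_in), torch.randn(h), torch.randn(d_out, h)

    perm = torch.randperm(h)   # placeholder; in practice chosen to align B with A

    # Permuting hidden units of network B leaves its input-output map unchanged.
    W1_b_p, b1_b_p, W2_b_p = W1_b[perm], b1_b[perm], W2_b[:, perm]

    def interpolate(alpha):
        # Point on the linear path between A and the permuted B, in weight space.
        W1 = (1 - alpha) * W1_a + alpha * W1_b_p
        b1 = (1 - alpha) * b1_a + alpha * b1_b_p
        W2 = (1 - alpha) * W2_a + alpha * W2_b_p
        return lambda x: torch.relu(x @ W1.T + b1) @ W2.T

    # The hypothesis: with the right `perm`, the loss along alpha in [0, 1]
    # shows no significant barrier for fully connected networks.
    y_mid = interpolate(0.5)(torch.randn(2, d_in))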