Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning
Recently, there has been growing evidence that if the width and depth of a
neural network are scaled toward the so-called rich feature learning limit
(μP and its depth extension), then some hyperparameters - such as the
learning rate - exhibit transfer from small to very large models, thus reducing
the cost of hyperparameter tuning. From an optimization perspective, this
phenomenon is puzzling, as it implies that the loss landscape is remarkably
consistent across very different model sizes. In this work, we find empirical
evidence that learning rate transfer can be attributed to the fact that under
μP and its depth extension, the largest eigenvalue of the training loss
Hessian (i.e. the sharpness) is largely independent of the width and depth of
the network for a sustained period of training time. On the other hand, we show
that under the neural tangent kernel (NTK) regime, the sharpness exhibits very
different dynamics at different scales, thus preventing learning rate transfer.
But what causes these differences in the sharpness dynamics? Through a
connection between the spectra of the Hessian and the NTK matrix, we argue that
the cause lies in the presence (for μP) or progressive absence (for the NTK
regime) of feature learning, which results in a different evolution of the NTK,
and thus of the sharpness. We corroborate our claims with a substantial suite
of experiments, covering a wide range of datasets and architectures: from
ResNets and Vision Transformers trained on benchmark vision datasets to
Transformer-based language models trained on WikiText.
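To make the tracked quantity concrete, here is a minimal PyTorch sketch (not the paper's code; the model, data, and iteration counts below are placeholder assumptions) that estimates the sharpness, i.e. the largest Hessian eigenvalue of the training loss, via power iteration on Hessian-vector products, so it can be compared across widths.

```python
# Sketch: estimate the sharpness (top Hessian eigenvalue) of the training loss
# by power iteration on Hessian-vector products. Model/data are toy placeholders.
import torch
import torch.nn as nn

def sharpness(model, loss_fn, x, y, iters=20):
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # start from a random unit vector in parameter space
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
    v = [vi / norm for vi in v]

    eig = 0.0
    for _ in range(iters):
        # Hessian-vector product via a second backward pass
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        eig = sum((hvi * vi).sum() for hvi, vi in zip(hv, v)).item()  # Rayleigh quotient
        norm = torch.sqrt(sum((hvi ** 2).sum() for hvi in hv))
        v = [hvi / norm for hvi in hv]
    return eig

# Compare the sharpness of two widths on the same toy batch.
for width in (128, 512):
    model = nn.Sequential(nn.Linear(10, width), nn.ReLU(), nn.Linear(width, 1))
    x, y = torch.randn(64, 10), torch.randn(64, 1)
    print(width, sharpness(model, nn.MSELoss(), x, y))
```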
Achieving a Better Stability-Plasticity Trade-off via Auxiliary Networks in Continual Learning
In contrast to the natural capabilities of humans to learn new tasks in a
sequential fashion, neural networks are known to suffer from catastrophic
forgetting, where the model's performance on old tasks drops dramatically after
it is optimized for a new task. In response, the continual learning (CL)
community has proposed several solutions aiming to equip the neural network
with the ability to learn the current task (plasticity) while still achieving
high accuracy on the previous tasks (stability). Despite remarkable
improvements, the plasticity-stability trade-off is still far from being solved
and its underlying mechanism is poorly understood. In this work, we propose
Auxiliary Network Continual Learning (ANCL), a novel method that applies an
additional auxiliary network which promotes plasticity to the continually
learned model which mainly focuses on stability. More concretely, the proposed
framework materializes in a regularizer that naturally interpolates between
plasticity and stability, surpassing strong baselines on task incremental and
class incremental scenarios. Through extensive analyses of ANCL solutions, we
identify some essential principles underlying the stability-plasticity trade-off.
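As a rough illustration of the idea that an auxiliary network can pull the solution toward plasticity while the old model anchors stability, here is a hypothetical parameter-space regularizer. This is a sketch only: the actual ANCL regularizer and its interpolation scheme are specified in the paper, and the squared-distance penalties and the mixing coefficient `lam` are assumptions made for illustration.

```python
# Illustrative sketch only (not the ANCL reference implementation): a penalty that
# interpolates between staying close to the frozen old model (stability) and
# moving toward an auxiliary network adapted to the new task (plasticity).
import torch

def ancl_style_penalty(model, old_model, aux_model, lam=0.5):
    penalty = 0.0
    for p, p_old, p_aux in zip(model.parameters(),
                               old_model.parameters(),
                               aux_model.parameters()):
        stability = ((p - p_old.detach()) ** 2).sum()   # stay near the previous-task solution
        plasticity = ((p - p_aux.detach()) ** 2).sum()  # follow the new-task auxiliary net
        penalty = penalty + (1.0 - lam) * stability + lam * plasticity
    return penalty

# During training on the new task (reg_strength is an assumed hyperparameter):
# loss = task_loss + reg_strength * ancl_style_penalty(model, old_model, aux_model, lam)
```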
Disentangling Linear Mode-Connectivity
Linear mode-connectivity (LMC) (or lack thereof) is one of the intriguing
characteristics of neural network loss landscapes. While empirically well
established, it unfortunately still lacks a proper theoretical understanding.
Even worse, although empirical data points abound, a systematic study of
when networks exhibit LMC is largely missing from the literature. In this work we
aim to close this gap. We explore how LMC is affected by three factors: (1)
architecture (sparsity, weight-sharing), (2) training strategy (optimization
setup) as well as (3) the underlying dataset. We place particular emphasis on
minimal but non-trivial settings, removing as much unnecessary complexity as
possible. We believe that our insights can guide future theoretical works on
uncovering the inner workings of LMC.
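For concreteness, the standard LMC diagnostic such studies rely on can be sketched as follows (the model, data loader, and number of interpolation points are assumed; the authors' exact protocol may differ): evaluate the loss along the straight line between two sets of trained weights and look for a barrier.

```python
# Sketch of the usual linear mode-connectivity check: loss along the segment
# between two trained parameter vectors theta_a and theta_b.
import copy
import torch

@torch.no_grad()
def loss_along_path(model, state_a, state_b, loss_fn, loader, n_points=11):
    losses = []
    for alpha in torch.linspace(0.0, 1.0, n_points):
        interp = copy.deepcopy(model)
        state = {}
        for k in state_a:
            if torch.is_floating_point(state_a[k]):
                state[k] = (1 - alpha) * state_a[k] + alpha * state_b[k]
            else:
                state[k] = state_a[k]  # integer buffers (e.g. BN counters) copied as-is
        interp.load_state_dict(state)
        interp.eval()
        total, count = 0.0, 0
        for x, y in loader:
            total += loss_fn(interp(x), y).item() * len(x)
            count += len(x)
        losses.append(total / count)
    return losses  # a flat profile (no barrier between the endpoints) indicates LMC
```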
Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit
The cost of hyperparameter tuning in deep learning has been rising with model
sizes, prompting practitioners to find new tuning methods using a proxy of
smaller networks. One such proposal uses μP-parameterized networks, where
the optimal hyperparameters for small-width networks transfer to networks with
arbitrarily large width. However, in this scheme, hyperparameters do not
transfer across depths. As a remedy, we study residual networks with a residual
branch scale of $1/\sqrt{\text{depth}}$ in combination with the μP
parameterization. We provide experiments demonstrating that residual
architectures including convolutional ResNets and Vision Transformers trained
with this parameterization exhibit transfer of optimal hyperparameters across
width and depth on CIFAR-10 and ImageNet. Furthermore, our empirical findings
are supported and motivated by theory. Using recent developments in the
dynamical mean field theory (DMFT) description of neural network learning
dynamics, we show that this parameterization of ResNets admits a well-defined
feature-learning limit as width and depth jointly go to infinity, and we show
convergence of finite-size network dynamics towards this limit.
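A minimal sketch of the depth-scaling ingredient is given below, assuming the residual branch multiplier is 1/sqrt(depth) applied uniformly to every block; the block body and widths are placeholders, and the full μP width parameterization (initialization and learning-rate scaling) is not shown.

```python
# Sketch: residual blocks whose branches are scaled by 1/sqrt(depth), so the
# accumulated branch contributions stay controlled as the network gets deeper.
import math
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    def __init__(self, width, depth):
        super().__init__()
        self.branch_scale = 1.0 / math.sqrt(depth)  # depth-dependent branch multiplier
        self.body = nn.Sequential(nn.Linear(width, width), nn.ReLU(),
                                  nn.Linear(width, width))

    def forward(self, x):
        return x + self.branch_scale * self.body(x)

def make_resnet(width=256, depth=32):
    # every block in the depth-`depth` stack uses the same 1/sqrt(depth) multiplier
    return nn.Sequential(*[ScaledResidualBlock(width, depth) for _ in range(depth)])
```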
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
Autoregressive Transformers adopted in Large Language Models (LLMs) are hard
to scale to long sequences. Despite several works trying to reduce their
computational cost, most LLMs still apply attention between all pairs of tokens
in the sequence, thus incurring a quadratic cost. In this study, we
present a novel approach that dynamically prunes contextual information while
preserving the model's expressiveness, resulting in reduced memory and
computational requirements during inference. Our method employs a learnable
mechanism that determines which uninformative tokens can be dropped from the
context at any point across the generation process. By doing so, our approach
not only addresses performance concerns but also enhances interpretability,
providing valuable insight into the model's decision-making process. Our
technique can be applied to existing pre-trained models through a
straightforward fine-tuning process, and the pruning strength can be specified
by a sparsity parameter. Notably, our empirical findings demonstrate that we
can effectively prune up to 80% of the context without significant performance
degradation on downstream tasks, offering a valuable tool for mitigating
inference costs. Our reference implementation achieves a substantial increase
in inference throughput and even greater memory savings.
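The following sketch conveys the general idea of shrinking the attended context during generation; it is not the paper's mechanism (the scoring head, top-k rule, and `sparsity` parameter here are illustrative assumptions).

```python
# Illustrative sketch: drop cached key/value entries with the lowest "keep" scores
# before the next decoding step, keeping a fraction (1 - sparsity) of the context.
import torch

def prune_kv_cache(keys, values, keep_scores, sparsity=0.8):
    """keys, values: (seq_len, dim); keep_scores: (seq_len,), higher = more informative."""
    seq_len = keys.shape[0]
    n_keep = max(1, int(round((1.0 - sparsity) * seq_len)))
    kept = torch.topk(keep_scores, n_keep).indices.sort().values  # preserve token order
    return keys[kept], values[kept]

# Example: drop 80% of a 1000-token cache.
k, v = torch.randn(1000, 64), torch.randn(1000, 64)
scores = torch.randn(1000)  # in practice these would come from a learned scoring mechanism
k_small, v_small = prune_kv_cache(k, v, scores, sparsity=0.8)
print(k_small.shape)  # torch.Size([200, 64])
```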
How Tempering Fixes Data Augmentation in Bayesian Neural Networks
While Bayesian neural networks (BNNs) provide a sound and principled
alternative to standard neural networks, an artificial sharpening of the
posterior usually needs to be applied to reach comparable performance. This is
in stark contrast to theory, which dictates that given an adequate prior and a
well-specified model, the untempered Bayesian posterior should achieve optimal
performance. Despite the community's extensive efforts, the origin of the
observed performance gains remains disputed, with several plausible causes
having been proposed. While data augmentation has been empirically recognized as one of the
main drivers of this effect, a theoretical account of its role, on the other
hand, is largely missing. In this work we identify two interlaced factors
concurrently influencing the strength of the cold posterior effect, namely the
correlated nature of augmentations and the degree of invariance of the employed
model to such transformations. By theoretically analyzing simplified settings,
we prove that tempering implicitly reduces the misspecification arising from
modeling augmentations as i.i.d. data. The temperature mimics the role of the
effective sample size, reflecting the gain in information provided by the
augmentations. We corroborate our theoretical findings with extensive empirical
evaluations, scaling to realistic BNNs. By relying on the framework of group
convolutions, we experiment with models of varying inherent degree of
invariance, confirming its hypothesized relationship with the optimal
temperature.
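For reference, the usual likelihood-tempering convention, together with the effective-sample-size reading suggested above, can be written as follows. The notation is ours, not the paper's, and the identification of the temperature with the augmentation count holds only in the idealized extreme of perfectly redundant augmentations and a fully invariant model; in general the optimal temperature reflects how much new information the augmentations actually provide.

```latex
% Tempered posterior with temperature T, where each of the n data points is
% expanded into K augmented copies g_1(x_i), ..., g_K(x_i) that are
% (mis)modelled as i.i.d. observations:
\[
  p_T(\theta \mid D_{\mathrm{aug}})
  \;\propto\;
  \Bigl[\,\prod_{i=1}^{n}\prod_{k=1}^{K} p\bigl(y_i \mid g_k(x_i),\theta\bigr)\Bigr]^{1/T}
  p(\theta),
  \qquad
  \frac{nK}{T} \;\approx\; \text{effective sample size}.
\]
% In the fully redundant, fully invariant extreme, T \approx K restores an
% effective sample size of n, correcting the K-fold over-counting of the likelihood.
```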
Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse
Transformers have achieved remarkable success in several domains, ranging
from natural language processing to computer vision. Nevertheless, it has been
recently shown that stacking self-attention layers - the distinctive
architectural component of Transformers - can result in rank collapse of the
tokens' representations at initialization. The question of whether and how rank
collapse affects training is still largely unanswered, and its investigation is
necessary for a more comprehensive understanding of this architecture. In this
work, we shed new light on the causes and the effects of this phenomenon.
First, we show that rank collapse of the tokens' representations hinders
training by causing the gradients of the queries and keys to vanish at
initialization. Furthermore, we provide a thorough description of the origin of
rank collapse and discuss how to prevent it via an appropriate depth-dependent
scaling of the residual branches. Finally, our analysis unveils that specific
architectural hyperparameters affect the gradients of queries and values
differently, leading to disproportionate gradient norms. This suggests an
explanation for the widespread use of adaptive methods for Transformers'
optimization.
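A toy experiment in the spirit of this analysis can be sketched as follows (the width, depth, and head count are arbitrary assumptions, and the blocks omit LayerNorm and MLPs): stack self-attention layers at initialization and track how quickly the token representations approach a rank-one matrix, with and without a 1/sqrt(depth) multiplier on the residual branches.

```python
# Sketch: measure the relative deviation of token representations from their mean
# (a proxy for distance to a rank-one matrix) through a stack of attention layers
# at initialization, for unscaled vs. 1/sqrt(depth)-scaled residual branches.
import math
import torch
import torch.nn as nn

def rank_collapse_curve(depth=64, d_model=128, seq=32, branch_scale=1.0):
    torch.manual_seed(0)  # same initialization for both runs
    layers = [nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
              for _ in range(depth)]
    x = torch.randn(1, seq, d_model)
    curve = []
    with torch.no_grad():
        for attn in layers:
            out, _ = attn(x, x, x, need_weights=False)
            x = x + branch_scale * out
            mean_tok = x.mean(dim=1, keepdim=True)
            # ratio tends to 0 as the tokens collapse onto a single direction
            curve.append(((x - mean_tok).norm() / x.norm()).item())
    return curve

plain = rank_collapse_curve(branch_scale=1.0)
scaled = rank_collapse_curve(branch_scale=1.0 / math.sqrt(64))
print(plain[-1], scaled[-1])  # the unscaled stack is expected to collapse faster
```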