Achieving a Better Stability-Plasticity Trade-off via Auxiliary Networks in Continual Learning
In contrast to the natural capabilities of humans to learn new tasks in a
sequential fashion, neural networks are known to suffer from catastrophic
forgetting, where the model's performance on old tasks drops dramatically after
being optimized for a new task. To mitigate this, the continual learning (CL)
community has proposed several solutions aiming to equip the neural network
with the ability to learn the current task (plasticity) while still achieving
high accuracy on the previous tasks (stability). Despite remarkable
improvements, the plasticity-stability trade-off is still far from being solved
and its underlying mechanism is poorly understood. In this work, we propose
Auxiliary Network Continual Learning (ANCL), a novel method that attaches an
auxiliary network promoting plasticity to the continually learned model, which
mainly focuses on stability. More concretely, the proposed
framework materializes in a regularizer that naturally interpolates between
plasticity and stability, surpassing strong baselines on task incremental and
class incremental scenarios. Through extensive analyses on ANCL solutions, we
identify some essential principles underlying the stability-plasticity trade-off.
Comment: CVPR 2023
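The abstract characterizes the ANCL regularizer only qualitatively. As a purely illustrative sketch, not the paper's actual objective, one plausible instantiation adds to the new-task loss a quadratic pull toward the frozen previous model (stability) and another toward an auxiliary network trained on the new task alone (plasticity); the function name, the quadratic form, and the weights lam_stab and lam_plast are all assumptions here:

```python
import numpy as np

def ancl_loss(theta, task_loss, theta_old, theta_aux, lam_stab, lam_plast):
    """Hypothetical ANCL-style objective (illustrative, not the paper's):
    task loss plus a penalty pulling theta toward the frozen old model
    (stability) and one pulling it toward an auxiliary network trained
    only on the new task (plasticity)."""
    stability = lam_stab * np.sum((theta - theta_old) ** 2)
    plasticity = lam_plast * np.sum((theta - theta_aux) ** 2)
    return task_loss(theta) + stability + plasticity
```

The ratio of lam_stab to lam_plast is what would interpolate between the two regimes.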
On the Theoretical Properties of Noise Correlation in Stochastic Optimization
Studying the properties of stochastic noise to optimize complex non-convex
functions has been an active area of research in the field of machine learning.
Prior work has shown that the noise of stochastic gradient descent improves
optimization by overcoming undesirable obstacles in the landscape. Moreover,
injecting artificial Gaussian noise has become a popular idea to quickly escape
saddle points. Indeed, in the absence of reliable gradient information, the
noise is used to explore the landscape, but it is unclear what type of noise is
optimal in terms of exploration ability. In order to narrow this gap in our
knowledge, we study a general type of continuous-time non-Markovian process,
based on fractional Brownian motion, that allows for the increments of the
process to be correlated. This generalizes processes based on Brownian motion,
such as the Ornstein-Uhlenbeck process. We demonstrate how to discretize such
processes, which gives rise to the new algorithm fPGD. This method is a
generalization of the known algorithms PGD and Anti-PGD. We study the
properties of fPGD both theoretically and empirically, demonstrating that its
exploration abilities in some cases compare favorably with those of PGD and
Anti-PGD. These results open the field to novel ways to exploit noise for
training machine learning models.
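The abstract names fPGD without stating its update rule. A rough sketch of the underlying idea, hedged: perturbed gradient descent whose injected noise consists of fractional Brownian motion increments (fractional Gaussian noise) with Hurst parameter H, so that H = 0.5 recovers i.i.d. Gaussian perturbations as in PGD, while H < 0.5 gives anti-correlated increments in the spirit of Anti-PGD. The Cholesky-based sampler and the sqrt(lr) noise scaling below are standard conventions, not necessarily the paper's:

```python
import numpy as np

def fgn(n, hurst, rng):
    """Sample n fractional-Brownian-motion increments (fractional Gaussian
    noise) via a Cholesky factor of the fGn covariance matrix
    cov(i, j) = 0.5 * (|d+1|^{2H} + |d-1|^{2H} - 2|d|^{2H}), d = i - j."""
    d = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float)
    cov = 0.5 * (np.abs(d + 1) ** (2 * hurst)
                 + np.abs(d - 1) ** (2 * hurst)
                 - 2.0 * d ** (2 * hurst))
    L = np.linalg.cholesky(cov + 1e-12 * np.eye(n))
    return L @ rng.standard_normal(n)

def fpgd(grad, x0, steps, lr=1e-2, sigma=0.1, hurst=0.3, seed=0):
    """Sketch of perturbed gradient descent with fBm-correlated noise;
    hurst=0.5 reduces to i.i.d. Gaussian perturbations (PGD-style)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    noise = np.stack([fgn(steps, hurst, rng) for _ in x], axis=1)
    for k in range(steps):
        x = x - lr * grad(x) + sigma * np.sqrt(lr) * noise[k]
    return x

# Toy usage: wander out of the flat region of f(x) = x^4 / 4 near zero.
x_final = fpgd(lambda x: x ** 3, x0=[0.0], steps=200)
```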
On the Universality of Linear Recurrences Followed by Nonlinear Projections
In this note (work in progress towards a full-length paper) we show that a
family of sequence models based on recurrent linear layers (including S4, S5,
and the LRU) interleaved with position-wise multi-layer perceptrons (MLPs) can
approximate arbitrarily well any sufficiently regular non-linear
sequence-to-sequence map. The main idea behind our result is to see recurrent
layers as compression algorithms that can faithfully store information about
the input sequence into an inner state, before it is processed by the highly
expressive MLP.
Comment: Accepted at HLD 2023: 1st Workshop on High-dimensional Learning Dynamics
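For context, a schematic sketch of the architecture family the note studies: a linear recurrence that compresses the input sequence into a state, followed by a position-wise MLP. The real-valued diagonal transition below is a deliberate simplification; actual S4/S5/LRU layers use structured (often complex diagonal) parameterizations:

```python
import numpy as np

def linear_recurrence_block(u, lam, B, mlp):
    """One schematic block: diagonal linear state update (the 'compressor'
    storing information about the sequence) followed by a nonlinear
    position-wise readout. u: (T, d_in); lam: (d_state,); B: (d_state, d_in)."""
    x = np.zeros(lam.shape[0])
    ys = []
    for k in range(u.shape[0]):
        x = lam * x + B @ u[k]   # linear recurrence, no nonlinearity
        ys.append(mlp(x))        # expressive MLP applied per position
    return np.stack(ys)

# Toy usage with a one-hidden-layer tanh MLP as the nonlinear projection.
rng = np.random.default_rng(0)
d_in, d_state, d_hidden = 3, 16, 32
W1 = rng.standard_normal((d_hidden, d_state))
W2 = rng.standard_normal((1, d_hidden))
y = linear_recurrence_block(rng.standard_normal((10, d_in)),
                            lam=np.full(d_state, 0.9),
                            B=rng.standard_normal((d_state, d_in)),
                            mlp=lambda x: W2 @ np.tanh(W1 @ x))
```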
On the effectiveness of Randomized Signatures as Reservoir for Learning Rough Dynamics
Many finance, physics, and engineering phenomena are modeled by
continuous-time dynamical systems driven by highly irregular (stochastic)
inputs. A powerful tool to perform time series analysis in this context is
rooted in rough path theory and leverages the so-called Signature Transform.
This algorithm enjoys strong theoretical guarantees but is hard to scale to
high-dimensional data. In this paper, we study a recently derived random
projection variant called Randomized Signature, obtained using the
Johnson-Lindenstrauss Lemma. We provide an in-depth experimental evaluation of
the effectiveness of the Randomized Signature approach, in an attempt to
showcase the advantages of this reservoir to the community. Specifically, we
find that this method is preferable to the truncated Signature approach and
alternative deep learning techniques in terms of model complexity, training
time, accuracy, robustness, and data efficiency.
Comment: Accepted for IEEE IJCNN 2023
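A minimal sketch of the Randomized Signature construction as it is usually presented in the reservoir literature: the input path drives a controlled difference equation built from fixed random projections, and only a linear readout on the final state would be trained. The reservoir width, tanh activation, and 1/sqrt(k) scaling are illustrative assumptions:

```python
import numpy as np

def randomized_signature(X, k=64, seed=0):
    """Reservoir state driven by the path X of shape (T, d):
        Z <- Z + sum_i tanh(A_i @ Z + b_i) * dX_i,
    with fixed random A_i, b_i (Johnson-Lindenstrauss-style projections).
    The state Z is never trained; a linear model is fit on top of it."""
    rng = np.random.default_rng(seed)
    T, d = X.shape
    A = rng.standard_normal((d, k, k)) / np.sqrt(k)
    b = rng.standard_normal((d, k))
    Z = rng.standard_normal(k)            # random initial reservoir state
    for t in range(T - 1):
        dX = X[t + 1] - X[t]              # path increment
        Z = Z + sum(np.tanh(A[i] @ Z + b[i]) * dX[i] for i in range(d))
    return Z

# Toy usage: fixed random features of a 2-d random walk, ready for e.g.
# ridge regression on a downstream target.
path = np.cumsum(np.random.default_rng(1).standard_normal((100, 2)), axis=0)
features = randomized_signature(path)
```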
An SDE for Modeling SAM: Theory and Insights
We study the SAM (Sharpness-Aware Minimization) optimizer which has recently
attracted a lot of interest due to its increased performance over more
classical variants of stochastic gradient descent. Our main contribution is the
derivation of continuous-time models (in the form of SDEs) for SAM and two of
its variants, both for the full-batch and mini-batch settings. We demonstrate
that these SDEs are rigorous approximations of the real discrete-time
algorithms (in a weak sense, scaling linearly with the learning rate). Using
these models, we then offer an explanation of why SAM prefers flat minima over
sharp ones, by showing that it minimizes an implicitly regularized loss with
a Hessian-dependent noise structure. Finally, we prove that SAM is attracted to
saddle points under some realistic conditions. Our theoretical results are
supported by detailed experiments.
Comment: Accepted at ICML 2023 (Poster)
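For reference, a minimal sketch of the standard discrete (full-batch) SAM step whose continuous-time limit the paper models with SDEs; the toy quadratic and hyperparameters are illustrative only:

```python
import numpy as np

def sam_step(x, grad, lr=0.1, rho=0.05):
    """One SAM update as commonly defined: ascend to the worst-case point
    x + rho * g / ||g|| within a radius-rho ball, then descend using the
    gradient evaluated there."""
    g = grad(x)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    return x - lr * grad(x + eps)

# Toy usage on f(x) = 0.5 * x.T @ H @ x with one sharp and one flat direction.
H = np.diag([10.0, 0.1])
x = np.array([1.0, 1.0])
for _ in range(100):
    x = sam_step(x, lambda z: H @ z)
```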