Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
Despite the widespread practical success of deep learning methods, our
theoretical understanding of the dynamics of learning in deep neural networks
remains quite sparse. We attempt to bridge the gap between the theory and
practice of deep learning by systematically analyzing learning dynamics for the
restricted case of deep linear neural networks. Despite the linearity of their
input-output map, such networks have nonlinear gradient descent dynamics on
weights that change with the addition of each new hidden layer. We show that
deep linear networks exhibit nonlinear learning phenomena similar to those seen
in simulations of nonlinear networks, including long plateaus followed by rapid
transitions to lower error solutions, and faster convergence from greedy
unsupervised pretraining initial conditions than from random initial
conditions. We provide an analytical description of these phenomena by finding
new exact solutions to the nonlinear dynamics of deep learning. Our theoretical
analysis also reveals the surprising finding that as the depth of a network
approaches infinity, learning speed can nevertheless remain finite: for a
special class of initial conditions on the weights, very deep networks incur
only a finite, depth independent, delay in learning speed relative to shallow
networks. We show that, under certain conditions on the training data,
unsupervised pretraining can find this special class of initial conditions,
while scaled random Gaussian initializations cannot. We further exhibit a new
class of random orthogonal initial conditions on weights that, like
unsupervised pre-training, enjoys depth independent learning times. We further
show that these initial conditions also lead to faithful propagation of
gradients even in deep nonlinear networks, as long as they operate in a special
regime known as the edge of chaos.
Comment: Submission to ICLR 2014. Revised based on reviewer feedback.
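The plateau-then-transition dynamics described above can be reproduced in a few lines. The following sketch is our own toy, not the paper's code; all dimensions, the learning rate, and the initialization scale are illustrative choices. It trains a two-layer linear network by plain gradient descent from a small random initialization:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (dimensions, sample count, and learning rate are illustrative).
d_in, d_hidden, d_out, n = 8, 8, 8, 200
X = rng.standard_normal((d_in, n))
W_target = rng.standard_normal((d_out, d_in))
Y = W_target @ X

# Small balanced random initialization: the composite map W2 @ W1 starts
# near zero, producing a long plateau before a rapid drop in error.
scale = 1e-3
W1 = scale * rng.standard_normal((d_hidden, d_in))
W2 = scale * rng.standard_normal((d_out, d_hidden))

lr = 0.05
losses = []
for step in range(5000):
    R = W2 @ W1 @ X - Y                  # residual on the training set
    losses.append(0.5 * np.mean(R**2))
    G = R / n                            # per-sample error signal
    # Although the input-output map is linear, each layer's gradient is
    # multiplied by the other layer's weights, making the dynamics
    # nonlinear in the weights.
    gW2 = G @ (W1 @ X).T
    gW1 = W2.T @ G @ X.T
    W2 -= lr * gW2
    W1 -= lr * gW1

print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.6f}")
```

Because both layers start near zero, each gradient is a product of small factors, so the error barely moves for many steps before the composite map rapidly locks onto the target.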
A mathematical theory of semantic development in deep neural networks
An extensive body of empirical research has revealed remarkable regularities
in the acquisition, organization, deployment, and neural representation of
human semantic knowledge, thereby raising a fundamental conceptual question:
what are the theoretical principles governing the ability of neural networks to
acquire, organize, and deploy abstract knowledge by integrating across many
individual experiences? We address this question by mathematically analyzing
the nonlinear dynamics of learning in deep linear networks. We find exact
solutions to this learning dynamics that yield a conceptual explanation for the
prevalence of many disparate phenomena in semantic cognition, including the
hierarchical differentiation of concepts through rapid developmental
transitions, the ubiquity of semantic illusions between such transitions, the
emergence of item typicality and category coherence as factors controlling the
speed of semantic processing, changing patterns of inductive projection over
development, and the conservation of semantic similarity in neural
representations across species. Thus, surprisingly, our simple neural model
qualitatively recapitulates many diverse regularities underlying semantic
development, while providing analytic insight into how the statistical
structure of an environment can interact with nonlinear deep learning dynamics
to give rise to these regularities.
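The exact solutions referenced above take a clean closed form: each input-output mode of the training data's correlation structure is learned along an independent sigmoidal trajectory whose speed is set by its singular value. A sketch under the standard two-layer, small-balanced-initialization assumptions (the initial strength `a0` and time constant `tau` are illustrative values we chose):

```python
import numpy as np

def mode_strength(t, s, a0=1e-4, tau=1.0):
    """Closed-form trajectory of one input-output mode in a deep linear
    network; s is a singular value of the input-output correlation
    matrix. Starts at a0 and saturates at s."""
    e = np.exp(2.0 * s * t / tau)
    return s * e / (e - 1.0 + s / a0)

# Each mode follows a sigmoid: a long dormant phase, then a rapid
# transition to its asymptote s. Stronger modes (coarse distinctions)
# rise first, the model's account of hierarchical differentiation.
t = np.linspace(0.0, 20.0, 2001)
half_times = {s: t[np.argmax(mode_strength(t, s) >= s / 2)]
              for s in (2.0, 1.0, 0.5)}
print(half_times)
```

Modes with larger singular values saturate earlier than weaker ones (finer distinctions), giving the stage-like developmental transitions described above.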
Meta-Learning Strategies through Value Maximization in Neural Networks
Biological and artificial learning agents face numerous choices about how to
learn, ranging from hyperparameter selection to aspects of task distributions
like curricula. Understanding how to make these meta-learning choices could
offer normative accounts of cognitive control functions in biological learners
and improve engineered systems. Yet optimal strategies remain challenging to
compute in modern deep networks due to the complexity of optimizing through the
entire learning process. Here we theoretically investigate optimal strategies
in a tractable setting. We present a learning effort framework capable of
efficiently optimizing control signals on a fully normative objective:
discounted cumulative performance throughout learning. We obtain computational
tractability by using average dynamical equations for gradient descent,
available for simple neural network architectures. Our framework accommodates a
range of meta-learning and automatic curriculum learning methods in a unified
normative setting. We apply this framework to investigate the effect of
approximations in common meta-learning algorithms; infer aspects of optimal
curricula; and compute optimal neuronal resource allocation in a continual
learning setting. Across settings, we find that control effort is most
beneficial when applied to easier aspects of a task early in learning, followed
by sustained effort on harder aspects. Overall, the learning effort framework
provides a tractable theoretical test bed to study normative benefits of
interventions in a variety of learning systems, as well as a formal account of
optimal cognitive control strategies over learning trajectories posited by
established theories in cognitive neuroscience.
Comment: Under review.
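As a caricature of the learning effort framework, one can discretize a single averaged learning trajectory and search for the control signal that maximizes discounted cumulative performance net of control cost. Everything below, from the exponential-decay dynamics to the cost and discount constants, is an invented toy rather than the paper's setup:

```python
import numpy as np

# Toy "learning effort" problem: the loss decays at a rate set by a
# control signal g[k] (e.g. a learning-rate multiplier), but exerting
# control carries a quadratic cost. All constants are illustrative.
T, dt, gamma, cost = 100, 0.1, 0.99, 0.05

def value(g):
    """Discounted cumulative performance minus discounted control cost."""
    loss, total = 1.0, 0.0
    for k in range(T):
        loss *= np.exp(-g[k] * dt)            # averaged learning dynamics
        total += gamma**k * ((1.0 - loss) - cost * g[k] ** 2) * dt
    return total

# Greedy coordinate search over a small grid of control levels -- a crude
# stand-in for the gradient-based optimization used in the paper.
g = np.ones(T)
for _ in range(3):
    for k in range(T):
        candidates = (0.0, 0.5, 1.0, 1.5, 2.0)
        g[k] = max(candidates,
                   key=lambda v: value(np.r_[g[:k], v, g[k + 1:]]))

# Optimal effort is front-loaded: high early, tapering to zero once the
# loss is low and further control no longer pays for its cost.
print(g[:5], g[-5:])
```

Even this crude search recovers the qualitative finding quoted above: control effort concentrates early in learning, when reducing the loss still pays dividends over all remaining time.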
Mice identify subgoal locations through an action-driven mapping process
Mammals form mental maps of their environments by exploring their surroundings. Here, we investigate which elements of exploration are important for this process. We studied mouse escape behavior, in which mice are known to memorize subgoal locations (obstacle edges) to execute efficient escape routes to shelter. To test the role of exploratory actions, we developed closed-loop neural-stimulation protocols for interrupting various actions while mice explored. We found that blocking running movements directed at obstacle edges prevented subgoal learning; however, blocking several control movements had no effect. Reinforcement learning simulations and analysis of spatial data show that artificial agents can match these results if they have a region-level spatial representation and explore with object-directed movements. We conclude that mice employ an action-driven process for integrating subgoals into a hierarchical cognitive map. These findings broaden our understanding of the cognitive toolkit that mammals use to acquire spatial knowledge.
Minnorm training: an algorithm for training over-parameterized deep neural networks
In this work, we propose a new training method for finding minimum weight
norm solutions in over-parameterized neural networks (NNs). This method seeks
to improve training speed and generalization performance by framing NN training
as a constrained optimization problem wherein the sum of the norm of the
weights in each layer of the network is minimized, under the constraint of
exactly fitting training data. It draws inspiration from support vector
machines (SVMs), which are able to generalize well, despite often having an
infinite number of free parameters in their primal form, and from recent
theoretical generalization bounds on NNs which suggest that lower norm
solutions generalize better. To solve this constrained optimization problem,
our method employs Lagrange multipliers that act as integrators of error over
training and identify `support vector'-like examples. The method can be
implemented as a wrapper around gradient based methods and uses standard
back-propagation of gradients from the NN for both regression and
classification versions of the algorithm. We provide theoretical justifications
for the effectiveness of this algorithm in comparison to early stopping and
L2-regularization using simple, analytically tractable settings. In
particular, we show faster convergence to the max-margin hyperplane in a
shallow network (compared to vanilla gradient descent); faster convergence to
the minimum-norm solution in a linear chain (compared to L2-regularization);
and initialization-independent generalization performance in a deep linear
network. Finally, using the MNIST dataset, we demonstrate that this algorithm
can boost test accuracy and identify difficult examples in real-world datasets.
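In the linear setting, the scheme sketched above reduces to primal-dual gradient iterations: the multipliers integrate per-example error over training while the weights descend the Lagrangian of a norm-minimization problem with exact-fit constraints. A minimal reconstruction (our own, not the authors' code; problem sizes and step sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Over-parameterized linear problem: many more weights than constraints.
n, d = 10, 50
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)
lam = np.zeros(n)            # one Lagrange multiplier per training example
lr_w, lr_lam = 0.05, 0.05

# Primal-dual gradient iterations on
#   min_w 0.5 * ||w||^2   subject to   X @ w = y
# The multipliers accumulate (integrate) each example's error, echoing
# the paper's "support-vector-like" interpretation.
for _ in range(20000):
    err = X @ w - y
    lam += lr_lam * err                    # dual ascent: integrate error
    w -= lr_w * (w + X.T @ lam)            # primal descent on the Lagrangian

# Closed-form minimum-norm solution for comparison.
w_min = X.T @ np.linalg.solve(X @ X.T, y)
print(np.linalg.norm(X @ w - y), np.linalg.norm(w - w_min))
```

The fixed point of these iterations satisfies both the exact-fit constraint and stationarity of the Lagrangian, so the weights converge to the minimum-norm interpolating solution.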
The Transient Nature of Emergent In-Context Learning in Transformers
Transformer neural networks can exhibit a surprising capacity for in-context
learning (ICL) despite not being explicitly trained for it. Prior work has
provided a deeper understanding of how ICL emerges in transformers, e.g.
through the lens of mechanistic interpretability, Bayesian inference, or by
examining the distributional properties of training data. However, in each of
these cases, ICL is treated largely as a persistent phenomenon; namely, once
ICL emerges, it is assumed to persist asymptotically. Here, we show that the
emergence of ICL during transformer training is, in fact, often transient. We
train transformers on synthetic data designed so that both ICL and in-weights
learning (IWL) strategies can lead to correct predictions. We find that ICL
first emerges, then disappears and gives way to IWL, all while the training
loss decreases, indicating an asymptotic preference for IWL. The transient
nature of ICL is observed in transformers across a range of model sizes and
datasets, raising the question of how much to "overtrain" transformers when
seeking compact, cheaper-to-run models. We find that L2 regularization may
offer a path to more persistent ICL that removes the need for early stopping
based on ICL-style validation tasks. Finally, we present initial evidence that
ICL transience may be caused by competition between ICL and IWL circuits.
Comment: 19 pages, 16 figures.
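The kind of synthetic data described above can be sketched in a few lines. This is our reconstruction with invented details, not the authors' setup: each class has a prototype vector and a fixed label, and every sequence's query class also appears in the context, so either strategy works: matching the query to a nearby context exemplar (ICL) or recalling the class-label mapping from weights (IWL).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative constants: class count, exemplar dimension, context length,
# and within-class noise are all invented.
n_classes, dim, n_context, noise = 16, 8, 4, 0.1
prototypes = rng.standard_normal((n_classes, dim))
labels = np.arange(n_classes)

def make_sequence():
    """Context of (exemplar, label) pairs plus a query from a context class."""
    classes = rng.choice(n_classes, size=n_context, replace=False)
    exemplars = prototypes[classes] + noise * rng.standard_normal((n_context, dim))
    q_idx = rng.integers(n_context)
    query = prototypes[classes[q_idx]] + noise * rng.standard_normal(dim)
    return exemplars, labels[classes], query, labels[classes[q_idx]]

# An "ICL" read-off: copy the label of the nearest context exemplar.
exemplars, ctx_labels, query, target = make_sequence()
icl_answer = ctx_labels[np.argmin(np.linalg.norm(exemplars - query, axis=1))]
print(icl_answer, target)
```

Because both strategies score perfectly on such data, only the training dynamics decide which one the network adopts, which is what makes the observed rise and fall of ICL informative.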