94,978 research outputs found
Backward Feature Correction: How Deep Learning Performs Deep Learning
How does a 110-layer ResNet learn a high-complexity classifier using
relatively few training examples and short training time? We present a theory
towards explaining this in terms of hierarchical learning. By hierarchical
learning we mean that the learner learns to represent a complicated target
function by decomposing it into a sequence of simpler functions, which reduces
sample and time complexity. This paper formally analyzes how multi-layer neural
networks can perform such hierarchical learning efficiently and automatically
by applying SGD.
On the conceptual side, we present, to the best of our knowledge, the FIRST
theory result indicating how deep neural networks can be sample and time
efficient on certain hierarchical learning tasks, when NO KNOWN
non-hierarchical algorithms (such as kernel method, linear regression over
feature mappings, tensor decomposition, sparse coding, and their simple
combinations) are efficient. We establish a principle called "backward feature
correction", where training higher layers in the network can improve the
features of lower level ones. We believe this is the key to understanding the deep
learning process in multi-layer neural networks.
On the technical side, we show for every input dimension $d > 0$, there is a
concept class consisting of degree $\omega(1)$ multi-variate polynomials so
that, using $\omega(1)$-layer neural networks as learners, SGD can learn any
target function from this class in $\mathsf{poly}(d)$ time using $\mathsf{poly}(d)$
samples to any $\frac{1}{\mathsf{poly}(d)}$ error, through
learning to represent it as a composition of $\omega(1)$ layers of quadratic
functions. In contrast, we present lower bounds stating that several
non-hierarchical learners, including any kernel methods and neural tangent
kernels, must suffer from $d^{\omega(1)}$ sample or time complexity to learn
this concept class even to $d^{-0.01}$ error.
Comment: V2 adds more experiments, V3 polishes writing and improves
experiments, V4 makes minor fixes to the figure
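The abstract above describes target functions represented as a composition of layers of quadratic functions. As a rough illustration only (not the paper's construction), the sketch below composes univariate quadratics and shows that the polynomial degree doubles with each layer. The coefficients and the helper names `compose_quadratic` and `compose_layers` are hypothetical, chosen purely for this example.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def compose_quadratic(outer, inner):
    """Coefficients (lowest degree first) of outer(inner(x)),
    where outer = (c0, c1, c2) encodes c0 + c1*z + c2*z**2."""
    c0, c1, c2 = outer
    inner = np.asarray(inner, dtype=float)
    squared = P.polymul(inner, inner)
    return P.polyadd(P.polyadd([c0], c1 * inner), c2 * squared)

def compose_layers(layers):
    """Compose a list of quadratic layers, starting from the identity x."""
    result = np.array([0.0, 1.0])  # the polynomial "x"
    for layer in layers:
        result = compose_quadratic(layer, result)
    return result

# Four identical quadratic layers (illustrative coefficients, an assumption).
layers = [(1.0, 0.5, 2.0)] * 4
coeffs = compose_layers(layers)
degree = len(coeffs) - 1  # doubles per layer: 2 ** 4
```

With four layers the composed polynomial has degree 16, although each layer is only quadratic. This is the sense in which a deep composition of simple functions can express a high-degree target.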
Generalization Error in Deep Learning
Deep learning models have lately shown great performance in various fields
such as computer vision, speech recognition, speech translation, and natural
language processing. However, alongside their state-of-the-art performance, it
is still generally unclear what the source of their generalization ability is.
Thus, an important question is what makes deep neural networks able to
generalize well from the training set to new data. In this article, we provide
an overview of the existing theory and bounds for the characterization of the
generalization error of deep neural networks, combining both classical and more
recent theoretical and empirical results.
Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning
Model-free deep reinforcement learning algorithms have been shown to be
capable of learning a wide range of robotic skills, but typically require a
very large number of samples to achieve good performance. Model-based
algorithms, in principle, can provide for much more efficient learning, but
have proven difficult to extend to expressive, high-capacity models such as
deep neural networks. In this work, we demonstrate that medium-sized neural
network models can in fact be combined with model predictive control (MPC) to
achieve excellent sample complexity in a model-based reinforcement learning
algorithm, producing stable and plausible gaits to accomplish various complex
locomotion tasks. We also propose using deep neural network dynamics models to
initialize a model-free learner, in order to combine the sample efficiency of
model-based approaches with the high task-specific performance of model-free
methods. We empirically demonstrate on MuJoCo locomotion tasks that our pure
model-based approach trained on just random action data can follow arbitrary
trajectories with excellent sample efficiency, and that our hybrid algorithm
can accelerate model-free learning on high-speed benchmark tasks, achieving
sample efficiency gains of 3-5x on swimmer, cheetah, hopper, and ant agents.
Videos can be found at https://sites.google.com/view/mbm
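The abstract above pairs a learned dynamics model with model predictive control (MPC). A minimal sketch of one simple MPC variant, random shooting, follows. Whether the paper uses this variant is not stated in the abstract, so treat it as an assumption. The `toy_dynamics` function is a hypothetical stand-in for a trained neural network model, and the cost, horizon, and candidate count are illustrative.

```python
import numpy as np

def toy_dynamics(state, action):
    # Hypothetical stand-in for a learned dynamics model:
    # a 1-D point whose position shifts by the action.
    return state + action

def cost(state, target):
    # Squared distance to the target state.
    return float(np.sum((state - target) ** 2))

def random_shooting_mpc(state, target, dynamics, horizon=5,
                        n_candidates=200, rng=None):
    """Sample random action sequences, roll each out through the model,
    and return the first action of the lowest-cost sequence."""
    rng = np.random.default_rng(0) if rng is None else rng
    candidates = rng.uniform(-1.0, 1.0,
                             size=(n_candidates, horizon, state.shape[0]))
    best_cost, best_first_action = np.inf, None
    for seq in candidates:
        s, total = state, 0.0
        for a in seq:
            s = dynamics(s, a)
            total += cost(s, target)
        if total < best_cost:
            best_cost, best_first_action = total, seq[0]
    return best_first_action

def run_episode(steps=20):
    """Replan at every step and apply only the first planned action."""
    rng = np.random.default_rng(0)
    state, target = np.zeros(1), np.array([3.0])
    for _ in range(steps):
        action = random_shooting_mpc(state, target, toy_dynamics, rng=rng)
        state = toy_dynamics(state, action)
    return state, target
```

Calling `run_episode()` drives the toy state toward the target. Replanning at each step is what lets the controller tolerate an imperfect model, since errors in a long rollout are corrected by the next plan.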
The Information Complexity of Learning Tasks, their Structure and their Distance
We introduce an asymmetric distance in the space of learning tasks, and a
framework to compute their complexity. These concepts are foundational for the
practice of transfer learning, whereby a parametric model is pre-trained for a
task, and then fine-tuned for another. The framework we develop is
non-asymptotic, captures the finite nature of the training dataset, and allows
distinguishing learning from memorization. It encompasses, as special cases,
classical notions from Kolmogorov complexity, Shannon, and Fisher Information.
However, unlike some of those frameworks, it can be applied to large-scale
models and real-world datasets. Our framework is the first to measure
complexity in a way that accounts for the effect of the optimization scheme,
which is critical in Deep Learning.
Incremental construction of LSTM recurrent neural network
Long Short-Term Memory (LSTM) is a recurrent neural network that
uses structures called memory blocks to allow the network to remember
significant events distant in the past input sequence in order to
solve long time lag tasks, where other RNN approaches fail.
Throughout this work we have performed experiments using LSTM
networks extended with growing abilities, which we call GLSTM.
Four methods of training growing LSTM have been compared. These
methods include cascade and fully connected hidden layers as well
as two different levels of freezing previous weights in the
cascade case. GLSTM has been applied to a forecasting problem in a
biomedical domain, where the input/output behavior of five
controllers of the Central Nervous System control has to be
modelled. We have compared growing LSTM results against other
neural network approaches, and against our work applying conventional
LSTM to the task at hand.
Postprint (published version)