Feedforward and Recurrent Neural Networks Backward Propagation and Hessian in Matrix Form
In this paper we focus on the linear algebra theory behind feedforward (FNN)
and recurrent (RNN) neural networks. We review backward propagation, including
backward propagation through time (BPTT). Also, we obtain a new exact
expression for the Hessian, which represents second order effects. We show that
for t time steps the weight gradient can be expressed as a rank-t matrix, while
the weight Hessian can be expressed as a sum of Kronecker products of rank-1
and W^T A W matrices, for some matrix A and weight matrix W. Also, we show that
for a mini-batch of size n, the weight update can be expressed as a rank-nt
matrix. Finally, we briefly comment on the eigenvalues of the Hessian
matrix. Comment: 23 pages, 4 figures
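A minimal NumPy sketch of the low-rank structure described in this abstract (toy sizes, a simple tanh recurrence, and a squared-error loss are all assumptions; this is not the paper's code): the BPTT gradient of the recurrent weight matrix accumulates one outer product per time step, so after t steps its rank is at most t.

```python
import numpy as np

# Sketch: the BPTT gradient of the recurrent weights W is a sum of one outer
# product per time step, so its rank is at most the number of time steps t.
rng = np.random.default_rng(0)
d, t = 8, 3                               # hidden size, time steps (t < d)
W = rng.normal(size=(d, d)) * 0.1
x = rng.normal(size=(t, d))               # inputs, injected additively
h = np.zeros((t + 1, d))
h[0] = rng.normal(size=d)                 # nonzero initial state

for k in range(1, t + 1):                 # forward: h_k = tanh(W h_{k-1} + x_k)
    h[k] = np.tanh(W @ h[k - 1] + x[k - 1])

target = rng.normal(size=d)
delta = h[t] - target                     # dL/dh_t for L = 0.5 ||h_t - target||^2

grad_W = np.zeros_like(W)
for k in range(t, 0, -1):                 # backward through time
    delta = delta * (1.0 - h[k] ** 2)     # through the tanh nonlinearity
    grad_W += np.outer(delta, h[k - 1])   # rank-1 contribution of step k
    delta = W.T @ delta                   # pass the error to the previous step

print(np.linalg.matrix_rank(grad_W))      # at most t = 3, even though W is 8 x 8
```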
The Outer Product Structure of Neural Network Derivatives
In this paper, we show that feedforward and recurrent neural networks exhibit
an outer product derivative structure but that convolutional neural networks do
not. This structure makes it possible to use higher-order information without
needing approximations or infeasibly large amounts of memory, and it may also
provide insights into the geometry of neural network optima. The ability to
easily access these derivatives also suggests a new, geometric approach to
regularization. We then discuss how this structure could be used to improve
training methods, increase network robustness and generalizability, and inform
network compression methods
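A minimal sketch of the outer product structure for a single dense layer (the layer size, identity activation, and squared-error loss are assumed for illustration): the per-example weight gradient is the outer product of the backpropagated error with the incoming activations.

```python
import numpy as np

# Sketch: for a dense layer z = W a, the weight gradient is the outer product
# of the output error with the input activations -- a rank-1 matrix per example.
rng = np.random.default_rng(1)
a = rng.normal(size=5)                   # activations entering the layer
W = rng.normal(size=(3, 5))
y = rng.normal(size=3)                   # regression target

z = W @ a
delta = z - y                            # dL/dz for L = 0.5 ||z - y||^2
grad_W = np.outer(delta, a)              # dL/dW = delta a^T

# Spot-check one entry against a finite difference.
eps = 1e-6
W_pert = W.copy()
W_pert[1, 2] += eps
fd = (0.5 * np.sum((W_pert @ a - y) ** 2) - 0.5 * np.sum((z - y) ** 2)) / eps
print(np.isclose(grad_W[1, 2], fd, atol=1e-4))   # True
```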
Hessian-free Optimization for Learning Deep Multidimensional Recurrent Neural Networks
Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable
performance in the area of speech and handwriting recognition. The performance
of an MDRNN is improved by further increasing its depth, and the difficulty of
learning the deeper network is overcome by using Hessian-free (HF)
optimization. Given that connectionist temporal classification (CTC) is
utilized as an objective of learning an MDRNN for sequence labeling, the
non-convexity of CTC poses a problem when applying HF to the network. As a
solution, a convex approximation of CTC is formulated and its relationship with
the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to
a depth of 15 layers is successfully trained using HF, resulting in an improved
performance for sequence labeling. Comment: to appear at NIPS 2015
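The Hessian-free idea itself can be sketched on a toy convex problem (logistic regression here is an assumption standing in for the MDRNN/CTC objective): the Hessian is never formed; curvature-vector products, here approximated by finite differences of the gradient, drive conjugate gradients to obtain the update direction.

```python
import numpy as np

# Sketch of Hessian-free optimization on a toy convex problem (logistic
# regression, assumed data): solve H d = -g with conjugate gradients, using
# only Hessian-vector products (here via finite differences of the gradient).
rng = np.random.default_rng(2)
X = rng.normal(size=(64, 5))
y = (rng.random(64) < 0.5).astype(float)
w = np.zeros(5)

def grad(w):
    p = 1.0 / (1.0 + np.exp(-X @ w))      # predicted probabilities
    return X.T @ (p - y) / len(y)

def hvp(w, v, eps=1e-5):                  # Hessian-vector product, matrix-free
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

def cg(w, b, iters=20):                   # solve H d = b without forming H
    d, r = np.zeros_like(b), b.copy()
    s = r.copy()
    for _ in range(iters):
        Hs = hvp(w, s)
        alpha = (r @ r) / (s @ Hs)
        d += alpha * s
        r_new = r - alpha * Hs
        if np.linalg.norm(r_new) < 1e-10:
            break
        s = r_new + ((r_new @ r_new) / (r @ r)) * s
        r = r_new
    return d

for _ in range(5):                        # outer Newton-like updates
    w += cg(w, -grad(w))
print(np.linalg.norm(grad(w)))            # near zero after a few HF steps
```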
A Critical Review of Recurrent Neural Networks for Sequence Learning
Countless learning tasks require dealing with sequential data. Image
captioning, speech synthesis, and music generation all require that a model
produce outputs that are sequences. In other domains, such as time series
prediction, video analysis, and musical information retrieval, a model must
learn from inputs that are sequences. Interactive tasks, such as translating
natural language, engaging in dialogue, and controlling a robot, often demand
both capabilities. Recurrent neural networks (RNNs) are connectionist models
that capture the dynamics of sequences via cycles in the network of nodes.
Unlike standard feedforward neural networks, recurrent networks retain a state
that can represent information from an arbitrarily long context window.
Although recurrent neural networks have traditionally been difficult to train,
and often contain millions of parameters, recent advances in network
architectures, optimization techniques, and parallel computation have enabled
successful large-scale learning with them. In recent years, systems based on
long short-term memory (LSTM) and bidirectional recurrent neural network (BRNN)
architectures have
demonstrated ground-breaking performance on tasks as varied as image
captioning, language translation, and handwriting recognition. In this survey,
we review and synthesize the research that over the past three decades first
yielded and then made practical these powerful learning models. When
appropriate, we reconcile conflicting notation and nomenclature. Our goal is to
provide a self-contained explication of the state of the art together with a
historical perspective and references to primary research
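A minimal sketch of the recurrence the survey reviews (sizes and weights are arbitrary toy values): the hidden state is updated through the network's cycle at every step, so the final state can reflect an arbitrarily long input context.

```python
import numpy as np

# Sketch of a vanilla recurrent step: the state h is fed back through the
# cycle W_hh at every step, so the final state summarizes the whole sequence.
rng = np.random.default_rng(3)
W_hh = rng.normal(size=(4, 4)) * 0.5      # recurrent (cycle-forming) weights
W_xh = rng.normal(size=(4, 2)) * 0.5      # input weights

def rnn(xs):
    h = np.zeros(4)                       # the retained state
    for x in xs:                          # one step per sequence element
        h = np.tanh(W_hh @ h + W_xh @ x)  # new state depends on the old state
    return h

print(rnn(rng.normal(size=(10, 2))))      # final state after a length-10 input
```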
A Theory of Local Learning, the Learning Channel, and the Optimality of Backpropagation
In a physical neural system, where storage and processing are intimately
intertwined, the rules for adjusting the synaptic weights can only depend on
variables that are available locally, such as the activity of the pre- and
post-synaptic neurons, resulting in local learning rules. A systematic
framework for studying the space of local learning rules is obtained by first
specifying the nature of the local variables, and then the functional form that
ties them together into each learning rule. Such a framework also enables the
systematic discovery of new learning rules and exploration of relationships
between learning rules and group symmetries. We study polynomial local learning
rules stratified by their degree and analyze their behavior and capabilities in
both linear and non-linear units and networks. Stacking local learning rules in
deep feedforward networks leads to deep local learning. While deep local
learning can learn interesting representations, it cannot learn complex
input-output functions, even when targets are available for the top layer.
Learning complex input-output functions requires local deep learning where
target information is communicated to the deep layers through a backward
learning channel. The nature of the communicated information about the targets
and the structure of the learning channel partition the space of learning
algorithms. We estimate the learning channel capacity associated with several
algorithms and show that backpropagation outperforms them by simultaneously
maximizing the information rate and minimizing the computational cost, even in
recurrent networks. The theory clarifies the concept of Hebbian learning,
establishes the power and limitations of local learning rules, introduces the
learning channel which enables a formal analysis of the optimality of
backpropagation, and explains the sparsity of the space of learning rules
discovered so far
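A concrete example of a local learning rule in the sense used above (Oja's Hebbian variant, chosen here purely as an illustration and not taken from the paper): the weight update depends only on the presynaptic input x and the postsynaptic output y, with no backward error channel, and it extracts the principal component of the input statistics.

```python
import numpy as np

# Sketch of a local learning rule (Oja's Hebbian variant): the update for w
# uses only the presynaptic activity x and postsynaptic activity y -- no
# backward error signal -- and converges to the principal component of the input.
rng = np.random.default_rng(4)
C = np.diag([3.0, 1.0, 0.5])                   # input covariance
X = rng.multivariate_normal(np.zeros(3), C, size=20000)
w = rng.normal(size=3)

eta = 0.01
for x in X:
    y = w @ x                                  # postsynaptic activity
    w += eta * y * (x - y * w)                 # local Hebbian update (Oja's rule)

print(w)   # roughly aligned with the top eigenvector of C, i.e. (+-1, 0, 0)
```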
A Survey on Methods and Theories of Quantized Neural Networks
Deep neural networks are the state-of-the-art methods for many real-world
tasks, such as computer vision, natural language processing and speech
recognition. For all their popularity, deep neural networks are also criticized
for consuming a lot of memory and draining battery life of devices during
training and inference. This makes it hard to deploy these models on mobile or
embedded devices which have tight resource constraints. Quantization is
recognized as one of the most effective approaches to satisfy the extreme
memory requirements that deep neural network models demand. Instead of adopting
32-bit floating point format to represent weights, quantized representations
store weights using more compact formats such as integers or even binary
numbers. Despite a possible degradation in predictive performance, quantization
provides a potential solution to greatly reduce the model size and the energy
consumption. In this survey, we give a thorough review of different aspects of
quantized neural networks. Current challenges and trends of quantized neural
networks are also discussed. Comment: 17 pages, 8 figures
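A minimal sketch of the kind of compact representation the survey discusses (a generic symmetric 8-bit scheme with assumed shapes, not a specific method from the survey): weights are stored as int8 codes plus one floating-point scale and dequantized for computation.

```python
import numpy as np

# Sketch of symmetric 8-bit weight quantization: store int8 codes plus one
# float scale instead of 32-bit floats, and dequantize for computation.
rng = np.random.default_rng(5)
w = rng.normal(size=(256, 256)).astype(np.float32)   # "fp32" weight matrix

scale = np.abs(w).max() / 127.0                      # map to the range [-127, 127]
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_hat = q.astype(np.float32) * scale                 # dequantized weights

print(q.nbytes / w.nbytes)                           # 0.25: 4x smaller storage
print(np.abs(w - w_hat).max())                       # worst-case rounding error
```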
Parallel Complexity of Forward and Backward Propagation
We show that the forward and backward propagation can be formulated as a
solution of lower and upper triangular systems of equations. For standard
feedforward (FNNs) and recurrent neural networks (RNNs) the triangular systems
are always block bi-diagonal, while for a general computation graph (directed
acyclic graph) they can have a more complex triangular sparsity pattern. We
discuss direct and iterative parallel algorithms that can be used for their
solution and interpreted as different ways of performing model parallelism.
Also, we show that for FNNs and RNNs with n layers and t time steps the
backward propagation can be performed in parallel in O(log n) and
O(log n log t) steps, respectively. Finally, we outline the generalization of
this technique using Jacobians that potentially allows us to handle arbitrary
layers. Comment: 18 pages
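A minimal sketch of the triangular-system view for a small linear network (linear layers and toy sizes are assumptions): stacking the layer activations turns the forward pass into one block bi-diagonal lower-triangular solve, and the usual layer-by-layer evaluation is exactly forward substitution on that system.

```python
import numpy as np

# Sketch: the forward pass of a 3-layer linear network as one block bi-diagonal
# lower-triangular system; layer-by-layer evaluation is forward substitution.
rng = np.random.default_rng(6)
d, L = 4, 3
Ws = [rng.normal(size=(d, d)) for _ in range(L)]
x = rng.normal(size=d)

a, seq = x, []                            # sequential forward pass a_k = W_k a_{k-1}
for W in Ws:
    a = W @ a
    seq.append(a)

A = np.eye(L * d)                         # block system (I - N) [a_1;...;a_L] = b
for k in range(1, L):
    A[k * d:(k + 1) * d, (k - 1) * d:k * d] = -Ws[k]   # sub-diagonal blocks -W_{k+1}
b = np.zeros(L * d)
b[:d] = Ws[0] @ x                         # only the first block row sees the input

print(np.allclose(np.linalg.solve(A, b), np.concatenate(seq)))   # True
```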
Mean Field Theory of Activation Functions in Deep Neural Networks
We present a Statistical Mechanics (SM) model of deep neural networks,
connecting the energy-based and the feedforward network (FFN) approaches. We
infer that an FFN can be understood as performing three basic steps: encoding,
representation validation and propagation. From the mean-field solution of the
model, we obtain a set of natural activations -- such as Sigmoid and ReLU --
together with the state-of-the-art Swish; this represents the expected
information propagating through the network and tends to ReLU in the limit of
zero noise. We study the spectrum of the Hessian on an associated
classification task, showing that Swish allows for more consistent performance
over a wider range of network architectures. Comment: Presented at the ICML
2019 Workshop on Theoretical Physics for Deep Learning
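A quick numerical sketch of the limiting behaviour mentioned above, using the common parameterization Swish(x) = x * sigmoid(beta * x) (the grid and beta values are arbitrary): as beta grows the activation approaches ReLU.

```python
import numpy as np

# Sketch: Swish(x) = x * sigmoid(beta * x) approaches ReLU(x) = max(x, 0)
# as beta grows, matching the zero-noise limit described above.
x = np.linspace(-4, 4, 801)
relu = np.maximum(x, 0.0)
for beta in (1.0, 3.0, 10.0):
    swish = x / (1.0 + np.exp(-beta * x))        # x * sigmoid(beta * x)
    print(beta, np.max(np.abs(swish - relu)))    # the gap shrinks as beta grows
```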
Biological credit assignment through dynamic inversion of feedforward networks
Learning depends on changes in synaptic connections deep inside the brain. In
multilayer networks, these changes are triggered by error signals fed back from
the output, generally through a stepwise inversion of the feedforward
processing steps. The gold standard for this process -- backpropagation --
works well in artificial neural networks, but is biologically implausible.
Several recent proposals have emerged to address this problem, but many of
these biologically-plausible schemes are based on learning an independent set
of feedback connections. This complicates the assignment of errors to each
synapse by making it dependent upon a second learning problem, and by fitting
inversions rather than guaranteeing them. Here, we show that feedforward
network transformations can be effectively inverted through dynamics. We derive
this dynamic inversion from the perspective of feedback control, where the
forward transformation is reused and dynamically interacts with fixed or random
feedback to propagate error signals during the backward pass. Importantly, this
scheme does not rely upon a second learning problem for feedback because
accurate inversion is guaranteed through the network dynamics. We map these
dynamics onto generic feedforward networks, and show that the resulting
algorithm performs well on several supervised and unsupervised datasets.
Finally, we discuss potential links between dynamic inversion and second-order
optimization. Overall, our work introduces an alternative perspective on credit
assignment in the brain, and proposes a special role for temporal dynamics and
feedback control during learning. Comment: 34th Conference on Neural
Information Processing Systems (NeurIPS 2020), Vancouver, Canada
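A minimal sketch of inversion through dynamics for a single linear map (the map, its conditioning, and the step size are assumed toy choices, not the paper's networks): the forward weights are reused in a feedback loop whose fixed point is the preimage of the target output, so no separate inverse or learned feedback is required.

```python
import numpy as np

# Sketch: invert a forward map W through dynamics that reuse W itself.
# h <- h + eta * W^T (y - W h) has its fixed point at W h = y, so the loop
# recovers the preimage of y without forming an explicit inverse.
rng = np.random.default_rng(7)
U, _ = np.linalg.qr(rng.normal(size=(5, 5)))
V, _ = np.linalg.qr(rng.normal(size=(5, 5)))
W = U @ np.diag(np.linspace(0.5, 2.0, 5)) @ V.T   # forward map, known conditioning
y = rng.normal(size=5)                            # output whose preimage we want

h, eta = np.zeros(5), 0.25
for _ in range(500):
    h += eta * W.T @ (y - W @ h)                  # feedback-control dynamics

print(np.allclose(W @ h, y, atol=1e-6))           # True: dynamics inverted W
```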
Activation Relaxation: A Local Dynamical Approximation to Backpropagation in the Brain
The backpropagation of error algorithm (backprop) has been instrumental in
the recent success of deep learning. However, a key question remains as to
whether backprop can be formulated in a manner suitable for implementation in
neural circuitry. The primary challenge is to ensure that any candidate
formulation uses only local information, rather than relying on global signals
as in standard backprop. Recently several algorithms for approximating backprop
using only local signals have been proposed. However, these algorithms
typically impose other requirements which challenge biological plausibility:
for example, requiring complex and precise connectivity schemes, or multiple
sequential backwards phases with information being stored across phases. Here,
we propose a novel algorithm, Activation Relaxation (AR), which is motivated by
constructing the backpropagation gradient as the equilibrium point of a
dynamical system. Our algorithm converges rapidly and robustly to the correct
backpropagation gradients, requires only a single type of computational unit,
utilises only a single parallel backwards relaxation phase, and can operate on
arbitrary computation graphs. We illustrate these properties by training deep
neural networks on visual classification tasks, and describe simplifications to
the algorithm which remove further obstacles to neurobiological implementation
(for example, the weight-transport problem, and the use of nonlinear
derivatives), while preserving performance. Comment: initial upload; revised
version (updated abstract, related work) 28-09-20; 05/10/20: revised for ICLR
submission
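A minimal sketch of the fixed-point view for a toy two-layer tanh network (all sizes and the squared-error loss are assumptions): relaxing each layer's activity variable toward the local Jacobian-transposed signal from the layer above converges to exactly the backpropagation gradients.

```python
import numpy as np

# Sketch: the backprop activation gradients are the equilibrium of a simple
# relaxation x_l <- x_l + eta * (-x_l + J_l^T x_{l+1}), where J_l is the local
# forward Jacobian of layer l+1 with respect to its input.
rng = np.random.default_rng(8)
W1 = rng.normal(size=(4, 6)) * 0.5
W2 = rng.normal(size=(3, 4)) * 0.5
a0 = rng.normal(size=6)
a1 = np.tanh(W1 @ a0)                     # forward pass
a2 = np.tanh(W2 @ a1)
target = rng.normal(size=3)

J1 = np.diag(1 - a1 ** 2) @ W1            # da1/da0
J2 = np.diag(1 - a2 ** 2) @ W2            # da2/da1

x2 = a2 - target                          # dL/da2 for L = 0.5 ||a2 - target||^2
x1, x0 = np.zeros(4), np.zeros(6)         # relax toward dL/da1 and dL/da0
eta = 0.2
for _ in range(200):                      # single parallel relaxation phase
    x1 += eta * (-x1 + J2.T @ x2)
    x0 += eta * (-x0 + J1.T @ x1)

print(np.allclose(x0, J1.T @ J2.T @ x2, atol=1e-6))   # matches backprop gradients
```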