Expressing linear equality constraints in feedforward neural networks
We seek to impose linear, equality constraints in feedforward neural
networks. As top layer predictors are usually nonlinear, this is a difficult
task if we seek to deploy standard convex optimization methods and strong
duality. To overcome this, we introduce a new saddle-point Lagrangian with
auxiliary predictor variables on which constraints are imposed. Elimination of
the auxiliary variables leads to a dual minimization problem on the Lagrange
multipliers introduced to satisfy the linear constraints. This minimization
problem is combined with the standard learning problem on the weight matrices.
From this theoretical line of development, we obtain the surprising
interpretation of Lagrange parameters as additional, penultimate layer hidden
units with fixed weights stemming from the constraints. Consequently, standard
minimization approaches can be used despite the inclusion of Lagrange
parameters -- a very satisfying, albeit unexpected, discovery. Examples ranging from multi-label classification to constrained autoencoders are envisaged for future work.
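As a rough illustration of the flavour of this construction (not the paper's actual formulation), the numpy sketch below enforces a linear equality constraint A y = b on a small network's output by solving for the multipliers in closed form, so that they enter the prediction only through the fixed weights A.T supplied by the constraints; all names, dimensions and values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup (not the paper's formulation): a one-hidden-layer network
# whose m-dimensional output should satisfy the linear equality constraint A @ y == b.
n_in, n_hid, m = 5, 16, 3
A = rng.standard_normal((2, m))          # two constraints on three outputs
b = rng.standard_normal(2)

W1 = rng.standard_normal((n_hid, n_in)) * 0.1
W2 = rng.standard_normal((m, n_hid)) * 0.1

def forward(x):
    h = np.tanh(W1 @ x)                  # nonlinear hidden layer
    y = W2 @ h                           # unconstrained prediction
    # Multipliers obtained in closed form; they feed back into the output only
    # through the fixed weights A.T, i.e. they behave like extra penultimate
    # units whose weights are dictated by the constraints.
    lam = np.linalg.solve(A @ A.T, b - A @ y)
    return y + A.T @ lam

x = rng.standard_normal(n_in)
y = forward(x)
print("constraint residual:", np.abs(A @ y - b).max())   # ~0 up to round-off
```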
Learning and generalization in radial basis function networks
The aim of supervised learning is to approximate an unknown target function
by adjusting the parameters of a learning model in response to possibly noisy
examples generated by the target function. The performance of the learning model
at this task can be quantified by examining its generalization ability. Initially the
concept of generalization is reviewed, and various methods of measuring it, such as
generalization error, prediction error, PAC learning and the evidence, are discussed
and the relations between them examined. Some of these relations are dependent
on the architecture of the learning model. Two architectures are prevalent in practical supervised learning: the multi-layer perceptron (MLP) and the radial basis function network (RBF). While the RBF has previously been examined from a worst-case perspective, this gives little insight into the performance and phenomena that can be expected in the typical case. This thesis focusses on the properties of learning and generalization that can be expected on average in the RBF. There are two methods in use for training the RBF. The basis functions can be
fixed in advance, utilising an unsupervised learning algorithm, or can adapt during
the training process. For the case in which the basis functions are fixed, the
typical generalization error given a data set of particular size is calculated by
employing the Bayesian framework. The effects of noisy data and regularization
are examined, the optimal settings of the parameters that control the learning
process are calculated, and the consequences of a mismatch between the learning
model and the data-generating mechanism are demonstrated. The second case, in which the basis functions are adapted, is studied utilising the on-line learning paradigm. The average evolution of the generalization error is calculated in a manner which allows the phenomena of the learning process, such as the specialization of the basis functions, to be elucidated. The three most important stages of training, namely the symmetric phase, the symmetry-breaking phase and the convergence phase, are analyzed in detail; the convergence-phase analysis allows the derivation of maximal and optimal learning rates. Noise on both the inputs and outputs of the data-generating mechanism is introduced, and the consequences examined. Regularization via weight decay is also studied, as are the effects of the learning model being poorly matched to the data generator.
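A minimal numpy sketch of the first training regime described above, assuming fixed Gaussian basis functions and output weights fitted by regularised least squares (weight decay); the target function, centres and noise level are illustrative choices, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

# A noisy 1-D regression task standing in for the unknown target function.
def target(x):
    return np.sin(3 * x)

X = rng.uniform(-1, 1, 200)
y = target(X) + 0.1 * rng.standard_normal(X.size)

centres = np.linspace(-1, 1, 15)          # basis function centres fixed in advance
width = 0.2

def design(x):
    # Gaussian basis functions evaluated at the inputs
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * width ** 2))

Phi = design(X)
weight_decay = 1e-3                        # regularisation parameter
w = np.linalg.solve(Phi.T @ Phi + weight_decay * np.eye(centres.size), Phi.T @ y)

X_test = np.linspace(-1, 1, 500)
gen_error = np.mean((design(X_test) @ w - target(X_test)) ** 2)
print(f"estimated generalization error: {gen_error:.4f}")
```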
On-line learning of non-monotonic rules by simple perceptron
We study the generalization ability of a simple perceptron which learns
unlearnable rules. The rules are presented by a teacher perceptron with a
non-monotonic transfer function. The student is trained in the on-line mode.
The asymptotic behaviour of the generalization error is estimated under various
conditions. Several learning strategies are proposed and improved to obtain the
theoretical lower bound of the generalization error.
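The following toy sketch shows the kind of setup the abstract describes, under assumed parameter values: a simple perceptron student trained on-line on labels from a non-monotonic ("reversed-wedge") teacher. Because no monotonic student can realise the rule, the generalization error plateaus instead of vanishing; the specific strategies and bounds of the paper are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)

N, a, eta = 500, 0.5, 0.05                               # dimension, wedge width, learning rate

B = rng.standard_normal(N); B /= np.linalg.norm(B)       # teacher direction
J = rng.standard_normal(N) * 0.01                        # student weights

def teacher(X):
    u = X @ B                                            # teacher local field, ~ N(0, 1)
    return np.where(np.abs(u) > a, np.sign(u), -np.sign(u))   # non-monotonic output

def gen_error(J, n_test=20000):
    X = rng.standard_normal((n_test, N))
    return np.mean(np.sign(X @ J) != teacher(X))

for t in range(1, 50001):
    x = rng.standard_normal(N)
    y = teacher(x[None, :])[0]
    if np.sign(J @ x) != y:                              # perceptron rule: update on errors only
        J += (eta / np.sqrt(N)) * y * x
    if t % 10000 == 0:
        print(f"examples: {t:6d}   generalization error: {gen_error(J):.3f}")
```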
Effect of time-correlation of input patterns on the convergence of on-line learning
We studied the effects of time correlation between successive patterns on the convergence of on-line learning by a feedforward neural network trained with the backpropagation algorithm. Using chaotic time series as sequences of correlated patterns, we found that an unexpected scaling of the convergence time with the learning parameter emerges when time-correlated patterns accelerate the learning process.
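A small illustrative experiment in this spirit (the architecture, task and convergence criterion are assumptions, not the paper's setup) is to train a tiny network by on-line backpropagation on one-step prediction of the logistic map, presenting the patterns either in temporal order (correlated) or shuffled (decorrelated), and comparing the number of updates until the running error falls below a threshold:

```python
import numpy as np

rng = np.random.default_rng(3)

def logistic_series(n, x0=0.3, r=4.0):
    xs, x = np.empty(n), x0
    for i in range(n):
        xs[i] = x
        x = r * x * (1 - x)
    return xs

series = logistic_series(5001)
inputs, targets = series[:-1], series[1:]               # one-step prediction task

def train(order, eta=0.1, n_hidden=8, tol=2e-3, seed=0):
    r = np.random.default_rng(seed)
    w1, b1 = r.standard_normal(n_hidden) * 0.5, np.zeros(n_hidden)   # 1-D input -> hidden
    w2, b2 = r.standard_normal(n_hidden) * 0.5, 0.0                  # hidden -> output
    ema = 1.0                                                        # running squared error
    for t, idx in enumerate(np.tile(order, 10), start=1):            # up to 10 passes
        x, y = inputs[idx], targets[idx]
        h = np.tanh(w1 * x + b1)
        err = w2 @ h + b2 - y
        grad_h = err * w2 * (1 - h ** 2)                             # backpropagated error
        w2 -= eta * err * h; b2 -= eta * err
        w1 -= eta * grad_h * x; b1 -= eta * grad_h
        ema = 0.99 * ema + 0.01 * err ** 2
        if ema < tol:
            return t                                                 # updates until "converged"
    return t

print("updates to converge, correlated order:", train(np.arange(len(inputs))))
print("updates to converge, shuffled order:  ", train(rng.permutation(len(inputs))))
```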
Powerpropagation: A sparsity inducing weight reparameterisation
The training of sparse neural networks is becoming an increasingly important tool
for reducing the computational footprint of models at training and evaluation, as well as enabling the effective scaling up of models. Whereas much work over the
years has been dedicated to specialised pruning techniques, little attention has
been paid to the inherent effect of gradient-based training on model sparsity. In
this work, we introduce Powerpropagation, a new weight-parameterisation for
neural networks that leads to inherently sparse models. Exploiting the behaviour
of gradient descent, our method gives rise to weight updates exhibiting a "rich get richer" dynamic, leaving low-magnitude parameters largely unaffected by learning.
Models trained in this manner exhibit similar performance, but have a distribution
with markedly higher density at zero, allowing more parameters to be pruned safely.
Powerpropagation is general, intuitive, cheap and straightforward to implement
and can readily be combined with various other techniques. To highlight its versatility, we explore it in two very different settings: Firstly, following a recent
line of work, we investigate its effect on sparse training for resource-constrained
settings. Here, we combine Powerpropagation with a traditional weight-pruning
technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing
superior performance on the ImageNet benchmark. Secondly, we advocate the use
of sparsity in overcoming catastrophic forgetting, where compressed representations allow accommodating a large number of tasks at fixed model capacity. In all
cases our reparameterisation considerably increases the efficacy of the off-the-shelf methods.
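A minimal sketch of the reparameterisation idea on over-parameterised linear regression (a toy stand-in for a network layer); alpha, the problem sizes and the pruning threshold are illustrative choices, not values from the paper. Each weight is written as w = phi*|phi|^(alpha-1), so the chain rule scales the update on phi by alpha*|phi|^(alpha-1) and low-magnitude parameters are barely moved:

```python
import numpy as np

rng = np.random.default_rng(4)

n, d, alpha, eta, steps = 50, 200, 2.0, 0.01, 10000

w_true = np.zeros(d); w_true[:5] = rng.standard_normal(5)   # sparse ground truth
X = rng.standard_normal((n, d))
y = X @ w_true

def train(powerprop):
    phi = rng.standard_normal(d) * 1e-3
    for _ in range(steps):
        # Powerpropagation writes each weight as w = phi * |phi|^(alpha - 1)
        w = phi * np.abs(phi) ** (alpha - 1) if powerprop else phi
        grad_w = X.T @ (X @ w - y) / n
        # the chain rule multiplies the raw gradient by alpha * |phi|^(alpha - 1),
        # so low-magnitude parameters are barely moved ("rich get richer")
        grad_phi = grad_w * alpha * np.abs(phi) ** (alpha - 1) if powerprop else grad_w
        phi -= eta * grad_phi
    return phi * np.abs(phi) ** (alpha - 1) if powerprop else phi

for powerprop in (False, True):
    w = train(powerprop)
    mse = np.mean((X @ w - y) ** 2)
    prunable = int(np.sum(np.abs(w) < 1e-2))
    print(f"powerprop={powerprop}: fit MSE={mse:.2e}, weights below 1e-2: {prunable}/{d}")
```

In this toy setting the reparameterised run concentrates its weight mass on a few coordinates while plain gradient descent spreads it out, which is the property a subsequent pruning step can exploit.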
A mathematical analysis of the effects of Hebbian learning rules on the dynamics and structure of discrete-time random recurrent neural networks
We present a mathematical analysis of the effects of Hebbian learning in
random recurrent neural networks, with a generic Hebbian learning rule
including passive forgetting and different time scales for neuronal activity
and learning dynamics. Previous numerical works have reported that Hebbian
learning drives the system from chaos to a steady state through a sequence of
bifurcations. Here, we interpret these results mathematically and show that
these effects, involving a complex coupling between neuronal dynamics and
synaptic graph structure, can be analyzed using Jacobian matrices, which
introduce both a structural and a dynamical point of view on the neural network
evolution. Furthermore, we show that the sensitivity to a learned pattern is
maximal when the largest Lyapunov exponent is close to 0. We discuss how neural networks may take advantage of this regime of high functional interest.
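The sketch below illustrates the ingredients named in the abstract under assumed parameter values: a discrete-time random recurrent network, a Hebbian rule with passive forgetting acting on a slower time scale, and an estimate of the largest Lyapunov exponent obtained from products of the Jacobian matrices of the map. It is only a numerical illustration, not the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(5)

N, g, eps, forget = 100, 3.0, 0.02, 0.002           # network size, gain, learning rate, forgetting

W = rng.standard_normal((N, N)) / np.sqrt(N)        # random synaptic matrix
x = rng.uniform(-1, 1, N)

def largest_lyapunov(W, x, n_steps=500):
    # propagate a tangent vector through the Jacobians of the map x -> tanh(g W x)
    u = rng.standard_normal(N); u /= np.linalg.norm(u)
    acc = 0.0
    for _ in range(n_steps):
        x = np.tanh(g * W @ x)
        J = (1 - x ** 2)[:, None] * (g * W)         # Jacobian at the current point
        u = J @ u
        nrm = np.linalg.norm(u)
        acc += np.log(nrm); u /= nrm
    return acc / n_steps

for step in range(1201):
    if step % 200 == 0:
        print(f"after {step:4d} Hebbian updates: largest Lyapunov exponent ~ {largest_lyapunov(W, x):+.3f}")
    x_new = np.tanh(g * W @ x)
    # Hebbian term driven by the fast neuronal activity, passive forgetting on the slow synaptic time scale
    W = (1 - forget) * W + (eps / N) * np.outer(x_new, x)
    x = x_new
```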
The Importance of Clipping in Neurocontrol by Direct Gradient Descent on the Cost-to-Go Function and in Adaptive Dynamic Programming
In adaptive dynamic programming, neurocontrol and reinforcement learning, the
objective is for an agent to learn to choose actions so as to minimise a total
cost function. In this paper we show that when discretized time is used to
model the motion of the agent, it can be very important to do "clipping" on the
motion of the agent in the final time step of the trajectory. By clipping we
mean that the final time step of the trajectory is to be truncated such that
the agent stops exactly at the first terminal state reached, and no distance
further. We demonstrate that when clipping is omitted, learning performance can
fail to reach the optimum; and when clipping is done properly, learning
performance can improve significantly.
The clipping problem we describe affects algorithms which use explicit
derivatives of the model functions of the environment to calculate a learning
gradient. These include Backpropagation Through Time for Control, and methods
based on Dual Heuristic Dynamic Programming. However the clipping problem does
not significantly affect methods based on Heuristic Dynamic Programming,
Temporal Differences or Policy Gradient Learning algorithms. Similarly, the
clipping problem does not affect fixed-length finite-horizon problems.
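A one-dimensional toy illustration of the clipping idea (not the paper's benchmark or algorithm): an agent moves towards a terminal boundary in discrete time, and the cost of the final step is either charged in full or truncated to the fraction of the step actually needed to reach the boundary.

```python
# Toy 1-D illustration: the agent moves right at a chosen speed and the episode
# ends when it reaches the terminal boundary at x >= 1.
def rollout(speed, dt=0.25, step_cost=1.0, clip=True):
    x, total_cost = 0.0, 0.0
    while x < 1.0:
        dx = speed * dt
        if clip and x + dx > 1.0:
            # clipping: truncate the final step so the agent stops exactly at the
            # first terminal state reached, and charge only that fraction of the step
            total_cost += step_cost * (1.0 - x) / dx
            x = 1.0
        else:
            total_cost += step_cost
            x += dx
    return total_cost

for clip in (False, True):
    costs = [round(rollout(speed, clip=clip), 3) for speed in (0.85, 0.9, 0.95, 1.0)]
    print(f"clip={clip}: cost-to-go for speeds 0.85, 0.9, 0.95, 1.0 -> {costs}")
```

Without clipping the cost-to-go in this toy example is a staircase in the action (here, the speed), which is the kind of behaviour that misleads methods relying on explicit model derivatives; with clipping it varies smoothly down to the exact-arrival case.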
Optimization of the Asymptotic Property of Mutual Learning Involving an Integration Mechanism of Ensemble Learning
We propose an optimization method for mutual learning which, within the framework of on-line learning, converges to the same state as optimal ensemble learning, and we analyze its asymptotic behaviour using methods from statistical mechanics. The proposed model consists of two learning steps: two
students independently learn from a teacher, and then the students learn from
each other through mutual learning. In mutual learning, the students learn from each other and the generalization error is improved even though the teacher takes no part in this stage. However, when the initial overlaps (direction cosines) between the teacher and the students differ, the student with the larger initial overlap tends to end up with a larger generalization error than it had before the mutual learning. To overcome this problem, our proposed optimization method tunes the step sizes of the two students to minimize the asymptotic generalization error. Consequently, the optimized mutual learning converges to a generalization error identical to that
of the optimal ensemble learning. In addition, we show the relationship between
the optimum step size of the mutual learning and the integration mechanism of
the ensemble learning.
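A rough numpy sketch of the two-stage protocol described above (the Hebbian rule, step sizes and training lengths are illustrative choices, not the optimised quantities derived in the paper): two students first learn independently from a teacher, acquiring different overlaps, and then learn only from each other's outputs.

```python
import numpy as np

rng = np.random.default_rng(6)

N = 1000
B = rng.standard_normal(N); B /= np.linalg.norm(B)           # teacher perceptron

def gen_error(J):
    # for sign perceptrons, the generalization error is arccos(overlap) / pi
    R = J @ B / np.linalg.norm(J)
    return np.arccos(np.clip(R, -1.0, 1.0)) / np.pi

def learn_from_teacher(n_examples, eta=0.5):
    # stage 1: Hebbian on-line learning from the teacher; unequal training
    # lengths give the two students different initial overlaps
    J = np.zeros(N)
    for _ in range(n_examples):
        x = rng.standard_normal(N)
        J += (eta / np.sqrt(N)) * np.sign(B @ x) * x
    return J

J1, J2 = learn_from_teacher(3000), learn_from_teacher(300)
print("before mutual learning:", round(gen_error(J1), 3), round(gen_error(J2), 3))

# stage 2: mutual learning with step sizes eta1, eta2 (the quantities the paper
# optimises); the teacher no longer provides any labels.
eta1, eta2 = 0.3, 0.3
for _ in range(3000):
    x = rng.standard_normal(N)
    y1, y2 = np.sign(J1 @ x), np.sign(J2 @ x)
    J1 += (eta1 / np.sqrt(N)) * y2 * x                        # student 1 learns from student 2
    J2 += (eta2 / np.sqrt(N)) * y1 * x                        # and vice versa
print("after mutual learning: ", round(gen_error(J1), 3), round(gen_error(J2), 3))
```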
Ensemble learning of linear perceptron; Online learning theory
Within the framework of on-line learning, we study the generalization error of an ensemble learning machine that learns from a linear teacher perceptron. The
generalization error achieved by an ensemble of linear perceptrons having
homogeneous or inhomogeneous initial weight vectors is calculated exactly in the thermodynamic limit of a large number of input elements and shows rich
behavior. Our main findings are as follows. For learning with homogeneous
initial weight vectors, the generalization error using an infinite number of linear student perceptrons is only half that of a single linear perceptron, and for a finite number K of linear perceptrons it approaches the infinite-ensemble value as O(1/K). For learning with inhomogeneous initial
weight vectors, it is advantageous to use an approach of weighted averaging
over the output of the linear perceptrons, and we show the conditions under
which the optimal weights are constant during the learning process. The optimal weights depend only on the correlations of the initial weight vectors.
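A toy simulation of the setting (finite sizes and an arbitrary learning rate, far from the thermodynamic limit and not reproducing the exact factors quoted above): K linear students are trained on-line on independent examples from a linear teacher, and the generalization error of the averaged ensemble output is compared with that of a typical single student.

```python
import numpy as np

rng = np.random.default_rng(7)

N, K, eta, steps = 500, 10, 0.3, 2000

B = rng.standard_normal(N) / np.sqrt(N)                              # linear teacher
students = [rng.standard_normal(N) / np.sqrt(N) for _ in range(K)]   # inhomogeneous initial weights

def gen_error(J):
    # for linear outputs y = J.x with x ~ N(0, I), E[(J.x - B.x)^2] / 2 = |J - B|^2 / 2
    return 0.5 * np.sum((J - B) ** 2)

for k in range(K):
    J = students[k]
    for _ in range(steps):
        x = rng.standard_normal(N)
        J += (eta / N) * (B @ x - J @ x) * x                         # on-line LMS update, independent examples
    students[k] = J

single = np.mean([gen_error(J) for J in students])
ensemble = gen_error(np.mean(students, axis=0))                      # averaging outputs = averaging weights (linear)
print(f"mean generalization error of a single student: {single:.4f}")
print(f"generalization error of the K-student average: {ensemble:.4f}")
```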