In this paper we review recent theoretical approaches for analysing the dynamics of on-line learning in multilayer neural networks using methods adopted from statistical physics. The analysis is based on monitoring a set of macroscopic variables from which the generalisation error can be calculated. A closed set of dynamical equations for the macroscopic variables is derived analytically and solved numerically. The theoretical framework is then employed for defining optimal learning parameters and for analysing the incorporation of second order information into the learning process using natural gradient descent and matrix-momentum based methods. We will also briefly explain an extension of the original framework for analysing the case where training examples are sampled with repetition.

Saad, David

English

Aston Publications Explorer

The Theory of On-Line Learning - A Statistical Physics Approach

D. Saad
The Neural Computing Research Group, University of Aston, Birmingham, B4 7ET, UK

1 Introduction

Layered neural networks are powerful nonlinear information processing systems, capable of implementing arbitrary continuous and discrete input-output maps to any desired accuracy, given a sufficient number of hidden nodes and a sufficiently large example set. They have been employed successfully in a variety of regression and classification tasks, and have been studied using a wide range of methods (for a review see Bishop (1995)). On-line learning refers to the iterative modification of the network parameters according to a predetermined training rule, following successive presentations of single training examples, each representing a specific input vector and the corresponding output. On-line learning is one of the leading techniques in training large neural networks, especially via gradient descent on a differentiable error measure.

In this review we focus on the use of methods from non-equilibrium statistical mechanics for analysing on-line learning in multilayer neural networks.
We concentrate on our contribution to this area and show how these methods can be employed to monitor the learning dynamics, particularly the evolution of the generalisation error, to define optimal learning parameters and to devise and examine improved learning methods. For a general review see Saad (1998) and Mace and Coolen (1998).

The paper is organised as follows: In section 2 we will derive a compact description of the training dynamics using a set of macroscopic variables, setting up the main theoretical framework. This will then be employed to derive optimal training parameters (section 3), to examine analytically the efficacy of natural gradient descent (section 4), and to suggest and examine practical alternatives using matrix-momentum based methods. In section 5 we will explain how the method can be extended to handle scenarios where training examples are sampled with repetition. In section 6 we will point to the main remaining open questions.

2 Learning in multilayer neural networks

For setting up the basic framework, as in Saad and Solla (1995a, 1995b), we consider a learning scenario whereby a feed-forward neural network model, the 'student', emulates an unknown mapping, the 'teacher', given examples of the teacher mapping (in this case another feed-forward neural network); here we restrict the derivation and the examples to the noiseless case although more general scenarios where training examples are corrupted by noise may also be considered. This provides a rather general learning scenario since both student and teacher can represent a very broad class of functions. Student performance is typically measured by the generalization error, which is the student's expected error on an unseen example.
The object of training is to minimize the generalization error by adapting the student network's parameters appropriately.

We consider a student mapping from an $N$-dimensional input space $\xi \in \mathbb{R}^N$ onto a scalar function $\sigma(J, \xi) = \sum_{i=1}^{K} g(J_i \cdot \xi)$, which represents a soft committee machine (SCM; Biehl and Schwarze (1995)), where $g(x) \equiv \mathrm{erf}(x/\sqrt{2})$ is the activation function of the hidden units; $J \equiv \{J_i\}_{1 \le i \le K}$ is the set of input-to-hidden adaptive weights for the $K$ hidden nodes and the hidden-to-output weights are set to one. The activation of hidden node $i$ in the student under presentation of the input pattern $\xi^\mu$ is denoted $x_i^\mu = J_i \cdot \xi^\mu$. This configuration preserves most properties of a general multi-layer network and can be extended to accommodate adaptive hidden-to-output weights as shown by Riegler and Biehl (1995).

Training examples are of the form $(\xi^\mu, \zeta^\mu)$ where $\mu = 1, 2, \ldots$ labels each independently drawn example in a sequence. Components of the independently drawn input vectors $\xi^\mu$ are uncorrelated random variables with zero mean and unit variance. The corresponding output $\zeta^\mu$ is given by a teacher of a similar configuration to the student except for a possible difference in the number $M$ of hidden units: $\zeta^\mu = \sum_{n=1}^{M} g(B_n \cdot \xi^\mu)$, where $B \equiv \{B_n\}_{1 \le n \le M}$ is the set of input-to-hidden adaptive weights for teacher hidden nodes. The activation of hidden node $n$ in the teacher under presentation of the input pattern $\xi^\mu$ is denoted $y_n^\mu = B_n \cdot \xi^\mu$. Indices $i, j, k$ and $n, m$ refer to student and teacher units respectively.

The error made by the student is given by the quadratic deviation,

$$\epsilon(J^\mu, \xi^\mu) \equiv \frac{1}{2}\left[\sigma(J^\mu, \xi^\mu) - \zeta^\mu\right]^2 = \frac{1}{2}\left[\sum_{i=1}^{K} g(x_i^\mu) - \sum_{n=1}^{M} g(y_n^\mu)\right]^2. \qquad (1)$$

This training error is then used to define the learning dynamics via a gradient descent rule for the update of student weights $J_i^{\mu+1} = J_i^\mu + \frac{\eta}{N}\,\delta_i^\mu\,\xi^\mu$, where $\delta_i^\mu \equiv g'(x_i^\mu)\left[\sum_{n=1}^{M} g(y_n^\mu) - \sum_{j=1}^{K} g(x_j^\mu)\right]$ and the learning rate $\eta$ has been scaled with the input size $N$.
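The student, teacher and update rule defined above can be tried out in a small finite-size simulation. The following is a minimal sketch, not the procedure used in the paper: it assumes Gaussian input components (the text only requires zero mean and unit variance), a finite input dimension, and a Monte Carlo estimate of the error on fresh inputs. All function names (`online_step`, `empirical_error`, `simulate`) are hypothetical.

```python
import math
import random


def g(x):
    """Hidden-unit activation g(x) = erf(x / sqrt(2))."""
    return math.erf(x / math.sqrt(2.0))


def g_prime(x):
    """Derivative of g: sqrt(2 / pi) * exp(-x^2 / 2)."""
    return math.sqrt(2.0 / math.pi) * math.exp(-0.5 * x * x)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def output(W, xi):
    """Soft committee machine output: sum over hidden units of g(W_i . xi)."""
    return sum(g(dot(w, xi)) for w in W)


def online_step(J, B, xi, eta):
    """Apply J_i <- J_i + (eta / N) * delta_i * xi in place for one example."""
    N = len(xi)
    x = [dot(Ji, xi) for Ji in J]                    # student activations x_i
    residual = output(B, xi) - sum(g(v) for v in x)  # teacher minus student output
    for Ji, x_i in zip(J, x):
        step = eta / N * g_prime(x_i) * residual     # (eta / N) * delta_i
        for k in range(N):
            Ji[k] += step * xi[k]


def empirical_error(J, B, rng, N, samples=500):
    """Monte Carlo estimate of the generalization error on fresh inputs."""
    total = 0.0
    for _ in range(samples):
        xi = [rng.gauss(0.0, 1.0) for _ in range(N)]  # assumption: Gaussian inputs
        total += 0.5 * (output(J, xi) - output(B, xi)) ** 2
    return total / samples


def simulate(N=50, K=3, M=2, eta=1.0, alpha_max=20.0, seed=0):
    """Train a K-unit student on an M-unit teacher; return (error before, error after)."""
    rng = random.Random(seed)
    # Teacher with T_nm = n * delta_nm: orthogonal vectors of squared length n.
    B = [[math.sqrt(n + 1) if k == n else 0.0 for k in range(N)] for n in range(M)]
    # Student vectors with random directions and squared lengths drawn from [0, 0.5].
    J = []
    for _ in range(K):
        v = [rng.gauss(0.0, 1.0) for _ in range(N)]
        scale = math.sqrt(rng.uniform(0.0, 0.5) / dot(v, v))
        J.append([scale * c for c in v])
    before = empirical_error(J, B, rng, N)
    for _ in range(int(alpha_max * N)):               # alpha = mu / N
        xi = [rng.gauss(0.0, 1.0) for _ in range(N)]
        online_step(J, B, xi, eta)
    after = empirical_error(J, B, rng, N)
    return before, after
```

Calling `simulate()` returns the estimated error before and after training. At finite $N$ the student-teacher overlaps start small but not exactly at zero, so the run only approximates the idealised initial conditions of the theory.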
Performance on a typical input defines the generalization error $\epsilon_g(J) \equiv \langle \epsilon(J, \xi) \rangle_{\{\xi\}}$ through an average over all possible input vectors $\xi$.

Expressions for the generalization error and learning dynamics have been obtained in the thermodynamic limit ($N \to \infty$), and can be represented by a set of macroscopic variables (order parameters) of the form: $J_i \cdot J_k \equiv Q_{ik}$, $J_i \cdot B_n \equiv R_{in}$, and $B_n \cdot B_m \equiv T_{nm}$, measuring overlaps between student and teacher vectors. The overlaps $R$ and $Q$ become the dynamical variables of the system while $T$ is defined by the task. The learning dynamics is then defined in terms of differential equations for the macroscopic variables with respect to the normalized number of examples $\alpha = \mu/N$ playing the role of a continuous time variable:

$$\frac{dR_{in}}{d\alpha} = \eta\,\phi_{in}, \qquad \frac{dQ_{ik}}{d\alpha} = \eta\,\psi_{ik} + \eta^2\,\upsilon_{ik}, \qquad (2)$$

where $\phi_{in} \equiv \langle \delta_i y_n \rangle_{\{\xi\}}$, $\psi_{ik} \equiv \langle \delta_i x_k + \delta_k x_i \rangle_{\{\xi\}}$ and $\upsilon_{ik} \equiv \langle \delta_i \delta_k \rangle_{\{\xi\}}$. The explicit expressions for $\phi_{in}$, $\psi_{ik}$, $\upsilon_{ik}$ and $\epsilon_g$ depend exclusively on the overlaps $Q$, $R$ and $T$ (Saad and Solla (1995a, 1995b)). Equations (2) depend on a closed set of parameters and can be integrated and iteratively solved, providing a full description of the evolution of the order parameters, from which the evolution of the generalization error can be derived.

Typical plots of the learning dynamics are presented in Fig. 1. In this example the learning process prunes unnecessary nodes when the student network has excessive resources. A teacher with $M = 2$ hidden units characterized by $T_{nm} = n\,\delta_{nm}$ is to be learned by a student with $K = 3$ hidden units. The initial values of the order parameters are $R_{in} = 0$ for all $i, n$, $Q_{ik} = 0$ for all $i \ne k$, while the norms $Q_{ii}$ of the student vectors are initialized independently from a uniform distribution in the $[0, 0.5]$ interval. The time evolution of the various order parameters is shown in Fig. 1a-c for $\eta = 1$.
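The statement that the generalization error depends only on the overlaps can be made concrete. The sketch below evaluates the closed-form expression for the erf soft committee machine as given in Saad and Solla (1995a, 1995b); the formula itself is not spelled out in this review and is quoted here from that work, and the function name `generalization_error` is hypothetical.

```python
import math


def generalization_error(Q, R, T):
    """Generalization error of an erf soft committee machine from its overlaps.

    Q[i][k] = J_i . J_k (student-student), R[i][n] = J_i . B_n (student-teacher),
    T[n][m] = B_n . B_m (teacher-teacher), all given as nested lists.
    """
    K, M = len(Q), len(T)

    def term(c, a, b):
        # arcsin of the overlap c normalised by the two lengths a and b
        return math.asin(c / math.sqrt((1.0 + a) * (1.0 + b)))

    student = sum(term(Q[i][k], Q[i][i], Q[k][k]) for i in range(K) for k in range(K))
    teacher = sum(term(T[n][m], T[n][n], T[m][m]) for n in range(M) for m in range(M))
    cross = sum(term(R[i][n], Q[i][i], T[n][n]) for i in range(K) for n in range(M))
    return (student + teacher - 2.0 * cross) / math.pi
```

For the $K = 3$, $M = 2$ example, inserting the specialised solution ($Q_{11} = R_{11} = T_{11}$, $Q_{22} = R_{22} = T_{22}$, all other student overlaps zero) makes the three sums cancel, which corresponds to perfect generalization.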
The picture that emerges is one of specialization with increasing $\alpha$; asymptotically the first student node imitates the first teacher node ($Q_{11} = R_{11} = T_{11}$) while ignoring the second one ($R_{12} = 0$), the second student node imitates the second teacher node while ignoring the first one, and the third student node gets eliminated ($Q_{33} = 0$). The off-diagonal components $Q_{ik}$ shown in Fig. 1b indicate that the two surviving student vectors become increasingly uncorrelated. The overlap between student and teacher vectors (Fig. 1c) displays a small $\alpha$ behavior dominated by an undifferentiated symmetric solution, followed by a transition onto the specialization required to obtain perfect generalization. The evolution of the generalization error is shown in Fig. 1d.

Figure 1: Dependence of the overlaps and $\epsilon_g$ on the normalized number of examples $\alpha$, for $K = 3$ and $M = 2$: (a) the lengths of student vectors, (b) the correlation between student vectors, (c) the overlap between various student and teacher vectors, and (d) the generalization error.

3 Optimal learning parameters

On-line methods are often sensitive to the choice of learning parameters and in particular the choice of learning rate; if chosen too large the learning process may diverge, but if $\eta$ is too small then convergence can take an extremely long time. The optimal learning rate will also vary substantially over time and may require annealing asymptotically. Most existing analytical results for defining optimal learning rates concentrate on the asymptotic regime where the system may be linearized.

The naive approach to learning rate optimization is to consider the fastest rate of decrease in generalization error as a measure of optimality. To find the locally optimal learning rate one minimizes $d\epsilon_g/d\alpha$, using Eqs. (2), exploiting the fact that the change in $\epsilon_g$ over time depends exclusively on the overlaps.
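The local optimization just described can be written out explicitly. The following is a sketch that follows from the chain rule and the equations of motion $dR_{in}/d\alpha = \eta\,\phi_{in}$ and $dQ_{ik}/d\alpha = \eta\,\psi_{ik} + \eta^2\,\upsilon_{ik}$. The shorthand $A$, $B$ and $\eta_{\mathrm{loc}}$ is introduced here for illustration and does not appear in the original, and the sums are assumed to run over the independent overlaps.

```latex
\begin{align*}
% Chain rule combined with the equations of motion for R and Q
\frac{d\epsilon_g}{d\alpha}
  &= \eta \underbrace{\left[ \sum_{in} \frac{\partial \epsilon_g}{\partial R_{in}}\,\phi_{in}
        + \sum_{ik} \frac{\partial \epsilon_g}{\partial Q_{ik}}\,\psi_{ik} \right]}_{A}
   + \eta^2 \underbrace{\sum_{ik} \frac{\partial \epsilon_g}{\partial Q_{ik}}\,\upsilon_{ik}}_{B} \\
% Stationarity with respect to eta; this is a minimum when B > 0
\frac{\partial}{\partial \eta}\,\frac{d\epsilon_g}{d\alpha}
  &= A + 2\,\eta\,B = 0
  \quad \Longrightarrow \quad
  \eta_{\mathrm{loc}} = -\frac{A}{2B}
\end{align*}
```

Because $A$ and $B$ depend only on the overlaps at the current value of $\alpha$, this rate is greedy: it takes no account of how the choice of $\eta$ now affects the rest of the trajectory.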
The expression obtained for the locally optimal learning rate may be useful for some phases of the learning process but is useless for others (Rattray and Saad (1998)).

A more appropriate measure of optimality is the total reduction in generalization error over the entire learning process as in Saad and Rattray (1997). With this measure one can then define the globally optimal learning rate in a given time-window $[\alpha_0, \alpha_1]$ to be that which provides the largest decrease in generalization error between these two times:

$$\Delta\epsilon_g(\eta) = \int_{\alpha_0}^{\alpha_1} \frac{d\epsilon_g}{d\alpha}\, d\alpha = \int_{\alpha_0}^{\alpha_1} L(\eta, \alpha)\, d\alpha. \qquad (3)$$

Since the generalization error depends solely on the overlaps $Q$, $R$ and $T$, which are the dynamical variables ($T$ remains fixed here), we can expand the integrand in terms of these variables,

$$L(\eta, \alpha) = \sum_{in} \frac{\partial \epsilon_g}{\partial R_{in}} \frac{dR_{in}}{d\alpha} + \sum_{ik} \frac{\partial \epsilon_g}{\partial Q_{ik}} \frac{dQ_{ik}}{d\alpha} - \sum_{in} \lambda_{in}\left[\frac{dR_{in}}{d\alpha} - \eta\,\phi_{in}\right] - \sum_{ik} \nu_{ik}\left[\frac{dQ_{ik}}{d\alpha} - \eta\,\psi_{ik} - \eta^2\,\upsilon_{ik}\right]. \qquad (4)$$

The last two terms in equation (4) force the correct dynamics using sets of Lagrange multipliers $\lambda_{in}$ and $\nu_{ik}$ corresponding to the equations of motion for $R_{in}$ and $Q_{ik}$ respectively. Variational minimization of the integral in equation (3) with respect to the dynamical variables leads to a set of coupled differential equations for the Lagrange multipliers along with a set of boundary conditions. Solving these equations over the interval $[\alpha_0, \alpha_1]$ determines necessary conditions for $\eta$ to maximize $\Delta\epsilon_g(\eta)$. The theory is completely general and may be employed for different learning parameters (e.g., regularization parameters as in Saad and Rattray (1998), site dependent learning rates), various learning scenarios (structurally unrealisable or where examples are corrupted by noise) and for obtaining optimal learning rules (Rattray and Saad (1997)).

4 Natural Gradient Descent

The same theoretical framework may be used for examining novel training methods. Natural gradient descent (NGD) was recently proposed by Amari (1998) as a principled alternative to standard on-line gradient descent (GD). When learning to emulate a stochastic rule with some probabilistic model, e.g. a feed-forward neural network, NGD has the desirable properties of asymptotic optimality, given a sufficiently rich model which is differentiable with respect to its parameters, and invariance to re-parameterization of our model distribution. These properties are achieved by viewing the parameter space of the model as a Riemannian space in which local distance is defined by the Kullback-Leibler divergence. The Fisher information matrix provides the appropriate metric in this space. If the training error is defined as the negative log-likelihood of the data under our probabilistic model, then the direction of steepest descent in this Riemannian space is found by premultiplying the error gradient with the inverse of the Fisher information matrix; this defines the NGD learning direction.

Studying the learning performance of NGD in the case of isotropic tasks and structurally matched student and teacher ($K = M$ and $T_{nm} = T\,\delta_{nm}$) we determined generic behaviour in terms of task complexity $K$ and non-linearity $T$ (Rattray et al (1998)). An analysis of the transient, using globally optimal learning parameters, reveals that trapping time in the symmetric phase for the NGD optimized system scales as $K^2$, compared to a scaling of $K^{8/3}$ for optimal GD. Asymptotically, NGD saturates the universal bounds on generalization performance and provides a significant improvement over optimized GD, especially for small $T$.

However, in practical applications there will be an increased cost required in estimating and inverting the Fisher information matrix as it requires an average over the input distribution and a matrix inversion. An on-line matrix momentum algorithm (Orr and Leen (1994)) was introduced in order to invert an estimate of the Hessian efficiently on-line. We propose to use this method to compute the inverse of the Fisher information matrix as required for NGD. This method is particularly efficient since the inversion is replaced by a matrix-vector multiplication which can be carried out by a back-propagation step.
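The NGD learning direction described in this section, the error gradient premultiplied by the inverse Fisher information matrix, can be illustrated for a very small soft committee machine. This is a rough sketch under stated assumptions and not the matrix momentum method: it assumes a Gaussian output-noise model of variance `noise_var`, so that the Fisher matrix is the input average of the outer product of output gradients divided by that variance. It estimates the average by Monte Carlo over Gaussian inputs and replaces the on-line inversion with a direct linear solve. All function names are hypothetical.

```python
import math
import random


def g_prime(x):
    """Derivative of g(x) = erf(x / sqrt(2))."""
    return math.sqrt(2.0 / math.pi) * math.exp(-0.5 * x * x)


def grad_output(J, xi):
    """Gradient of the student output w.r.t. all weights, flattened unit by unit."""
    flat = []
    for Ji in J:
        slope = g_prime(sum(a * b for a, b in zip(Ji, xi)))
        flat.extend(slope * c for c in xi)
    return flat


def fisher_estimate(J, N, rng, samples=2000, noise_var=1.0):
    """Monte Carlo estimate of F = <grad_output grad_output^T> / noise_var."""
    P = len(J) * N
    F = [[0.0] * P for _ in range(P)]
    for _ in range(samples):
        v = grad_output(J, [rng.gauss(0.0, 1.0) for _ in range(N)])
        for a in range(P):
            for b in range(P):
                F[a][b] += v[a] * v[b] / (samples * noise_var)
    return F


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def natural_gradient_direction(F, grad):
    """Return F^{-1} grad without forming the inverse explicitly."""
    return solve(F, grad)
```

The explicit estimate costs a number of operations that grows with the square of the number of weights per sample, and the solve grows with its cube. This is the overhead that motivates the on-line alternative discussed in the text.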
Since the true Fisher information matrix will not be known in general we use a single step approximation of it, which can be determined on-line. We compared the efficiency of the proposed matrix momentum NGD with that of standard GD and true NGD in training two-layer networks. It turns out to provide a significant improvement over gradient descent learning but with some sensitivity to parameter choice, due to noise in the Fisher information estimate (Scarpetta et al (1999)).

5 Restricted Training Sets

In a realistic scenario the number of training examples scales with the number of free parameters, and examples are therefore sampled with repetition. This gives rise to correlations between the network parameters and the training examples, which clearly affect the learning process.

One of the most significant aspects of having a fixed example set is the distinction between the two key performance measures: the training error, measuring the network performance with respect to the restricted training set, and the test (generalisation) error, calculated for all possible inputs sampled from the true distribution. The former may be monitored in practical training scenarios, while the latter can only be assessed. Another important aspect of learning from restricted training sets which have been corrupted by noise is the emergence of overfitting and the need to employ regularization techniques (e.g., weight decay, early stopping; see Bishop (1995)).

The fundamental difference between the infinite and restricted training set scenarios is that the joint probability distribution $P(x, y)$ for the student and teacher node activations, which is Gaussian in the former case, takes here a more general form, which depends on the training patterns and changes dynamically during the learning process. In fact, we define $P(x, y)$ as one of the macroscopic variables to be monitored continuously, together with the overlaps $R$ and $Q$ (Coolen and Saad (2000)).
To follow the dynamics, one derives a set of coupled differential equations describing the evolution of the macroscopic variables in the limit $N \to \infty$. This set of equations cannot be closed in general; closure is obtained by invoking the dynamical replica theory. The resulting equations can be solved numerically with some simplifications.

The solutions describe the dynamics of both training and generalization errors (and the various overlaps, Coolen et al (2000), Xiong and Saad (2001)), and provide insight into the link between the number of examples and the breaking of internal symmetries as well as some asymptotic scaling laws. Our ability to provide analytical solutions is limited due to the complexity of the equations; however, such solutions are highly desirable for deriving analytically generic scaling laws in both the symmetric phase and asymptotically, and to make a quantitative link between the noise level and the optimal regularization to be used.

6 Conclusion

We showed how the methods of statistical physics can provide insight into the dynamics of on-line learning as well as play an important role in defining optimal learning parameters and in examining the properties of new learning algorithms. Several open questions remain, for instance, finding principled methods for optimising the generalisation ability in the case of restricted training sets and the dependence of the length of the symmetric phase on the number of training examples.

References

AMARI, S. (1998): Natural Gradient Works Efficiently in Learning. Neural Computation, Vol. 10, 251-276.
BIEHL, M. and SCHWARZE, H. (1995): Learning by Online Gradient Descent. Jour. Phys. A, Vol. 28, 643-656.
BISHOP, C. M. (1995): Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
COOLEN, A. C. C. and SAAD, D. (2000): Dynamics of Learning with Restricted Training Sets. Phys. Rev. E, Vol. 62, 5444-5487.
COOLEN, A. C. C., SAAD, D. and XIONG, Y. (2000): On-line Learning from Restricted Training Sets in Multilayer Neural Networks. Europhys. Lett., Vol. 51, 691-697.
MACE, C. W. H. and COOLEN, A. C. C. (1998): Statistical Mechanical Analysis of the Dynamics of Learning in Perceptrons. Statistics and Computing, Vol. 8, 55-88.
ORR, G. B. and LEEN, T. K. (1994): Using Curvature Information for Fast Stochastic Search. In Cowan, Tesauro and Alspector (Eds.): Advances in Neural Information Processing Systems, NIPS Vol. 6, Morgan Kaufmann, San Mateo CA, 477-484.
RATTRAY, M. and SAAD, D. (1997): Globally Optimal Rules for On-line Learning in Multilayer Networks. Jour. Phys. A, Vol. 30, L771-776.
RATTRAY, M. and SAAD, D. (1998): An analysis of on-line training with optimal learning rates. Phys. Rev. E, Vol. 58, 6379-6391.
RATTRAY, M., SAAD, D. and AMARI, S. (1998): Natural Gradient Descent for On-line Learning. Phys. Rev. Lett., Vol. 81, 5461-5464.
RIEGLER, P. and BIEHL, M. (1995): Online Backpropagation in Two Layered Neural Networks. Jour. Phys. A, Vol. 28, L507-513.
SAAD, D. (Editor) (1998): On-Line Learning in Neural Networks. Publications of the Newton Institute, Cambridge University Press, Cambridge.
SAAD, D. and RATTRAY, M. (1997): Globally Optimal Parameters for On-line Learning in Multilayer Networks. Phys. Rev. Lett., Vol. 79, 2578-2581.
SAAD, D. and RATTRAY, M. (1998): Learning with Regularizers in Multilayer Neural Networks. Phys. Rev. E, Vol. 57, 2170-2176.
SAAD, D. and SOLLA, S. A. (1995a): Exact Solution for On-Line Learning in Multilayer Neural Networks. Phys. Rev. Lett., Vol. 74, 4337-4340.
SAAD, D. and SOLLA, S. A. (1995b): On-Line Learning in Soft Committee Machines. Phys. Rev. E, Vol. 52, 4225-4243.
SCARPETTA, S., RATTRAY, M. and SAAD, D. (1999): Matrix Momentum for Practical Natural Gradient Learning. Jour. Phys. A, Vol. 32, 4047-4059.
XIONG, Y. and SAAD, D. (2001): Noise, Regularizers and Unrealizable Scenarios in On-line Learning From Restricted Training Sets. Submitted.

The theory of on-line learning: a statistical physics approach

https://publications.aston.ac.uk/id/eprint/1302/1/NCRG_2001_010.pdf
