The Levenberg-Marquardt (LM) learning algorithm is a popular algorithm for training neural networks; however, for large neural networks, it becomes prohibitively expensive in terms of running time and memory requirements. The most time-critical step of the algorithm is the calculation of the Gauss-Newton matrix, which is formed by multiplying two large Jacobian matrices together. We propose a method that uses backpropagation to reduce the time of this matrix-matrix multiplication. This reduces the overall asymptotic running time of the LM algorithm by a factor of the order of the number of output nodes in the neural network.

Alonso, E.

Fairbank, M.

City Research Online

Fairbank, M. & Alonso, E. (2012). Efficient Calculation of the Gauss-Newton Approximation of the Hessian Matrix in Neural Networks. Neural Computation, 24(3), pp. 607-610. doi: 10.1162/NECO_a_00248

Permanent City Research Online URL: http://openaccess.city.ac.uk/4369/

NOTE
Communicated by Nicol Schraudolph

Efficient Calculation of the Gauss-Newton Approximation of the Hessian Matrix in Neural Networks

Michael Fairbank (michael.fairbank1@city.ac.uk)
Eduardo Alonso (E.Alonso@city.ac.uk)
Department of Computing, School of Informatics, City University London, London EC1V 0HB, U.K.

Neural Computation 24, 607–610 (2012). © 2011 Massachusetts Institute of Technology

1 Introduction

A neural network is a smooth function $y = y(x, w)$ that maps an input column vector $x$ to an output column vector $y$ and where $w$ is a parameter vector known as the weight vector.

For the specific input and output vectors $x_p$ and $y_p$, corresponding to a training pattern $p$, the Jacobian matrix of the neural network is defined to be $J_p = \frac{\partial y_p}{\partial w}$, which is a matrix with element $(i, j)$ equal to $\frac{\partial (y_p)_i}{\partial (w)_j}$. The Gauss-Newton matrix is defined to be $G = \sum_p G_p$, where $G_p = J_p^T J_p$. We define $n_w = \dim(w)$, $n_o = \dim(y)$ and $n_p$ as the number of training patterns. Then $J_p$ is an $n_o \times n_w$ matrix, and so forming the matrix $G$ by direct matrix multiplication and summation over all patterns would take $2 n_o n_p n_w^2$ floating point operations (flops), ignoring lower-power terms.

We define a technique that can calculate the $G$ matrix in the faster time of approximately $3 n_p n_w^2$ flops (ignoring lower-power terms). This faster algorithm is related to the method of Schraudolph (2002) and exploits a trick that backpropagation (Werbos, 1974; Rumelhart, Hinton, & Williams, 1986) can be used to quickly multiply an arbitrary column vector on the left by $J_p^T$.

Forming the $G$ matrix is important because it is central to the Levenberg-Marquardt (LM) training algorithm (Levenberg, 1944; Marquardt, 1963). The LM algorithm uses a weight update that requires the inverse of $G$. Details are given by Bishop (1995). Since $G \in \mathbb{R}^{n_w \times n_w}$, the inversion of $G$ will take time $O(n_w^3)$, and since usually $n_p \gg n_w$, it turns out that the formation of the matrix $G$ is usually slower than its inversion. Hence, our algorithm is reducing the asymptotic time of the most time-critical step of the LM algorithm. Previous research to speed up the formation of $G$ has concentrated on parallel implementations (Suri, Deodhare, & Nagabhushan, 2002).

2 The Technique

Backpropagation is an algorithm to calculate the gradient $\frac{\partial E_p}{\partial w}$ very efficiently for a given pattern $p$ and error function, $E_p$. If we assume the computations at the nodes of the network are dwarfed by those at the network weights, then the backpropagation algorithm takes $3 n_w$ flops per pattern.

By the chain rule, $\frac{\partial E_p}{\partial w} = \left(\frac{\partial y}{\partial w}\right)^T \frac{\partial E_p}{\partial y} = J_p^T \frac{\partial E_p}{\partial y}$. Hence, we see that backpropagation can be used to multiply a column vector, $\frac{\partial E_p}{\partial y}$, very efficiently on the left by the transposed Jacobian matrix. The choice of column vector here is arbitrary; it does not have to specifically be $\frac{\partial E_p}{\partial y}$. This is the trick we use to create our fast algorithm for calculating $G$.

A standard method to calculate the Jacobian matrix is as follows. To calculate the $i$th row of $J_p$, we use backpropagation to multiply $J_p^T$ by the $i$th column of $I$, an $n_o \times n_o$ identity matrix. Repeating this for all $i \in \{1, 2, \ldots, n_o\}$ outputs will calculate the full $J_p$ matrix in $3 n_o n_w$ flops.

The new method to calculate the $G_p$ matrix is as follows. Since $G_p = J_p^T J_p$, the $i$th column of $G_p$ is equal to the product of the matrix $J_p^T$ with the $i$th column of $J_p$. Hence each column of $G_p$ can be calculated using one pass of backpropagation. Therefore, calculating the whole $G_p$ matrix from a given $J_p$ matrix takes $3 n_w^2$ flops.

In addition to the time taken to calculate $J_p$ and $G_p$, we also need one initial forward pass through the network, which will take $2 n_w$ flops. Hence, the total flop count to calculate $G$, when summing over all $n_p$ patterns, is $n_p (2 n_w + 3 n_o n_w + 3 n_w^2)$.
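The column-by-column construction described above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: it assumes a toy network with one tanh hidden layer, linear outputs, and no biases, and the function names (`forward`, `backprop_JT_times`, `jacobian`, `gauss_newton_backprop`, `gauss_newton_direct`) are hypothetical. It handles a single pattern; $G$ would be the sum of such matrices over all patterns.

```python
import math
import random

def forward(x, W1, W2):
    """Toy network (an assumption): h = tanh(W1 x), y = W2 h, no biases."""
    h = [math.tanh(sum(w * xk for w, xk in zip(row, x))) for row in W1]
    y = [sum(w * hj for w, hj in zip(row, h)) for row in W2]
    return h, y

def backprop_JT_times(v, x, h, W2):
    """One backward pass: returns J_p^T v as a flat list.

    Weight order is [W1 row by row, then W2 row by row]."""
    n_h = len(h)
    # Push v back through W2, then through the tanh derivative (1 - h^2).
    delta = [(1.0 - h[j] ** 2) * sum(W2[i][j] * v[i] for i in range(len(v)))
             for j in range(n_h)]
    grad_W1 = [delta[j] * xk for j in range(n_h) for xk in x]
    grad_W2 = [vi * hj for vi in v for hj in h]
    return grad_W1 + grad_W2

def jacobian(x, h, W2, n_o):
    """Row i of J_p is J_p^T e_i, so one backward pass per output node."""
    rows = []
    for i in range(n_o):
        e_i = [1.0 if k == i else 0.0 for k in range(n_o)]
        rows.append(backprop_JT_times(e_i, x, h, W2))
    return rows

def gauss_newton_backprop(J, x, h, W2):
    """Column i of G_p is J_p^T times column i of J_p: one pass per column."""
    n_w = len(J[0])
    cols = [backprop_JT_times([row[i] for row in J], x, h, W2)
            for i in range(n_w)]
    # cols[i][r] holds G_p[r][i]; transpose into row-major form.
    return [[cols[i][r] for i in range(n_w)] for r in range(n_w)]

def gauss_newton_direct(J):
    """Reference result: explicit matrix product J_p^T J_p."""
    n_w = len(J[0])
    return [[sum(row[r] * row[c] for row in J) for c in range(n_w)]
            for r in range(n_w)]

# Small demonstration with arbitrary sizes and random weights.
random.seed(0)
n_in, n_h, n_o = 3, 4, 2
W1 = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_h)]
W2 = [[random.uniform(-1, 1) for _ in range(n_h)] for _ in range(n_o)]
x = [random.uniform(-1, 1) for _ in range(n_in)]

h, y = forward(x, W1, W2)
J = jacobian(x, h, W2, n_o)
G_fast = gauss_newton_backprop(J, x, h, W2)
G_direct = gauss_newton_direct(J)
max_diff = max(abs(a - b)
               for ra, rb in zip(G_fast, G_direct)
               for a, b in zip(ra, rb))
print("n_w =", len(J[0]), "max |G_fast - G_direct| =", max_diff)
```

The two results should agree up to rounding error. For these toy sizes ($n_o = 2$, $n_w = 20$), the flop counts stated in the text give $2 n_o n_w^2 = 1600$ per pattern for the direct product against $2 n_w + 3 n_o n_w + 3 n_w^2 = 40 + 120 + 1200 = 1360$ for the backpropagation route, so the gap is small here and widens as $n_o$ grows. The sketch is meant to show the identity being exploited, not to reproduce those operation counts in interpreted Python.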
Since usually $n_o \le \sqrt{n_w}$, the most significant term of this total is $3 n_p n_w^2$ flops.

3 Discussion

Since the work of Schraudolph (2002) allows fast multiplication of the $G$ matrix by an arbitrary column vector, in time $7 n_p n_w$ flops, it would be trivial to extend that work to form the full $G$ matrix column by column. This would give an asymptotically equivalent algorithm to ours, but in a slower absolute flop count of $7 n_p n_w^2$.

The calculation time of the direct multiplication method and our method could both be halved further by exploiting the symmetry of $G$.

Our calculations indicate that while Strassen multiplication (Huss-Lederman, Jacobson, Tsao, Turnbull, & Johnson, 1996) is not useful in calculating $G_p$ for a single pattern, it does confer an asymptotic advantage when calculating $G$ for all patterns in a single outer product. However, doing so is memory intensive and significantly more complicated to implement than our method.

We have not considered hardware acceleration and caching issues, both of which would likely favor conventional matrix multiplication over our method.

4 Conclusion

We have presented a way to use backpropagation to reduce the time taken to calculate the Gauss-Newton matrix in Levenberg-Marquardt down by a factor proportional to $n_o$. This reduces the critical time step in implementing the LM algorithm and so could be a useful tool to optimize any LM implementation where $n_o \gg 1$.

Acknowledgments

We are grateful to the anonymous reviewers for their suggestions for this note.

References

Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.

Huss-Lederman, S., Jacobson, E. M., Tsao, A., Turnbull, T., & Johnson, J. R. (1996). Implementation of Strassen's algorithm for matrix multiplication. In Proceedings of the 1996 ACM/IEEE Conference on Supercomputing (CDROM), Supercomputing '96. Washington, DC: IEEE Computer Society.

Levenberg, K. (1944). A method for the solution of certain non-linear problems in least squares. Quart. Appl. Math., 2, 164–168.

Marquardt, D. W. (1963). An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11, 431–441.

Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning representations by back-propagating errors. Nature, 6088, 533–536.

Schraudolph, N. N. (2002). Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7), 1723–1738.

Suri, N.N.R.R., Deodhare, D., & Nagabhushan, P. (2002). Parallel Levenberg-Marquardt-based neural network training on Linux clusters - a case study. In Linux Clusters, ICVGIP 2002, 3rd Indian Conference on Computer Vision, Graphics and Image Processing. N.P.: Allied Publishers.

Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Unpublished doctoral dissertation, Harvard University.

Received February 25, 2011; accepted September 16, 2011.


Efficient calculation of the Gauss-Newton approximation of the Hessian matrix in neural networks.

http://repository.essex.ac.uk/21301/8/Efficient%20calculation.pdf
