We describe four algorithms for neural network training, each adapted to different scalability constraints. These algorithms are mathematically principled and invariant under a number of transformations in data and network representation, so that performance is independent of these choices. They are derived in the setting of differential geometry, and are based either on the natural gradient using the Fisher information matrix, or on Hessian methods, scaled down in a specific way to allow for scalability while keeping some of their key mathematical properties.

The most standard way to train neural networks, backpropagation, has several known shortcomings. Convergence can be quite slow. Backpropagation is sensitive to data representation: for instance, even an operation as simple as exchanging 0’s and 1’s on the input layer affects performance (Figure 1), because this amounts to changing the parameters (weights and biases) in a non-trivial way, resulting in different gradient directions in parameter space, and in better performance with 1’s than with 0’s. (In the related context of restricted Boltzmann machines, it has been found that the standard training technique by gradient ascent favors setting hidden units to 1, for very much the same reason [AAHO11, Section 5].) This specific phenomenon disappears if the hyperbolic tangent is used as the activation function instead of the logistic function. Scaling also has an effect on performance: for instance, a common recommendation [LBOM96] is to use 1.7159 tanh(