Since the recognition in the early nineties of the vanishing/exploding (V/E)
gradient issue plaguing the training of neural networks (NNs), significant
effort has been devoted to overcoming this obstacle. However, a clear solution
to the V/E issue has so far remained elusive. In this manuscript a new NN
architecture is proposed, designed to mathematically prevent the V/E issue
from occurring.
The pursuit of approximate dynamical isometry, i.e. parameter configurations
where the singular values of the input-output Jacobian are tightly distributed
around 1, leads to the derivation of an NN architecture that shares common
traits with the popular Residual Network model. Instead of skip connections
between layers, the idea is to filter the previous layer's activations
orthogonally and add them to the nonlinear activations of the next layer,
realising a convex combination of the two.
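As a rough illustration of this update rule, the following PyTorch sketch
builds one layer as a convex combination of an orthogonally filtered previous
activation and a new nonlinear activation; the class name ConvexOrthoLayer,
the mixing coefficient alpha, the tanh nonlinearity and the use of PyTorch's
orthogonal parametrization are illustrative assumptions, not details taken
from the paper.

```python
import torch
import torch.nn as nn


class ConvexOrthoLayer(nn.Module):
    """One layer mixing an orthogonally filtered copy of the previous
    activation with the new nonlinear activation via a convex combination.
    Names (Q, W, alpha) are illustrative, not taken from the paper."""

    def __init__(self, dim: int, alpha: float = 0.5):
        super().__init__()
        assert 0.0 < alpha < 1.0           # convex-combination coefficient
        self.alpha = alpha
        self.W = nn.Linear(dim, dim)       # ordinary affine transformation
        # keep Q orthogonal throughout training via PyTorch's parametrization
        self.Q = nn.utils.parametrizations.orthogonal(
            nn.Linear(dim, dim, bias=False)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # convex combination of the filtered previous activation and the
        # nonlinear activation of the current layer
        return self.alpha * self.Q(h) + (1.0 - self.alpha) * torch.tanh(self.W(h))


# a deep stack of such layers, loosely analogous to the very deep MLPs below
depth, width = 100, 64
net = nn.Sequential(*[ConvexOrthoLayer(width) for _ in range(depth)])
out = net(torch.randn(8, width))           # forward pass on a random batch
```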
Remarkably, it is shown with analytical bounds, which hold even in the
infinite-depth case, that the gradient updates can neither vanish nor explode.
The effectiveness of this method is demonstrated empirically by training, via
backpropagation, an extremely deep multilayer perceptron of 50k layers, and an
Elman NN that learns long-term dependencies in inputs lying 10k time steps in
the past. Compared with other architectures specifically devised to deal with
the V/E problem, e.g. LSTMs for recurrent NNs, the proposed model is
considerably simpler yet more effective.
Surprisingly, a single-layer vanilla RNN can be enhanced to reach
state-of-the-art performance while converging remarkably fast; for instance, on
the psMNIST task it is possible to reach a test accuracy of over 94% in the
first epoch, and over 98% after just 10 epochs.