Flat Minima in Linear Estimation and an Extended Gauss Markov Theorem
We consider the problem of linear estimation and establish an extension of
the Gauss-Markov theorem in which the bias operator is allowed to be non-zero
but bounded with respect to a matrix norm of Schatten type. We derive simple
and explicit formulas for the optimal estimator in the cases of the nuclear and
spectral norms (with the Frobenius case recovering ridge regression).
Additionally, we analytically derive the generalization error in multiple
random matrix ensembles and compare it with that of ridge regression. Finally,
we conduct an extensive simulation study in which we show that the
cross-validated nuclear- and spectral-norm regressors can outperform ridge
regression in several circumstances.
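As a concrete point of reference for the Frobenius case mentioned above, here is a minimal sketch of the closed-form ridge estimator that the abstract says is recovered in that case; the synthetic data and the regularization strength are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Ridge regression in closed form: the estimator recovered in the
# Frobenius (Schatten-2) case of the extended Gauss-Markov setting.
# The data and the penalty lam below are illustrative assumptions.
rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.standard_normal((n, d))
beta_true = rng.standard_normal(d)
y = X @ beta_true + 0.5 * rng.standard_normal(n)

lam = 1.0  # assumed penalty; in practice chosen by cross-validation
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print("training MSE:", np.mean((X @ beta_ridge - y) ** 2))
```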
Feature Learning and Signal Propagation in Deep Neural Networks
Recent work by Baratin et al. (2021) sheds light on an intriguing pattern
that occurs during the training of deep neural networks: some layers align much
more with the data than others (where the alignment is defined as the Euclidean
product of the tangent features matrix and the data labels matrix). The
alignment, viewed as a function of layer index, generally exhibits an
ascent-descent pattern whose maximum is reached at some hidden layer. In this
work, we provide the first explanation for this phenomenon. We introduce the
Equilibrium Hypothesis, which connects this alignment pattern to signal
propagation in deep neural networks. Our experiments demonstrate an excellent
match with the theoretical predictions.
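As a rough illustration of the quantity being measured, the hedged sketch below computes, for each layer of a small network, the normalized Frobenius inner product between that layer's tangent-feature Gram matrix and the label Gram matrix y y^T; the architecture, data, and exact normalization are illustrative assumptions rather than the precise definition used by Baratin et al. (2021).

```python
import torch
import torch.nn as nn

# Hedged sketch of layer-wise tangent-feature alignment: for each layer,
# build the tangent feature matrix (Jacobian of the scalar output w.r.t.
# that layer's parameters) and compare its Gram matrix with y y^T.
torch.manual_seed(0)
n, d = 64, 20
X = torch.randn(n, d)
y = torch.randn(n, 1)

layers = [nn.Linear(d, 32), nn.Linear(32, 32), nn.Linear(32, 1)]
model = nn.Sequential(layers[0], nn.Tanh(), layers[1], nn.Tanh(), layers[2])

def layer_alignment(layer):
    # Tangent features: one row per example, one column per parameter of `layer`.
    rows = []
    for i in range(n):
        model.zero_grad()
        out = model(X[i : i + 1]).squeeze()
        out.backward()
        g = torch.cat([p.grad.flatten() for p in layer.parameters()])
        rows.append(g.clone())
    phi = torch.stack(rows)   # n x p_l tangent feature matrix
    K = phi @ phi.T           # layer-l tangent kernel
    Y = y @ y.T               # label Gram matrix
    return ((K * Y).sum() / (K.norm() * Y.norm())).item()

for idx, layer in enumerate(layers):
    print(f"layer {idx}: alignment = {layer_alignment(layer):.3f}")
```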
Almost Sure Convergence of Dropout Algorithms for Neural Networks
We investigate the convergence and convergence rate of stochastic training
algorithms for Neural Networks (NNs) that, over the years, have spawned from
Dropout (Hinton et al., 2012). Modeling the fact that neurons in the brain may
fail to fire, dropout algorithms consist, in practice, of multiplying the
weight matrices of an NN component-wise by independently drawn random matrices
with {0, 1}-valued entries during each iteration of the
feedforward-backpropagation algorithm.
This paper presents a probability-theoretic proof that, for any NN topology
and differentiable, polynomially bounded activation functions, if we project the
NN's weights into a compact set and use a dropout algorithm, then the weights
converge to a unique stationary set of a projected system of Ordinary
Differential Equations (ODEs). We also establish an upper bound on the rate of
convergence of Gradient Descent (GD) on the limiting ODEs of dropout algorithms
for arborescences (a class of trees) of arbitrary depth and with linear
activation functions.
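To make the iteration described above concrete, here is a minimal sketch of a dropout training step on a tiny two-layer network with linear activations: each weight matrix is multiplied component-wise by an independently drawn {0, 1}-valued mask, a gradient step is taken, and the weights are projected back into a compact set (here a box, via clipping). The network size, keep-probability, step size, and box radius are illustrative assumptions, not values from the paper.

```python
import numpy as np

# One dropout iteration per the description above: mask the weights,
# run feedforward-backpropagation, take a gradient step, then project
# the weights onto a compact set (a box, via clipping).
rng = np.random.default_rng(0)
d, h = 10, 16
W1, W2 = rng.standard_normal((h, d)) * 0.1, rng.standard_normal((1, h)) * 0.1
p, lr, radius = 0.8, 1e-2, 5.0   # keep-probability, step size, projection box

def dropout_step(x, y, W1, W2):
    # Draw independent {0, 1} masks and apply them component-wise to the weights.
    M1 = (rng.random(W1.shape) < p).astype(float)
    M2 = (rng.random(W2.shape) < p).astype(float)
    W1d, W2d = W1 * M1, W2 * M2

    # Forward pass; linear activations keep the sketch short.
    hdn = W1d @ x
    out = W2d @ hdn
    err = out - y

    # Backward pass: gradients flow only through the kept (unmasked) weights.
    gW2 = np.outer(err, hdn) * M2
    gW1 = np.outer(W2d.T @ err, x) * M1

    # Gradient step followed by projection onto the compact set.
    W1 = np.clip(W1 - lr * gW1, -radius, radius)
    W2 = np.clip(W2 - lr * gW2, -radius, radius)
    return W1, W2

x, y = rng.standard_normal(d), np.array([1.0])
for _ in range(100):
    W1, W2 = dropout_step(x, y, W1, W2)
```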