Elimination of All Bad Local Minima in Deep Learning
In this paper, we theoretically prove that adding one special neuron per
output unit eliminates all suboptimal local minima of any deep neural network,
for multi-class classification, binary classification, and regression with an
arbitrary loss function, under practical assumptions. At every local minimum of
any deep neural network with these added neurons, the parameters of the
original neural network (without the added neurons) are guaranteed to form a global
minimum of the original neural network. The effects of the added neurons are
proven to automatically vanish at every local minimum. Moreover, we provide a
novel theoretical characterization of a failure mode of eliminating suboptimal
local minima via an additional theorem and several examples. This paper also
introduces a novel proof technique based on the perturbable gradient basis
(PGB) necessary condition of local minima, which provides new insight into the
elimination of local minima and is applicable to the analysis of various models and
transformations of objective functions beyond the elimination of local minima.
Comment: Accepted to appear in AISTATS 2020
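
Below is a minimal sketch of the kind of construction the abstract describes. The exponential form of the added unit, a_k * exp(w_k @ x + c_k), and the l2 penalty on its coefficients are assumptions modeled on related constructions, not the paper's exact definition.

```python
import numpy as np

def base_network(x, theta):
    """Placeholder for an arbitrary deep network f(x; theta) -> (n_outputs,)."""
    W1, W2 = theta
    return W2 @ np.maximum(W1 @ x, 0.0)  # toy two-layer ReLU net

def augmented_output(x, theta, a, W, c):
    """Original outputs plus one assumed exponential neuron per output unit:
    f_k(x) + a_k * exp(w_k @ x + c_k)."""
    return base_network(x, theta) + a * np.exp(W @ x + c)

def augmented_loss(X, Y, theta, a, W, c, loss_fn, lam=1e-2):
    """Original loss evaluated on the augmented outputs, plus a penalty on the
    coefficients a; the added neurons' effect vanishes where a is driven to 0."""
    data_term = sum(loss_fn(augmented_output(x, theta, a, W, c), y)
                    for x, y in zip(X, Y))
    return data_term + lam * np.sum(a ** 2)
```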
Boundary between noise and information applied to filtering neural network weight matrices
Deep neural networks have been successfully applied to a broad range of
problems where overparametrization yields weight matrices which are partially
random. A comparison of weight matrix singular vectors to the Porter-Thomas
distribution suggests that there is a boundary between randomness and learned
information in the singular value spectrum. Inspired by this finding, we
introduce an algorithm for noise filtering, which both removes small singular
values and reduces the magnitude of large singular values to counteract the
effect of level repulsion between the noise and the information part of the
spectrum. For networks trained in the presence of label noise, we indeed find
that the generalization performance improves significantly due to noise
filtering.
Comment: 6 pages, 5 figures
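
A minimal sketch of such a filter is given below, assuming i.i.d. Gaussian noise in the weight entries. The bulk-edge cutoff and the uniform shrinkage factor are illustrative choices; the paper's actual boundary is derived from the comparison with the Porter-Thomas distribution.

```python
import numpy as np

def filter_weights(W, noise_std, shrink=0.9):
    """Zero out singular values inside the random bulk and mildly shrink the
    large, informative ones to counteract level repulsion with the noise part."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    n, m = W.shape
    # Largest singular value of an n x m matrix with i.i.d. entries of std
    # noise_std is approximately noise_std * (sqrt(n) + sqrt(m)) (Bai-Yin).
    cutoff = noise_std * (np.sqrt(n) + np.sqrt(m))
    s_filtered = np.where(s > cutoff, shrink * s, 0.0)
    return U @ np.diag(s_filtered) @ Vt
```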
The Multilinear Structure of ReLU Networks
We study the loss surface of neural networks equipped with a hinge loss criterion and ReLU or leaky ReLU nonlinearities. Any such network defines a piecewise multilinear form in parameter space. By appealing to harmonic analysis we show that all local minima of such a network are non-differentiable, except for those minima that occur in a region of parameter space where the loss surface is perfectly flat. Non-differentiable minima are therefore not technicalities or pathologies; they lie at the heart of the problem when investigating the loss surface of ReLU networks. As a consequence, we must employ techniques from nonsmooth analysis to study these loss surfaces. We show how to apply these techniques in some illustrative cases.
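
The piecewise multilinear structure can be checked numerically: freezing all parameters but one, the hinge loss of a small ReLU network is piecewise affine in that single weight. The sketch below is an illustrative check, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))            # toy inputs
y = rng.choice([-1.0, 1.0], size=20)    # binary labels
W1 = rng.normal(size=(8, 5))            # hidden-layer weights
w2 = rng.normal(size=8)                 # output weights

def loss_along_one_weight(t):
    """Hinge loss as a function of the single weight W1[0, 0]."""
    W = W1.copy()
    W[0, 0] = t
    scores = np.maximum(X @ W.T, 0.0) @ w2   # ReLU hidden layer, linear output
    return np.mean(np.maximum(0.0, 1.0 - y * scores))

ts = np.linspace(-3.0, 3.0, 601)
losses = np.array([loss_along_one_weight(t) for t in ts])
# On an affine segment the second difference vanishes; kinks (from the ReLUs
# and the hinge) occur at only finitely many points along the line.
second_diff = np.abs(np.diff(losses, 2))
print("fraction of grid points where the loss is locally affine:",
      np.mean(second_diff < 1e-9))
```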