How degenerate is the parametrization of neural networks with the ReLU activation function?
Neural network training is usually accomplished by solving a non-convex
optimization problem using stochastic gradient descent. Although one optimizes
over the network's parameters, the main loss function generally only depends on
the realization of the neural network, i.e. the function it computes. Studying
the optimization problem over the space of realizations opens up new ways to
understand neural network training. In particular, usual loss functions like
mean squared error and categorical cross entropy are convex on spaces of neural
network realizations, which themselves are non-convex. Approximation
capabilities of neural networks can be used to deal with the latter
non-convexity, which allows us to establish that for sufficiently large
networks local minima of a regularized optimization problem on the realization
space are almost optimal. Note, however, that each realization has many
different, possibly degenerate, parametrizations. In particular, a local
minimum in the parametrization space need not correspond to a local minimum in
the realization space. To establish such a connection, inverse stability of the
realization map is required, meaning that proximity of realizations must imply
proximity of corresponding parametrizations. We present pathologies which
prevent inverse stability in general, and, for shallow networks, proceed to
establish a restricted space of parametrizations on which we have inverse
stability w.r.t. a Sobolev norm. Furthermore, we show that by optimizing
over such restricted sets, it is still possible to learn any function which can
be learned by optimization over unrestricted sets.
Comment: Accepted at NeurIPS 201
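The degeneracy discussed in this abstract can be illustrated with the positive-rescaling invariance of ReLU: scaling a hidden unit's incoming weights and bias by c > 0 and its outgoing weight by 1/c leaves the realized function unchanged, so one realization has a continuum of parametrizations. A minimal sketch (network shape and values are illustrative, not from the paper):

```python
import numpy as np

def relu_net(x, W1, b1, w2):
    """One-hidden-layer ReLU network on scalar input: f(x) = w2 . relu(W1*x + b1)."""
    return w2 @ np.maximum(W1 * x + b1, 0.0)

rng = np.random.default_rng(0)
W1, b1, w2 = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)

# Since relu(c*z) = c*relu(z) for c > 0, the rescaled parameters
# (c*W1, c*b1, w2/c) define exactly the same realization.
c = 3.0
for x in rng.normal(size=10):
    assert np.isclose(relu_net(x, W1, b1, w2),
                      relu_net(x, c * W1, c * b1, w2 / c))
```

Any such rescaling moves the parameters a positive distance apart while keeping the realizations identical, which is one of the obstructions to inverse stability mentioned above.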
Approximation results for Gradient Descent trained Shallow Neural Networks in
Two aspects of neural networks that have been extensively studied in the
recent literature are their function approximation properties and their
training by gradient descent methods. The approximation problem seeks accurate
approximations with a minimal number of weights. In most of the current
literature these weights are fully or partially hand-crafted, showing the
capabilities of neural networks but not necessarily their practical
performance. In contrast, optimization theory for neural networks heavily
relies on an abundance of weights in over-parametrized regimes.
This paper balances these two demands and provides an approximation result
for shallow networks with non-convex weight optimization by gradient
descent. We consider finite width networks and infinite sample limits, which is
the typical setup in approximation theory. Technically, this problem is not
over-parametrized; however, some form of redundancy reappears as a loss in
approximation rate compared to the best possible rates.
Adaptive Normalized Risk-Averting Training For Deep Neural Networks
This paper proposes a set of new error criteria and learning approaches,
Adaptive Normalized Risk-Averting Training (ANRAT), to attack the non-convex
optimization problem in training deep neural networks (DNNs). Theoretically, we
demonstrate its effectiveness on global and local convexity lower-bounded by
the standard norm error. By analyzing the gradient with respect to the
convexity index, we explain why learning it adaptively by gradient descent
works. In practice, we show how this method improves training
of deep neural networks to solve visual recognition tasks on the MNIST and
CIFAR-10 datasets. Without using pretraining or other tricks, we obtain results
comparable or superior to those reported in recent literature on the same tasks
using standard ConvNets + MSE/cross entropy. Performance on deep/shallow
multilayer perceptrons and Denoising Auto-encoders is also explored. ANRAT can
be combined with other quasi-Newton training methods, innovative network
variants, regularization techniques and other specific tricks in DNNs. Other
than unsupervised pretraining, it provides a new perspective to address the
non-convex optimization problem in DNNs.
Comment: AAAI 2016, 0.39%~0.4% ER on MNIST with single 32-32-256-10 ConvNets,
code available at https://github.com/cauchyturing/ANRA
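For intuition about risk-averting criteria of the kind ANRAT builds on, the sketch below assumes the classical normalized risk-averting error, NRAE_λ = (1/λ) log((1/N) Σ_i exp(λ e_i²)); the exact ANRAT criterion and its adaptive update differ in details, so treat this only as an illustration of how the convexity index λ interpolates between MSE and worst-case error:

```python
import numpy as np

def nrae(errors, lam):
    """Normalized risk-averting error: (1/lam) * log(mean(exp(lam * e^2))).
    Uses a log-sum-exp shift so large lam does not overflow."""
    z = lam * errors ** 2
    m = z.max()
    return (m + np.log(np.mean(np.exp(z - m)))) / lam

e = np.array([0.3, -0.7, 1.4, 0.1])
mse = np.mean(e ** 2)

# As lam -> 0 the criterion reduces to plain MSE ...
assert abs(nrae(e, 1e-7) - mse) < 1e-6
# ... and as lam grows it approaches the worst-case squared error.
assert abs(nrae(e, 200.0) - np.max(e ** 2)) < 0.05
```

Small λ recovers the usual (possibly non-convex) mean squared error, while large λ convexifies the landscape around the worst residual, which is why adapting λ during training can help escape poor local minima.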