14,537 research outputs found
BPGrad: Towards Global Optimality in Deep Learning via Branch and Pruning
Understanding global optimality in deep learning (DL) has been attracting
increasing attention recently. Conventional DL solvers, however, were not
designed to seek such global optimality. In this paper we propose a novel
approximation algorithm, BPGrad, for optimizing deep models globally via
branch and pruning. Our BPGrad algorithm is based on the assumption of
Lipschitz continuity in DL; as a result, it can adaptively determine the step
size for the current gradient given the history of previous updates, such that
theoretically no smaller step can reach the global optimum. We prove that, by
repeating this branch-and-pruning procedure, the global optimum can be located
within finitely many iterations. Empirically, we also propose an efficient
BPGrad-based solver for DL, which outperforms conventional DL solvers such as
Adagrad, Adadelta, RMSProp, and Adam on object recognition, detection, and
segmentation tasks.
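
As an illustration of the kind of Lipschitz-informed step-size rule the abstract describes, the sketch below chooses the step from the gap between the current loss and an estimated global lower bound. The function names, the lower-bound estimate, and the specific formula are assumptions made for illustration, not the authors' exact BPGrad update.

```python
import numpy as np

def lipschitz_informed_step(f, grad_f, x, L, f_lower_estimate):
    """One illustrative gradient step with a Lipschitz-informed step size.

    Hypothetical sketch: under an L-Lipschitz assumption on f, the gap between
    the current loss and an estimated global lower bound limits how far the
    minimizer can be, so the step size is derived from that gap. This is not
    the authors' exact BPGrad update rule.
    """
    g = grad_f(x)
    gap = f(x) - f_lower_estimate                 # remaining room to descend
    eta = gap / (L * np.linalg.norm(g) + 1e-12)   # Lipschitz-informed step size
    return x - eta * g

# Toy usage on a simple quadratic (assumed example, not from the paper)
f = lambda x: np.sum(x ** 2)
grad_f = lambda x: 2 * x
x = np.array([3.0, -1.5])
for _ in range(20):
    x = lipschitz_informed_step(f, grad_f, x, L=10.0, f_lower_estimate=0.0)
print(x, f(x))  # converges toward the global minimum at the origin
```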
On the Analysis of Trajectories of Gradient Descent in the Optimization of Deep Neural Networks
Theoretical analysis of the error landscape of deep neural networks has
garnered significant interest in recent years. In this work, we theoretically
study the importance of noise in the trajectories of gradient descent towards
optimal solutions in multi-layer neural networks. We show that adding noise (in
various ways) to a neural network during training increases the rank of the
product of the weight matrices of a multi-layer linear neural network. We then
study how adding noise can assist in reaching a global optimum when the product
matrix is full-rank (under certain conditions). We establish theoretical
connections between the noise injected into the neural network, whether into
the gradient, the architecture, or the input/output, and the rank of the
product of the weight matrices. We corroborate our theoretical findings with
empirical results.
Comment: 4 pages + 1 figure (main, excluding references), 5 pages + 4 figures (appendix)
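
To make the rank claim concrete, here is a minimal sketch (the layer shapes, noise scale, and setup are assumed for illustration, not the paper's experiment) showing that perturbing the weights of a deep linear network with small Gaussian noise typically raises the rank of the product of its weight matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two rank-deficient 5x5 layers of a linear network; their product is also rank-deficient.
W1 = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 5))
W2 = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 5))
print("rank without noise:", np.linalg.matrix_rank(W2 @ W1))   # 2

# Perturbing each layer with small Gaussian noise generically makes the
# product full-rank, illustrating the rank-raising effect of noise injection.
sigma = 1e-3
W1_noisy = W1 + sigma * rng.standard_normal(W1.shape)
W2_noisy = W2 + sigma * rng.standard_normal(W2.shape)
print("rank with noise:", np.linalg.matrix_rank(W2_noisy @ W1_noisy))  # typically 5
```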
Every Local Minimum Value is the Global Minimum Value of Induced Model in Non-convex Machine Learning
For nonconvex optimization in machine learning, this article proves that
every local minimum achieves the globally optimal value of the perturbable
gradient basis model at any differentiable point. As a result, in terms of the
loss at differentiable local minima, nonconvex machine learning is
theoretically as well supported as convex machine learning with a handcrafted
basis, except when a preference is given to the handcrafted basis over the
perturbable gradient basis. The proofs of these results are derived under mild
assumptions. Accordingly, the proven results are directly applicable to many
machine learning models, including practical deep neural networks, without any
modification of practical methods. Furthermore, as special cases of our general
results, this article improves on or complements several state-of-the-art
theoretical results on deep neural networks, deep residual networks, and
overparameterized deep neural networks with a unified proof technique and novel
geometric insights. A special case of our results also contributes to the
theoretical foundation of representation learning.
Comment: Neural Computation, MIT Press