Global Convergence of Frank Wolfe on One Hidden Layer Networks
We derive global convergence bounds for the Frank Wolfe algorithm when
training one hidden layer neural networks. When using the ReLU activation
function, and under tractable preconditioning assumptions on the sample data
set, the linear minimization oracle used to incrementally form the solution can
be solved explicitly as a second order cone program. The classical Frank Wolfe
algorithm then converges with rate O(1/T), where T is both the number of
neurons and the number of calls to the oracle.
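A minimal, generic Frank-Wolfe loop in Python (NumPy) illustrating the structure described above: the only problem-specific component is the linear minimization oracle called once per iteration. The quadratic objective and the l1-ball oracle below are illustrative stand-ins, not the paper's second-order cone program; `lmo_l1_ball`, `A`, and `b` are assumptions made for the sketch.

```python
import numpy as np

def lmo_l1_ball(grad, radius=1.0):
    # Linear minimization oracle over the l1 ball:
    # argmin_{||s||_1 <= radius} <grad, s> is attained at a signed vertex.
    i = np.argmax(np.abs(grad))
    s = np.zeros_like(grad)
    s[i] = -radius * np.sign(grad[i])
    return s

def frank_wolfe(grad_f, lmo, x0, T=200):
    # Classical Frank-Wolfe with step size 2/(t+2), which yields the O(1/T) rate.
    x = x0.copy()
    for t in range(T):
        s = lmo(grad_f(x))                 # one oracle call per iteration
        gamma = 2.0 / (t + 2.0)
        x = (1 - gamma) * x + gamma * s    # convex combination keeps x feasible
    return x

# Toy least-squares objective f(x) = 0.5 * ||A x - b||^2 (illustrative only).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
grad_f = lambda x: A.T @ (A @ x - b)
x_hat = frank_wolfe(grad_f, lmo_l1_ball, x0=np.zeros(5))
```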
Path Regularization: A Convexity and Sparsity Inducing Regularization for Parallel ReLU Networks
Understanding the fundamental principles behind the success of deep neural
networks is one of the most important open questions in the current literature.
To this end, we study the training problem of deep neural networks and
introduce an analytic approach to unveil hidden convexity in the optimization
landscape. We consider a deep parallel ReLU network architecture, which also
includes standard deep networks and ResNets as its special cases. We then show
that pathwise regularized training problems can be represented as an exact
convex optimization problem. We further prove that the equivalent convex
problem is regularized via a group sparsity inducing norm. Thus, a path
regularized parallel ReLU network can be viewed as a parsimonious convex model
in high dimensions. More importantly, we show that the computational complexity
required to globally optimize the equivalent convex problem is fully
polynomial-time in feature dimension and number of samples. Therefore, we prove
polynomial-time trainability of path regularized ReLU networks with global
optimality guarantees. We also provide several numerical experiments
corroborating our theory.
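The group-sparsity-inducing norm mentioned above is, in the usual convex-ML sense, a sum of per-group Euclidean norms, whose proximal operator is block soft-thresholding. The sketch below shows that operator in NumPy as a generic illustration of how such a regularizer zeroes out entire groups; it is not the paper's exact convex reformulation, and the group structure and values are assumptions.

```python
import numpy as np

def group_soft_threshold(v, groups, lam):
    # Proximal operator of lam * sum_g ||v_g||_2 (a group-sparsity norm):
    # each group is shrunk toward zero and set exactly to zero when its
    # Euclidean norm falls below lam, which is what induces group sparsity.
    out = np.zeros_like(v)
    for g in groups:
        norm_g = np.linalg.norm(v[g])
        if norm_g > lam:
            out[g] = (1.0 - lam / norm_g) * v[g]
    return out

# Illustrative usage: three groups of weights, one of which is driven to zero.
v = np.array([0.9, -1.2, 0.05, 0.02, 2.0, 1.5])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
print(group_soft_threshold(v, groups, lam=0.3))
```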
Theoretical Deep Learning
Deep learning has long been criticised as a black-box model for lacking sound theoretical explanation. During my PhD, I explore and establish theoretical foundations for deep learning. In this thesis, I present my contributions, positioned upon the existing literature: (1) analysing the generalizability of neural networks with residual connections via complexity- and capacity-based hypothesis complexity measures; (2) modeling stochastic gradient descent (SGD) by stochastic differential equations (SDEs) and their dynamics, and further characterizing the generalizability of deep learning; (3) understanding the geometrical structures of the loss landscape that drive the trajectories of these dynamical systems, which sheds light on reconciling the over-representation and excellent generalizability of deep learning; and (4) discovering the interplay between generalization, privacy preservation, and adversarial robustness, which are of rising concern in deep learning deployment.
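Item (2) refers to the common practice of approximating SGD by an SDE of the form dθ = -∇L(θ) dt + sqrt(η) Σ(θ)^{1/2} dW. A minimal Euler-Maruyama simulation of dynamics of that kind is sketched below as a generic illustration only; the quadratic loss, the constant isotropic noise, and all parameter names are assumptions, not the thesis's specific model.

```python
import numpy as np

def simulate_sgd_sde(grad_L, theta0, eta=0.1, sigma=0.05, steps=1000, dt=0.01, seed=0):
    # Euler-Maruyama discretization of d(theta) = -grad_L(theta) dt + sqrt(eta)*sigma dW,
    # a common continuous-time surrogate for SGD with learning rate eta and an
    # (here, isotropic and constant) gradient-noise scale sigma.
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    path = [theta.copy()]
    for _ in range(steps):
        noise = rng.standard_normal(theta.shape)
        theta = theta - grad_L(theta) * dt + np.sqrt(eta * dt) * sigma * noise
        path.append(theta.copy())
    return np.array(path)

# Illustrative quadratic loss L(theta) = 0.5 * ||theta||^2, so grad_L(theta) = theta.
path = simulate_sgd_sde(grad_L=lambda th: th, theta0=[2.0, -1.0])
print(path[-1])  # the iterate fluctuates around the minimizer at the origin
```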