Linear Convergence of a Frank-Wolfe Type Algorithm over Trace-Norm Balls
We propose a rank-k variant of the classical Frank-Wolfe algorithm to solve
convex optimization over a trace-norm ball. Our algorithm replaces the top
singular-vector computation (1-SVD) in Frank-Wolfe with a top-k
singular-vector computation (k-SVD), which can be done by repeatedly applying
1-SVD k times. Alternatively, our algorithm can be viewed as a rank-k
restricted version of projected gradient descent. We show that our algorithm
has a linear convergence rate when the objective function is smooth and
strongly convex, and the optimal solution has rank at most k. This improves
the convergence rate and the total time complexity of the Frank-Wolfe method
and its variants.
Comment: In NIPS 2017
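The rank-k idea above can be sketched as a toy Frank-Wolfe-style loop whose linear-minimization step returns a rank-k atom of the trace-norm ball. This is a hypothetical illustration only (the paper's actual update rule and convergence analysis are more involved), and it uses a full SVD where the paper would use k applications of a 1-SVD routine:

```python
import numpy as np

def topk_svd(G, k):
    # Top-k singular vectors of G. The paper obtains these by applying
    # a 1-SVD routine k times; here we simply truncate a full SVD.
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U[:, :k], Vt[:k, :]

def rank_k_frank_wolfe(grad_f, shape, radius, k, steps=200):
    """Toy sketch: Frank-Wolfe over the trace-norm ball
    {X : ||X||_* <= radius}, with a rank-k atom per step.
    Not the paper's exact algorithm."""
    X = np.zeros(shape)
    for t in range(steps):
        G = grad_f(X)
        U, Vt = topk_svd(-G, k)           # direction minimizing <G, S>
        S = (radius / k) * (U @ Vt)       # rank-k atom with ||S||_* = radius
        eta = 2.0 / (t + 2)               # standard Frank-Wolfe step size
        X = (1 - eta) * X + eta * S       # convex combination stays feasible
    return X
```

Because every atom S lies in the trace-norm ball and the update is a convex combination, every iterate stays feasible; for a smooth objective such as f(X) = ½‖X − M‖² the loop decreases the objective toward the constrained optimum.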
Much Faster Algorithms for Matrix Scaling
We develop several efficient algorithms for the classical \emph{Matrix
Scaling} problem, which is used in many diverse areas, from preconditioning
linear systems to approximation of the permanent. On an input n × n
matrix A, this problem asks to find diagonal (scaling) matrices X and Y
(if they exist), so that XAY ε-approximates a doubly
stochastic matrix, or more generally a matrix with prescribed row and column sums.
We address the general scaling problem as well as some important special
cases. In particular, if A has m nonzero entries, and if there exist X
and Y with polynomially large entries such that XAY is doubly stochastic,
then we can solve the problem in total complexity Õ(m + n^{4/3}).
This greatly improves on the best known previous results, which were either
Õ(n^4) or O(m n^{1/2}/ε).
Our algorithms are based on tailor-made first- and second-order techniques,
combined with other recent advances in continuous optimization, which may be of
independent interest for solving similar problems.
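The paper's tailor-made second-order methods are beyond a short snippet, but the classical first-order approach to matrix scaling — Sinkhorn's alternating row/column normalization — is easy to sketch. This is the simple baseline, not the paper's algorithm:

```python
import numpy as np

def sinkhorn_scale(A, iters=500):
    """Classical Sinkhorn iteration: alternately rescale rows and
    columns of a nonnegative matrix A so that X A Y approaches a
    doubly stochastic matrix. A toy baseline, not the paper's
    (second-order) algorithm."""
    n = A.shape[0]
    x = np.ones(n)                 # diagonal entries of X
    y = np.ones(n)                 # diagonal entries of Y
    for _ in range(iters):
        x = 1.0 / (A @ y)          # make every row of X A Y sum to 1
        y = 1.0 / (A.T @ x)        # make every column of X A Y sum to 1
    return np.diag(x), np.diag(y)
```

For a strictly positive matrix the iteration converges, which matches the "if scalings exist" caveat in the abstract: for general nonnegative matrices a doubly stochastic scaling may not exist.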
Backward Feature Correction: How Deep Learning Performs Deep Learning
How does a 110-layer ResNet learn a high-complexity classifier using
relatively few training examples and short training time? We present a theory
towards explaining this in terms of hierarchical learning. By hierarchical
learning we mean that the learner represents a complicated target function by
decomposing it into a sequence of simpler functions, reducing sample and time
complexity. This paper formally analyzes how multi-layer neural
networks can perform such hierarchical learning efficiently and automatically
by applying SGD.
On the conceptual side, we present, to the best of our knowledge, the FIRST
theory result indicating how deep neural networks can be sample and time
efficient on certain hierarchical learning tasks, when NO KNOWN
non-hierarchical algorithms (such as kernel methods, linear regression over
feature mappings, tensor decomposition, sparse coding, and their simple
combinations) are efficient. We establish a principle called "backward feature
correction", where training higher layers in the network can improve the
features of lower-level ones. We believe this is the key to understanding the
deep learning process in multi-layer neural networks.
On the technical side, we show that for every input dimension d > 0, there is a
concept class consisting of degree-ω(1) multivariate polynomials so
that, using ω(1)-layer neural networks as learners, SGD can learn any
target function from this class in poly(d) time using poly(d)
samples to any inverse-polynomial error, through
learning to represent it as a composition of ω(1) layers of quadratic
functions. In contrast, we present lower bounds stating that several
non-hierarchical learners, including any kernel methods and neural tangent
kernels, must suffer from super-polynomial sample or time complexity to learn
this concept class even to a non-trivial error.
Comment: V2 adds more experiments, V3 polishes writing and improves
experiments, V4 makes minor fixes to the figure
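The kind of concept class described — a composition of quadratic functions — can be illustrated with a small sketch. This is a hypothetical construction, not the paper's exact one; the point it demonstrates is that composing L quadratic layers yields a polynomial of degree 2^L, which is where the high complexity of the target comes from:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_quadratic(dim):
    """A random homogeneous quadratic map h(z), where each output
    coordinate is z^T W_i z for a random matrix W_i."""
    W = rng.standard_normal((dim, dim, dim)) / dim
    return lambda z: np.einsum('oij,i,j->o', W, z, z)

def composed_target(d, layers):
    """Hypothetical target from a concept class of composed quadratics.
    Since each layer is homogeneous of degree 2, the composition is a
    polynomial of degree 2**layers in the input."""
    fs = [random_quadratic(d) for _ in range(layers)]
    def F(z):
        for h in fs:
            z = h(z)                # feed each layer's output forward
        return z
    return F
```

A quick sanity check on the degree: each layer satisfies h(c·z) = c²·h(z), so a 3-layer composition satisfies F(c·z) = c⁸·F(z), i.e. it behaves as a degree-8 polynomial under scaling.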