Linear Convergence of a Frank-Wolfe Type Algorithm over Trace-Norm Balls
We propose a rank-k variant of the classical Frank-Wolfe algorithm to solve
convex optimization over a trace-norm ball. Our algorithm replaces the top
singular-vector computation (1-SVD) in Frank-Wolfe with a top-k
singular-vector computation (k-SVD), which can be done by repeatedly applying
1-SVD k times. Alternatively, our algorithm can be viewed as a rank-k
restricted version of projected gradient descent. We show that our algorithm
has a linear convergence rate when the objective function is smooth and
strongly convex, and the optimal solution has rank at most k. This improves
the convergence rate and the total time complexity of the Frank-Wolfe method
and its variants.
Comment: In NIPS 2017
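The rank-k idea above can be sketched as a toy Frank-Wolfe-style loop whose linear-minimization step returns a rank-k atom of the trace-norm ball. This is a hypothetical illustration only (the paper's actual update rule and convergence analysis are more involved), and it uses a full SVD where the paper would use k applications of a 1-SVD routine:

```python
import numpy as np

def topk_svd(G, k):
    # Top-k singular vectors of G. The paper obtains these by applying
    # a 1-SVD routine k times; here we simply truncate a full SVD.
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U[:, :k], Vt[:k, :]

def rank_k_frank_wolfe(grad_f, shape, radius, k, steps=200):
    """Toy sketch: Frank-Wolfe over the trace-norm ball
    {X : ||X||_* <= radius}, with a rank-k atom per step.
    Not the paper's exact algorithm."""
    X = np.zeros(shape)
    for t in range(steps):
        G = grad_f(X)
        U, Vt = topk_svd(-G, k)           # direction minimizing <G, S>
        S = (radius / k) * (U @ Vt)       # rank-k atom with ||S||_* = radius
        eta = 2.0 / (t + 2)               # standard Frank-Wolfe step size
        X = (1 - eta) * X + eta * S       # convex combination stays feasible
    return X
```

Because every atom S lies in the trace-norm ball and the update is a convex combination, every iterate stays feasible; for a smooth objective such as f(X) = ½‖X − M‖² the loop decreases the objective toward the constrained optimum.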
Much Faster Algorithms for Matrix Scaling
We develop several efficient algorithms for the classical \emph{Matrix
Scaling} problem, which is used in many diverse areas, from preconditioning
linear systems to approximation of the permanent. On an input n × n
matrix A, this problem asks to find diagonal (scaling) matrices X and Y
(if they exist), so that XAY ε-approximates a doubly
stochastic matrix, or more generally a matrix with prescribed row and column sums.
We address the general scaling problem as well as some important special
cases. In particular, if A has m nonzero entries, and if there exist X
and Y with polynomially large entries such that XAY is doubly stochastic,
then we can solve the problem in total complexity Õ(m + n^{4/3}).
This greatly improves on the best known previous results, which were either
Õ(n^4) or O(m n^{1/2}/ε).
Our algorithms are based on tailor-made first- and second-order techniques,
combined with other recent advances in continuous optimization, which may be of
independent interest for solving similar problems.
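The paper's tailor-made second-order methods are beyond a short snippet, but the classical first-order approach to matrix scaling — Sinkhorn's alternating row/column normalization — is easy to sketch. This is the simple baseline, not the paper's algorithm:

```python
import numpy as np

def sinkhorn_scale(A, iters=500):
    """Classical Sinkhorn iteration: alternately rescale rows and
    columns of a nonnegative matrix A so that X A Y approaches a
    doubly stochastic matrix. A toy baseline, not the paper's
    (second-order) algorithm."""
    n = A.shape[0]
    x = np.ones(n)                 # diagonal entries of X
    y = np.ones(n)                 # diagonal entries of Y
    for _ in range(iters):
        x = 1.0 / (A @ y)          # make every row of X A Y sum to 1
        y = 1.0 / (A.T @ x)        # make every column of X A Y sum to 1
    return np.diag(x), np.diag(y)
```

For a strictly positive matrix the iteration converges, which matches the "if scalings exist" caveat in the abstract: for general nonnegative matrices a doubly stochastic scaling may not exist.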
Backward Feature Correction: How Deep Learning Performs Deep Learning
How does a 110-layer ResNet learn a high-complexity classifier using
relatively few training examples and short training time? We present a theory
towards explaining this in terms of hierarchical learning. By hierarchical
learning we mean that the learner represents a complicated target function by
decomposing it into a sequence of simpler functions, reducing sample and time
complexity. This paper formally analyzes how multi-layer neural
networks can perform such hierarchical learning efficiently and automatically
by applying SGD.
On the conceptual side, we present, to the best of our knowledge, the FIRST
theory result indicating how deep neural networks can be sample and time
efficient on certain hierarchical learning tasks, when NO KNOWN
non-hierarchical algorithms (such as kernel methods, linear regression over
feature mappings, tensor decomposition, sparse coding, and their simple
combinations) are efficient. We establish a principle called "backward feature
correction", where training higher layers in the network can improve the
features of lower-level ones. We believe this is the key to understanding the
deep learning process in multi-layer neural networks.
On the technical side, we show that for every input dimension d > 0, there is a
concept class consisting of degree-ω(1) multivariate polynomials so
that, using ω(1)-layer neural networks as learners, SGD can learn any
target function from this class in poly(d) time using poly(d)
samples to any inverse-polynomial error, through
learning to represent it as a composition of ω(1) layers of quadratic
functions. In contrast, we present lower bounds stating that several
non-hierarchical learners, including any kernel methods and neural tangent
kernels, must suffer from super-polynomial sample or time complexity to learn
this concept class even to a non-trivial error.
Comment: V2 adds more experiments, V3 polishes writing and improves
experiments, V4 makes minor fixes to the figure
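The kind of concept class described — a composition of quadratic functions — can be illustrated with a small sketch. This is a hypothetical construction, not the paper's exact one; the point it demonstrates is that composing L quadratic layers yields a polynomial of degree 2^L, which is where the high complexity of the target comes from:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_quadratic(dim):
    """A random homogeneous quadratic map h(z), where each output
    coordinate is z^T W_i z for a random matrix W_i."""
    W = rng.standard_normal((dim, dim, dim)) / dim
    return lambda z: np.einsum('oij,i,j->o', W, z, z)

def composed_target(d, layers):
    """Hypothetical target from a concept class of composed quadratics.
    Since each layer is homogeneous of degree 2, the composition is a
    polynomial of degree 2**layers in the input."""
    fs = [random_quadratic(d) for _ in range(layers)]
    def F(z):
        for h in fs:
            z = h(z)                # feed each layer's output forward
        return z
    return F
```

A quick sanity check on the degree: each layer satisfies h(c·z) = c²·h(z), so a 3-layer composition satisfies F(c·z) = c⁸·F(z), i.e. it behaves as a degree-8 polynomial under scaling.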