Approximate maximum entropy principles via Goemans-Williamson with applications to provable variational methods
The well-known maximum-entropy principle due to Jaynes, which states that
given mean parameters, the maximum entropy distribution matching them is in an
exponential family, has been very popular in machine learning due to its
"Occam's razor" interpretation. Unfortunately, calculating the potentials in
the maximum-entropy distribution is intractable \cite{bresler2014hardness}. We
provide computationally efficient versions of this principle when the mean
parameters are pairwise moments: we design distributions that approximately
match given pairwise moments, while having entropy which is comparable to the
maximum entropy distribution matching those moments.
We additionally provide surprising applications of the approximate maximum
entropy principle to designing provable variational methods for partition
function calculations for Ising models without any assumptions on the
potentials of the model. More precisely, we show that at every temperature, we
can get approximation guarantees for the log-partition function comparable to
those in the low-temperature limit, which is the setting of optimization of
quadratic forms over the hypercube \cite{alon2006approximating}.
Comment: 12 pages
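To make the principle concrete (the notation below is mine, not the paper's): for variables $x \in \{-1,1\}^n$ and target pairwise moments $M_{ij}$, Jaynes' principle asks for
\[
\max_{p}\; H(p) \quad \text{subject to} \quad \mathbb{E}_{x \sim p}[x_i x_j] = M_{ij} \;\; \text{for all } i < j,
\]
whose maximizer is an Ising-type exponential family $p_J(x) \propto \exp\big(\sum_{i<j} J_{ij} x_i x_j\big)$; computing the potentials $J_{ij}$ (and hence the partition function) is the intractable step that the approximate version above works around.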
Provable Alternating Gradient Descent for Non-negative Matrix Factorization with Strong Correlations
Non-negative matrix factorization is a basic tool for decomposing data into
the feature and weight matrices under non-negativity constraints, and in
practice is often solved in the alternating minimization framework. However, it
is unclear whether such algorithms can recover the ground-truth feature matrix
when the weights for different features are highly correlated, which is common
in applications. This paper proposes a simple and natural alternating gradient
descent based algorithm, and shows that with a mild initialization it provably
recovers the ground-truth in the presence of strong correlations. In most
interesting cases, the correlation can be in the same order as the highest
possible. Our analysis also reveals several favorable features of the algorithm, including
robustness to noise. We complement our theoretical results with empirical
studies on semi-synthetic datasets, demonstrating its advantage over several
popular methods in recovering the ground truth.
Comment: Accepted to the International Conference on Machine Learning (ICML), 2017
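For context, here is a minimal sketch of the generic alternating (projected) gradient descent loop for NMF that this line of work analyzes; the paper's provable algorithm uses its own specific updates and initialization, so treat this only as an illustration of the framework.

```python
import numpy as np

def alternating_gd_nmf(X, r, steps=500, lr=1e-3, seed=0):
    """Generic alternating (projected) gradient descent for X ~ A @ W with
    A, W >= 0. A sketch of the general framework only, not the paper's
    provable algorithm or its initialization."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    A = np.abs(rng.standard_normal((n, r)))
    W = np.abs(rng.standard_normal((r, m)))
    for _ in range(steps):
        # Gradient step on A with W fixed, then project onto the nonnegative orthant.
        A = np.maximum(A - lr * (A @ W - X) @ W.T, 0.0)
        # Gradient step on W with A fixed, then project.
        W = np.maximum(W - lr * A.T @ (A @ W - X), 0.0)
    return A, W

# Tiny usage example on synthetic nonnegative data.
rng = np.random.default_rng(1)
X = np.abs(rng.standard_normal((50, 5))) @ np.abs(rng.standard_normal((5, 40)))
A, W = alternating_gd_nmf(X, r=5)
print(np.linalg.norm(X - A @ W) / np.linalg.norm(X))
```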
Convergence Analysis of Two-layer Neural Networks with ReLU Activation
In recent years, stochastic gradient descent (SGD) based techniques have
become the standard tools for training neural networks. However, formal
theoretical understanding of why SGD can train neural networks in practice is
largely missing.
In this paper, we make progress on understanding this mystery by providing a
convergence analysis for SGD on a rich subset of two-layer feedforward networks
with ReLU activations. This subset is characterized by a special structure
called "identity mapping". We prove that, if input follows from Gaussian
distribution, with standard initialization of the weights, SGD
converges to the global minimum in polynomial number of steps. Unlike normal
vanilla networks, the "identity mapping" makes our network asymmetric and thus
the global minimum is unique. To complement our theory, we are also able to
show experimentally that multi-layer networks with this mapping have better
performance compared with normal vanilla networks.
Our convergence analysis differs from traditional non-convex optimization
techniques. We show that SGD converges to the optimum in "two phases": in phase I,
the gradient points in the wrong direction; however, a potential function
gradually decreases. Then in phase II, SGD enters a nice one point convex
region and converges. We also show that the identity mapping is necessary for
convergence, as it moves the initial point to a better place for optimization.
Experiments verify our claims.
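As a rough illustration of the setup, here is a minimal NumPy sketch of SGD on a two-layer ReLU network with an "identity mapping" in the first layer, $f(x) = a^\top \mathrm{ReLU}((I+W)x)$, trained on Gaussian inputs against a hypothetical teacher network. The fixed second layer, the teacher, and all hyperparameters are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

d = 20
rng = np.random.default_rng(0)
a = np.ones(d) / d                                        # fixed output layer (illustrative)
W_star = 0.5 * rng.standard_normal((d, d)) / np.sqrt(d)   # hypothetical teacher weights
W = np.zeros((d, d))                                      # small "standard" initialization

def f(W, x):
    return a @ relu((np.eye(d) + W) @ x)                  # identity-mapping network

lr = 0.1
for step in range(20000):                                 # online SGD on Gaussian inputs
    x = rng.standard_normal(d)
    y = f(W_star, x)                                      # teacher label
    h = (np.eye(d) + W) @ x
    pred = a @ relu(h)
    grad_h = (pred - y) * a * (h > 0)                     # chain rule through the ReLU
    W -= lr * np.outer(grad_h, x)                         # squared-loss SGD step on W

test_x = rng.standard_normal((1000, d))
print("test squared error:", np.mean([(f(W, x) - f(W_star, x)) ** 2 for x in test_x]))
```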
Learning Mixtures of Linear Regressions with Nearly Optimal Complexity
Mixtures of Linear Regressions (MLR) is an important mixture model with many
applications. In this model, each observation is generated from one of the
several unknown linear regression components, where the identity of the
generated component is also unknown. Previous works either make strong
assumptions on the data distribution or have high complexity. This paper
proposes a fixed-parameter tractable algorithm for the problem under general
conditions, which achieves global convergence and whose sample complexity scales
nearly linearly in the dimension. In particular, different from previous works
that require the data to be from the standard Gaussian, the algorithm allows
the data to come from Gaussians with different covariances. When the condition number
of the covariances and the number of components are fixed, the algorithm has
nearly optimal sample complexity $\tilde{O}(d)$ as well as nearly optimal
computational complexity, where $d$ is the dimension of the
data space. To the best of our knowledge, this approach provides the first such
recovery guarantee for this general setting.
Comment: Fix some typesetting issue in v
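For intuition about the model (not the paper's algorithm), here is a standard alternating-minimization baseline for MLR: assign each point to its best-fitting regressor, then refit each regressor by least squares. Unlike the algorithm above, this heuristic has no global guarantee without good initialization.

```python
import numpy as np

def alt_min_mlr(X, y, k, iters=50, seed=0):
    """Alternating minimization baseline for Mixtures of Linear Regressions:
    hard-assign each point to its best-fitting component, then refit each
    component by least squares. Shown only to illustrate the model; this is
    NOT the paper's fixed-parameter tractable algorithm."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    B = rng.standard_normal((k, d))              # component regressors
    for _ in range(iters):
        resid = (X @ B.T - y[:, None]) ** 2      # (n, k) squared residuals
        z = resid.argmin(axis=1)                 # hard assignments
        for j in range(k):
            idx = z == j
            if idx.sum() >= d:                   # refit component j if it has enough points
                B[j], *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return B

# Synthetic MLR data: each y_i comes from one of two unknown regressors.
rng = np.random.default_rng(1)
n, d = 2000, 5
B_true = rng.standard_normal((2, d))
X = rng.standard_normal((n, d))
z = rng.integers(0, 2, size=n)
y = np.einsum('ij,ij->i', X, B_true[z]) + 0.01 * rng.standard_normal(n)
print(alt_min_mlr(X, y, k=2).round(2))
```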
What Can ResNet Learn Efficiently, Going Beyond Kernels?
How can neural networks such as ResNet efficiently learn CIFAR-10 with test
accuracy more than 96%, while other methods, especially kernel methods, fall
relatively behind? Can we provide theoretical justification for this gap?
Recently, there has been an influential line of work relating neural networks to
kernels in the over-parameterized regime, proving they can learn certain
concept classes that are also learnable by kernels with similar test error. Yet,
can neural networks provably learn some concept class BETTER than kernels?
We answer this positively in the distribution-free setting. We prove neural
networks can efficiently learn a notable class of functions, including those
defined by three-layer residual networks with smooth activations, without any
distributional assumption. At the same time, we prove there are simple
functions in this class such that with the same number of training examples,
the test error obtained by neural networks can be MUCH SMALLER than ANY kernel
method, including neural tangent kernels (NTK).
The main intuition is that multi-layer neural networks can implicitly perform
hierarchical learning using different layers, which reduces the sample
complexity compared to "one-shot" learning algorithms such as kernel methods.
In a follow-up work [2], this theory of hierarchical learning is further
strengthened to incorporate the "backward feature correction" process when
training deep networks.
In the end, we also prove a computational complexity advantage of ResNet with
respect to other learning methods, including linear regression over arbitrary
feature mappings.
Comment: V2 slightly improves the lower bound, V3 strengthens experiments and adds
a citation to "backward feature correction", which is an even stronger form of
hierarchical learning [2].
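To illustrate the kind of target the separation concerns, here is a small sketch of a function computed by a three-layer residual network with a smooth activation; the widths, the use of tanh, and the exact residual form are my illustrative choices, not the paper's precise concept class.

```python
import numpy as np

# Hypothetical target function: two residual blocks with a smooth activation,
# followed by a linear read-out. Purely an illustration of the function class.
d = 16
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d, d)) / np.sqrt(d)
W2 = rng.standard_normal((d, d)) / np.sqrt(d)
a = rng.standard_normal(d) / np.sqrt(d)

def residual_target(x):
    h1 = x + np.tanh(W1 @ x)      # first residual block (smooth activation)
    h2 = h1 + np.tanh(W2 @ h1)    # second residual block
    return a @ h2                 # scalar output

print(residual_target(rng.standard_normal(d)))
```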
LazySVD: Even Faster SVD Decomposition Yet Without Agonizing Pain
We study $k$-SVD, that is, to obtain the first $k$ singular vectors of a matrix
$A$. Recently, a few breakthroughs have been discovered on $k$-SVD: Musco and
Musco [1] proved the first gap-free convergence result using the block Krylov
method, Shamir [2] discovered the first variance-reduction stochastic method,
and Bhojanapalli et al. [3] provided the fastest $O(\mathrm{nnz}(A) + \mathrm{poly}(1/\varepsilon))$-time algorithm using alternating minimization.
In this paper, we put forward a new and simple LazySVD framework to improve
the above breakthroughs. This framework leads to a faster gap-free method
outperforming [1], and the first accelerated and stochastic method
outperforming [2]. In the $O(\mathrm{nnz}(A) + \mathrm{poly}(1/\varepsilon))$
running-time regime, LazySVD outperforms [3] in certain parameter regimes
without even using alternating minimization.
Comment: first circulated on May 20, 2016; this newer version improves writing
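Below is a minimal sketch of the framework's structure as described above, computing singular vectors one at a time with deflation; plain power iteration stands in for the accelerated/stochastic 1-SVD subroutine the paper actually plugs in, so running times here do not reflect the paper's guarantees.

```python
import numpy as np

def lazy_topk_svd(A, k, power_iters=200, seed=0):
    """Sketch of a LazySVD-style loop: find one singular direction at a time,
    then deflate, rather than solving the k-SVD problem jointly. Power
    iteration is only a stand-in for the fast 1-SVD solver used in the paper."""
    rng = np.random.default_rng(seed)
    M = A.T @ A                                 # Gram matrix of A
    d = M.shape[0]
    V = np.zeros((d, k))
    for j in range(k):
        v = rng.standard_normal(d)
        for _ in range(power_iters):            # 1-SVD via power iteration
            v = M @ v
            v /= np.linalg.norm(v)
        V[:, j] = v
        M = M - (v @ M @ v) * np.outer(v, v)    # deflate the found direction
    return V

A = np.random.default_rng(1).standard_normal((100, 30))
V = lazy_topk_svd(A, k=3)
_, _, Vt = np.linalg.svd(A, full_matrices=False)
print(np.abs(np.sum(V * Vt[:3].T, axis=0)).round(3))   # alignment with true top-3 directions
```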
Neon2: Finding Local Minima via First-Order Oracles
We propose a reduction for non-convex optimization that can (1) turn a
stationary-point finding algorithm into a local-minimum finding one, and (2)
replace the Hessian-vector product computations with only gradient
computations. It works both in the stochastic and the deterministic settings,
without hurting the algorithm's performance.
As applications, our reduction turns Natasha2 into a first-order method
without hurting its performance. It also converts SGD, GD, SCSG, and SVRG into
algorithms finding approximate local minima, outperforming some of the best known
results.
Comment: versions 2 and 3 improve writing
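The gradient-only replacement of Hessian-vector products can be illustrated with a finite-difference identity, $\nabla^2 f(w)\,v \approx (\nabla f(w + qv) - \nabla f(w))/q$; the sketch below shows just this building block, as one example of the kind of gradient-only surrogate such a reduction can rely on, not the full reduction (how directions are chosen and iterated is omitted).

```python
import numpy as np

def hvp_from_gradients(grad, w, v, q=1e-5):
    """Approximate a Hessian-vector product using only gradient evaluations:
    H(w) v ~ (grad(w + q*v) - grad(w)) / q."""
    return (grad(w + q * v) - grad(w)) / q

# Check on a simple quadratic f(w) = 0.5 * w^T A w, whose Hessian is A.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A = (A + A.T) / 2
grad = lambda w: A @ w
w, v = rng.standard_normal(5), rng.standard_normal(5)
print(np.allclose(hvp_from_gradients(grad, w, v), A @ v, atol=1e-3))
```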
First Efficient Convergence for Streaming k-PCA: a Global, Gap-Free, and Near-Optimal Rate
We study streaming principal component analysis (PCA), that is, to find, in
$O(dk)$ space, the top $k$ eigenvectors of a $d \times d$ hidden matrix $\Sigma$ with online vectors drawn from covariance matrix $\Sigma$.
We provide global convergence for Oja's algorithm, which is
popularly used in practice but lacks theoretical understanding for $k > 1$. We
also provide a modified variant Oja++ that runs even faster than Oja's. Our results match the information theoretic lower bound in
terms of dependency on error, on eigengap, on rank $k$, and on dimension $d$,
up to poly-log factors. In addition, our convergence rate can be made gap-free,
that is proportional to the approximation error and independent of the
eigengap.
In contrast, for general rank $k$, before our work (1) it was open to design
any algorithm with an efficient global convergence rate; and (2) it was open to
design any algorithm with an (even local) gap-free convergence rate in $O(dk)$
space.
Comment: REMARK: v4 adds discussions and polishes writing; v3 contains a
stronger Theorem 2, a new lower bound Theorem 6, as well as new Oja++ results
Theorem 4 and Theorem
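A minimal sketch of Oja's algorithm in the streaming top-$k$ setting, using $O(dk)$ memory: one rank-$k$ stochastic update per sample followed by re-orthonormalization. The constant step size is a simplification of the schedule used in the analysis, and the Oja++ modification is not shown.

```python
import numpy as np

def oja_topk(stream, d, k, eta=0.05):
    """Oja's algorithm for streaming k-PCA: a stochastic update per incoming
    sample, followed by QR re-orthonormalization, keeping O(dk) memory.
    The constant step size eta is an illustrative simplification."""
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for x in stream:
        Q = Q + eta * np.outer(x, x @ Q)   # stochastic update with sample x
        Q, _ = np.linalg.qr(Q)             # re-orthonormalize the k columns
    return Q

# Synthetic stream whose covariance has a planted top-k subspace.
d, k, n = 50, 3, 20000
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
stream = (U @ (2.0 * rng.standard_normal(k)) + 0.3 * rng.standard_normal(d)
          for _ in range(n))
Q = oja_topk(stream, d, k)
print(np.linalg.norm(Q.T @ U))   # close to sqrt(k) when the subspace is recovered
```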
Recovery guarantee of weighted low-rank approximation via alternating minimization
Many applications require recovering a ground truth low-rank matrix from
noisy observations of the entries, which in practice is typically formulated as
a weighted low-rank approximation problem and solved by non-convex optimization
heuristics such as alternating minimization. In this paper, we provide a provable
recovery guarantee for weighted low-rank approximation via a simple alternating minimization
algorithm. In particular, for a natural class of matrices and weights and
without any assumption on the noise, we bound the spectral norm of the
difference between the recovered matrix and the ground truth, by the spectral
norm of the weighted noise plus an additive error that decreases exponentially
with the number of rounds of alternating minimization, from either
initialization by SVD or, more importantly, random initialization. These
provide the first theoretical results for weighted low-rank approximation via alternating
minimization with non-binary deterministic weights, significantly generalizing
those for matrix completion, the special case with binary weights, since our
assumptions are similar or weaker than those made in existing works.
Furthermore, this is achieved by a very simple algorithm that improves the
vanilla alternating minimization with a simple clipping step.
The key technical challenge is that under non-binary deterministic weights,
na\"ive alternating steps will destroy the incoherence and spectral properties
of the intermediate solutions, which are needed for making progress towards the
ground truth. We show that the properties only need to hold in an average sense
and can be achieved by the clipping step.
We further provide an alternating algorithm that uses a whitening step that
keeps the properties via SDP and Rademacher rounding and thus requires weaker
assumptions. This technique can potentially be applied in some other
applications and is of independent interest.
Comment: 40 pages. Updated with the ICML 2016 camera-ready version, together
with an additional algorithm which needs fewer assumptions in Appendix
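A minimal sketch of the vanilla alternating-minimization step for weighted low-rank approximation, solving a small weighted least-squares problem per row of each factor; the clipping/whitening steps that make the paper's analysis go through are omitted here.

```python
import numpy as np

def weighted_altmin(M, W, r, iters=20, seed=0):
    """Plain alternating minimization for min_{U,V} || W * (M - U V^T) ||_F^2
    (elementwise weights W >= 0), updating one factor at a time via weighted
    least squares per row. The paper's provable variant additionally clips
    (or whitens) the iterates, which is not shown."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U = rng.standard_normal((n, r))
    V = rng.standard_normal((m, r))
    for _ in range(iters):
        for i in range(n):                       # update row i of U with V fixed
            D = W[i]
            G = V.T @ (D[:, None] * V) + 1e-9 * np.eye(r)
            U[i] = np.linalg.solve(G, V.T @ (D * M[i]))
        for j in range(m):                       # update row j of V with U fixed
            D = W[:, j]
            G = U.T @ (D[:, None] * U) + 1e-9 * np.eye(r)
            V[j] = np.linalg.solve(G, U.T @ (D * M[:, j]))
    return U, V

# Tiny demo: noisy rank-2 matrix with random bounded weights.
rng = np.random.default_rng(1)
A0 = rng.standard_normal((30, 2))
B0 = rng.standard_normal((20, 2))
M = A0 @ B0.T + 0.01 * rng.standard_normal((30, 20))
W = rng.uniform(0.5, 1.5, size=(30, 20))
U, V = weighted_altmin(M, W, r=2)
print(np.linalg.norm(M - U @ V.T) / np.linalg.norm(M))
```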
Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations
We show that the gradient descent algorithm provides an implicit
regularization effect in the learning of over-parameterized matrix
factorization models and one-hidden-layer neural networks with quadratic
activations. Concretely, we show that given $\tilde{O}(dr^2)$ random linear
measurements of a rank-$r$ positive semidefinite matrix $X^\star$, we can
recover $X^\star$ by parameterizing it by $UU^\top$ with $U \in \mathbb{R}^{d \times d}$ and minimizing the squared loss, even if $r \ll d$. We prove
that starting from a small initialization, gradient descent approximately recovers
$X^\star$ in a polynomial number of iterations. The results
solve the conjecture of Gunasekar et al.'17 under the restricted isometry
property. The technique can be applied to analyzing neural networks with
one-hidden-layer quadratic activations with some technical modifications.
Comment: COLT 2018 best paper; fixed minor missing steps in the previous
version
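A minimal sketch of the over-parameterized matrix-sensing setup: gradient descent on $U \in \mathbb{R}^{d \times d}$ from a small random initialization, minimizing the squared measurement loss. Step size, iteration count, and initialization scale are illustrative choices, not the values prescribed by the analysis.

```python
import numpy as np

def matrix_sensing_gd(As, y, d, steps=500, lr=0.01, init_scale=1e-3, seed=0):
    """Gradient descent on the over-parameterized factorization X = U U^T,
    minimizing the average squared measurement loss from a small random
    initialization. Hyperparameters here are illustrative only."""
    rng = np.random.default_rng(seed)
    U = init_scale * rng.standard_normal((d, d))              # over-parameterized, small init
    m = len(As)
    for _ in range(steps):
        X = U @ U.T
        resid = np.array([np.sum(A * X) for A in As]) - y     # <A_i, U U^T> - y_i
        grad = (2.0 / m) * sum(r * (A + A.T) for r, A in zip(resid, As)) @ U
        U -= lr * grad
    return U @ U.T

# Synthetic instance: rank-1 PSD ground truth, Gaussian measurement matrices.
rng = np.random.default_rng(1)
d, m = 10, 200
b = rng.standard_normal((d, 1))
X_star = b @ b.T
As = [rng.standard_normal((d, d)) for _ in range(m)]
y = np.array([np.sum(A * X_star) for A in As])
X_hat = matrix_sensing_gd(As, y, d)
print(np.linalg.norm(X_hat - X_star) / np.linalg.norm(X_star))
```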