Approximate maximum entropy principles via Goemans-Williamson with applications to provable variational methods
The well-known maximum-entropy principle due to Jaynes, which states that
given mean parameters, the maximum entropy distribution matching them is in an
exponential family, has been very popular in machine learning due to its
"Occam's razor" interpretation. Unfortunately, calculating the potentials in
the maximum-entropy distribution is intractable \cite{bresler2014hardness}. We
provide computationally efficient versions of this principle when the mean
parameters are pairwise moments: we design distributions that approximately
match given pairwise moments, while having entropy which is comparable to the
maximum entropy distribution matching those moments.
We additionally provide surprising applications of the approximate maximum
entropy principle to designing provable variational methods for partition
function calculations for Ising models without any assumptions on the
potentials of the model. More precisely, we show that at every temperature, we
can get approximation guarantees for the log-partition function comparable to
those in the low-temperature limit, which is the setting of optimization of
quadratic forms over the hypercube \cite{alon2006approximating}.
Comment: 12 pages
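To make the principle concrete (the notation below is mine, not the paper's): for variables $x \in \{-1,1\}^n$ and target pairwise moments $M_{ij}$, Jaynes' principle asks for
\[
\max_{p}\; H(p) \quad \text{subject to} \quad \mathbb{E}_{x \sim p}[x_i x_j] = M_{ij} \;\; \text{for all } i < j,
\]
whose maximizer is an Ising-type exponential family $p_J(x) \propto \exp\big(\sum_{i<j} J_{ij} x_i x_j\big)$; computing the potentials $J_{ij}$ (and hence the partition function) is the intractable step that the approximate version above works around.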
Provable Alternating Gradient Descent for Non-negative Matrix Factorization with Strong Correlations
Non-negative matrix factorization is a basic tool for decomposing data into
the feature and weight matrices under non-negativity constraints, and in
practice is often solved in the alternating minimization framework. However, it
is unclear whether such algorithms can recover the ground-truth feature matrix
when the weights for different features are highly correlated, which is common
in applications. This paper proposes a simple and natural alternating gradient
descent based algorithm, and shows that with a mild initialization it provably
recovers the ground-truth in the presence of strong correlations. In most
interesting cases, the correlation can be in the same order as the highest
possible. Our analysis also reveals several favorable features of the algorithm, including
robustness to noise. We complement our theoretical results with empirical
studies on semi-synthetic datasets, demonstrating its advantage over several
popular methods in recovering the ground truth.
Comment: Accepted to the International Conference on Machine Learning (ICML), 2017
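For context, here is a minimal sketch of the generic alternating (projected) gradient descent loop for NMF that this line of work analyzes; the paper's provable algorithm uses its own specific updates and initialization, so treat this only as an illustration of the framework.

```python
import numpy as np

def alternating_gd_nmf(X, r, steps=500, lr=1e-3, seed=0):
    """Generic alternating (projected) gradient descent for X ~ A @ W with
    A, W >= 0. A sketch of the general framework only, not the paper's
    provable algorithm or its initialization."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    A = np.abs(rng.standard_normal((n, r)))
    W = np.abs(rng.standard_normal((r, m)))
    for _ in range(steps):
        # Gradient step on A with W fixed, then project onto the nonnegative orthant.
        A = np.maximum(A - lr * (A @ W - X) @ W.T, 0.0)
        # Gradient step on W with A fixed, then project.
        W = np.maximum(W - lr * A.T @ (A @ W - X), 0.0)
    return A, W

# Tiny usage example on synthetic nonnegative data.
rng = np.random.default_rng(1)
X = np.abs(rng.standard_normal((50, 5))) @ np.abs(rng.standard_normal((5, 40)))
A, W = alternating_gd_nmf(X, r=5)
print(np.linalg.norm(X - A @ W) / np.linalg.norm(X))
```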
Convergence Analysis of Two-layer Neural Networks with ReLU Activation
In recent years, stochastic gradient descent (SGD) based techniques have
become the standard tools for training neural networks. However, formal
theoretical understanding of why SGD can train neural networks in practice is
largely missing.
In this paper, we make progress on understanding this mystery by providing a
convergence analysis for SGD on a rich subset of two-layer feedforward networks
with ReLU activations. This subset is characterized by a special structure
called "identity mapping". We prove that, if input follows from Gaussian
distribution, with standard initialization of the weights, SGD
converges to the global minimum in polynomial number of steps. Unlike normal
vanilla networks, the "identity mapping" makes our network asymmetric and thus
the global minimum is unique. To complement our theory, we are also able to
show experimentally that multi-layer networks with this mapping have better
performance compared with normal vanilla networks.
Our convergence analysis differs from traditional non-convex optimization
techniques. We show that SGD converges to the optimum in "two phases": in phase I,
the gradient points in the wrong direction; however, a potential function
gradually decreases. Then in phase II, SGD enters a nice one point convex
region and converges. We also show that the identity mapping is necessary for
convergence, as it moves the initial point to a better place for optimization.
Experiments verify our claims.
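As a rough illustration of the setup, here is a minimal NumPy sketch of SGD on a two-layer ReLU network with an "identity mapping" in the first layer, $f(x) = a^\top \mathrm{ReLU}((I+W)x)$, trained on Gaussian inputs against a hypothetical teacher network. The fixed second layer, the teacher, and all hyperparameters are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

d = 20
rng = np.random.default_rng(0)
a = np.ones(d) / d                                        # fixed output layer (illustrative)
W_star = 0.5 * rng.standard_normal((d, d)) / np.sqrt(d)   # hypothetical teacher weights
W = np.zeros((d, d))                                      # small "standard" initialization

def f(W, x):
    return a @ relu((np.eye(d) + W) @ x)                  # identity-mapping network

lr = 0.1
for step in range(20000):                                 # online SGD on Gaussian inputs
    x = rng.standard_normal(d)
    y = f(W_star, x)                                      # teacher label
    h = (np.eye(d) + W) @ x
    pred = a @ relu(h)
    grad_h = (pred - y) * a * (h > 0)                     # chain rule through the ReLU
    W -= lr * np.outer(grad_h, x)                         # squared-loss SGD step on W

test_x = rng.standard_normal((1000, d))
print("test squared error:", np.mean([(f(W, x) - f(W_star, x)) ** 2 for x in test_x]))
```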
Learning Mixtures of Linear Regressions with Nearly Optimal Complexity
Mixtures of Linear Regressions (MLR) is an important mixture model with many
applications. In this model, each observation is generated from one of the
several unknown linear regression components, where the identity of the
generated component is also unknown. Previous works either make strong
assumptions on the data distribution or have high complexity. This paper
proposes a fixed-parameter tractable algorithm for the problem under general
conditions, which achieves global convergence and whose sample complexity scales
nearly linearly in the dimension. In particular, different from previous works
that require the data to be from the standard Gaussian, the algorithm allows
the data to come from Gaussians with different covariances. When the condition number
of the covariances and the number of components are fixed, the algorithm has
nearly optimal sample complexity $\tilde{O}(d)$ as well as nearly optimal
computational complexity, where $d$ is the dimension of the
data space. To the best of our knowledge, this approach provides the first such
recovery guarantee for this general setting.
Comment: Fix some typesetting issue in v
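For intuition about the model (not the paper's algorithm), here is a standard alternating-minimization baseline for MLR: assign each point to its best-fitting regressor, then refit each regressor by least squares. Unlike the algorithm above, this heuristic has no global guarantee without good initialization.

```python
import numpy as np

def alt_min_mlr(X, y, k, iters=50, seed=0):
    """Alternating minimization baseline for Mixtures of Linear Regressions:
    hard-assign each point to its best-fitting component, then refit each
    component by least squares. Shown only to illustrate the model; this is
    NOT the paper's fixed-parameter tractable algorithm."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    B = rng.standard_normal((k, d))              # component regressors
    for _ in range(iters):
        resid = (X @ B.T - y[:, None]) ** 2      # (n, k) squared residuals
        z = resid.argmin(axis=1)                 # hard assignments
        for j in range(k):
            idx = z == j
            if idx.sum() >= d:                   # refit component j if it has enough points
                B[j], *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return B

# Synthetic MLR data: each y_i comes from one of two unknown regressors.
rng = np.random.default_rng(1)
n, d = 2000, 5
B_true = rng.standard_normal((2, d))
X = rng.standard_normal((n, d))
z = rng.integers(0, 2, size=n)
y = np.einsum('ij,ij->i', X, B_true[z]) + 0.01 * rng.standard_normal(n)
print(alt_min_mlr(X, y, k=2).round(2))
```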
What Can ResNet Learn Efficiently, Going Beyond Kernels?
How can neural networks such as ResNet efficiently learn CIFAR-10 with test
accuracy more than 96%, while other methods, especially kernel methods, fall
relatively behind? Can we provide theoretical justification for this gap?
Recently, there has been an influential line of work relating neural networks to
kernels in the over-parameterized regime, proving they can learn certain
concept classes that are also learnable by kernels with similar test error. Yet,
can neural networks provably learn some concept class BETTER than kernels?
We answer this positively in the distribution-free setting. We prove neural
networks can efficiently learn a notable class of functions, including those
defined by three-layer residual networks with smooth activations, without any
distributional assumption. At the same time, we prove there are simple
functions in this class such that with the same number of training examples,
the test error obtained by neural networks can be MUCH SMALLER than ANY kernel
method, including neural tangent kernels (NTK).
The main intuition is that multi-layer neural networks can implicitly perform
hierarchical learning using different layers, which reduces the sample
complexity compared to "one-shot" learning algorithms such as kernel methods.
In a follow-up work [2], this theory of hierarchical learning is further
strengthened to incorporate the "backward feature correction" process when
training deep networks.
In the end, we also prove a computational complexity advantage of ResNet with
respect to other learning methods, including linear regression over arbitrary
feature mappings.
Comment: V2 slightly improves the lower bound, V3 strengthens experiments and adds
a citation to "backward feature correction", which is an even stronger form of
hierarchical learning [2].
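To illustrate the kind of target the separation concerns, here is a small sketch of a function computed by a three-layer residual network with a smooth activation; the widths, the use of tanh, and the exact residual form are my illustrative choices, not the paper's precise concept class.

```python
import numpy as np

# Hypothetical target function: two residual blocks with a smooth activation,
# followed by a linear read-out. Purely an illustration of the function class.
d = 16
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d, d)) / np.sqrt(d)
W2 = rng.standard_normal((d, d)) / np.sqrt(d)
a = rng.standard_normal(d) / np.sqrt(d)

def residual_target(x):
    h1 = x + np.tanh(W1 @ x)      # first residual block (smooth activation)
    h2 = h1 + np.tanh(W2 @ h1)    # second residual block
    return a @ h2                 # scalar output

print(residual_target(rng.standard_normal(d)))
```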
LazySVD: Even Faster SVD Decomposition Yet Without Agonizing Pain
We study $k$-SVD, that is, to obtain the first $k$ singular vectors of a matrix
$A$. Recently, a few breakthroughs have been discovered on $k$-SVD: Musco and
Musco [1] proved the first gap-free convergence result using the block Krylov
method, Shamir [2] discovered the first variance-reduction stochastic method,
and Bhojanapalli et al. [3] provided the fastest $O(\mathrm{nnz}(A) + \mathrm{poly}(1/\varepsilon))$-time algorithm using alternating minimization.
In this paper, we put forward a new and simple LazySVD framework to improve
the above breakthroughs. This framework leads to a faster gap-free method
outperforming [1], and the first accelerated and stochastic method
outperforming [2]. In the $O(\mathrm{nnz}(A) + \mathrm{poly}(1/\varepsilon))$
running-time regime, LazySVD outperforms [3] in certain parameter regimes
without even using alternating minimization.
Comment: first circulated on May 20, 2016; this newer version improves writing
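Below is a minimal sketch of the framework's structure as described above, computing singular vectors one at a time with deflation; plain power iteration stands in for the accelerated/stochastic 1-SVD subroutine the paper actually plugs in, so running times here do not reflect the paper's guarantees.

```python
import numpy as np

def lazy_topk_svd(A, k, power_iters=200, seed=0):
    """Sketch of a LazySVD-style loop: find one singular direction at a time,
    then deflate, rather than solving the k-SVD problem jointly. Power
    iteration is only a stand-in for the fast 1-SVD solver used in the paper."""
    rng = np.random.default_rng(seed)
    M = A.T @ A                                 # Gram matrix of A
    d = M.shape[0]
    V = np.zeros((d, k))
    for j in range(k):
        v = rng.standard_normal(d)
        for _ in range(power_iters):            # 1-SVD via power iteration
            v = M @ v
            v /= np.linalg.norm(v)
        V[:, j] = v
        M = M - (v @ M @ v) * np.outer(v, v)    # deflate the found direction
    return V

A = np.random.default_rng(1).standard_normal((100, 30))
V = lazy_topk_svd(A, k=3)
_, _, Vt = np.linalg.svd(A, full_matrices=False)
print(np.abs(np.sum(V * Vt[:3].T, axis=0)).round(3))   # alignment with true top-3 directions
```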
Neon2: Finding Local Minima via First-Order Oracles
We propose a reduction for non-convex optimization that can (1) turn a
stationary-point finding algorithm into a local-minimum finding one, and (2)
replace the Hessian-vector product computations with only gradient
computations. It works both in the stochastic and the deterministic settings,
without hurting the algorithm's performance.
As applications, our reduction turns Natasha2 into a first-order method
without hurting its performance. It also converts SGD, GD, SCSG, and SVRG into
algorithms finding approximate local minima, outperforming some of the best known
results.
Comment: versions 2 and 3 improve writing
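The gradient-only replacement of Hessian-vector products can be illustrated with a finite-difference identity, $\nabla^2 f(w)\,v \approx (\nabla f(w + qv) - \nabla f(w))/q$; the sketch below shows just this building block, as one example of the kind of gradient-only surrogate such a reduction can rely on, not the full reduction (how directions are chosen and iterated is omitted).

```python
import numpy as np

def hvp_from_gradients(grad, w, v, q=1e-5):
    """Approximate a Hessian-vector product using only gradient evaluations:
    H(w) v ~ (grad(w + q*v) - grad(w)) / q."""
    return (grad(w + q * v) - grad(w)) / q

# Check on a simple quadratic f(w) = 0.5 * w^T A w, whose Hessian is A.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A = (A + A.T) / 2
grad = lambda w: A @ w
w, v = rng.standard_normal(5), rng.standard_normal(5)
print(np.allclose(hvp_from_gradients(grad, w, v), A @ v, atol=1e-3))
```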
First Efficient Convergence for Streaming k-PCA: a Global, Gap-Free, and Near-Optimal Rate
We study streaming principal component analysis (PCA), that is, to find, in
$O(dk)$ space, the top $k$ eigenvectors of a $d \times d$ hidden matrix $\Sigma$ with online vectors drawn from covariance matrix $\Sigma$.
We provide global convergence for Oja's algorithm, which is
popularly used in practice but lacks theoretical understanding for $k > 1$. We
also provide a modified variant Oja++ that runs even faster than Oja's. Our results match the information theoretic lower bound in
terms of dependency on error, on eigengap, on rank $k$, and on dimension $d$,
up to poly-log factors. In addition, our convergence rate can be made gap-free,
that is proportional to the approximation error and independent of the
eigengap.
In contrast, for general rank $k$, before our work (1) it was open to design
any algorithm with an efficient global convergence rate; and (2) it was open to
design any algorithm with an (even local) gap-free convergence rate in $O(dk)$
space.
Comment: REMARK: v4 adds discussions and polishes writing; v3 contains a
stronger Theorem 2, a new lower bound Theorem 6, as well as new Oja++ results
Theorem 4 and Theorem
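A minimal sketch of Oja's algorithm in the streaming top-$k$ setting, using $O(dk)$ memory: one rank-$k$ stochastic update per sample followed by re-orthonormalization. The constant step size is a simplification of the schedule used in the analysis, and the Oja++ modification is not shown.

```python
import numpy as np

def oja_topk(stream, d, k, eta=0.05):
    """Oja's algorithm for streaming k-PCA: a stochastic update per incoming
    sample, followed by QR re-orthonormalization, keeping O(dk) memory.
    The constant step size eta is an illustrative simplification."""
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for x in stream:
        Q = Q + eta * np.outer(x, x @ Q)   # stochastic update with sample x
        Q, _ = np.linalg.qr(Q)             # re-orthonormalize the k columns
    return Q

# Synthetic stream whose covariance has a planted top-k subspace.
d, k, n = 50, 3, 20000
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
stream = (U @ (2.0 * rng.standard_normal(k)) + 0.3 * rng.standard_normal(d)
          for _ in range(n))
Q = oja_topk(stream, d, k)
print(np.linalg.norm(Q.T @ U))   # close to sqrt(k) when the subspace is recovered
```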
Recovery guarantee of weighted low-rank approximation via alternating minimization
Many applications require recovering a ground truth low-rank matrix from
noisy observations of the entries, which in practice is typically formulated as
a weighted low-rank approximation problem and solved by non-convex optimization
heuristics such as alternating minimization. In this paper, we provide a provable
recovery guarantee for weighted low-rank approximation via a simple alternating minimization
algorithm. In particular, for a natural class of matrices and weights and
without any assumption on the noise, we bound the spectral norm of the
difference between the recovered matrix and the ground truth, by the spectral
norm of the weighted noise plus an additive error that decreases exponentially
with the number of rounds of alternating minimization, from either
initialization by SVD or, more importantly, random initialization. These
provide the first theoretical results for weighted low-rank approximation via alternating
minimization with non-binary deterministic weights, significantly generalizing
those for matrix completion, the special case with binary weights, since our
assumptions are similar or weaker than those made in existing works.
Furthermore, this is achieved by a very simple algorithm that improves the
vanilla alternating minimization with a simple clipping step.
The key technical challenge is that under non-binary deterministic weights,
na\"ive alternating steps will destroy the incoherence and spectral properties
of the intermediate solutions, which are needed for making progress towards the
ground truth. We show that the properties only need to hold in an average sense
and can be achieved by the clipping step.
We further provide an alternating algorithm that uses a whitening step that
keeps the properties via SDP and Rademacher rounding and thus requires weaker
assumptions. This technique can potentially be applied in some other
applications and is of independent interest.
Comment: 40 pages. Updated with the ICML 2016 camera-ready version, together
with an additional algorithm which needs fewer assumptions in Appendix
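A minimal sketch of the vanilla alternating-minimization step for weighted low-rank approximation, solving a small weighted least-squares problem per row of each factor; the clipping/whitening steps that make the paper's analysis go through are omitted here.

```python
import numpy as np

def weighted_altmin(M, W, r, iters=20, seed=0):
    """Plain alternating minimization for min_{U,V} || W * (M - U V^T) ||_F^2
    (elementwise weights W >= 0), updating one factor at a time via weighted
    least squares per row. The paper's provable variant additionally clips
    (or whitens) the iterates, which is not shown."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U = rng.standard_normal((n, r))
    V = rng.standard_normal((m, r))
    for _ in range(iters):
        for i in range(n):                       # update row i of U with V fixed
            D = W[i]
            G = V.T @ (D[:, None] * V) + 1e-9 * np.eye(r)
            U[i] = np.linalg.solve(G, V.T @ (D * M[i]))
        for j in range(m):                       # update row j of V with U fixed
            D = W[:, j]
            G = U.T @ (D[:, None] * U) + 1e-9 * np.eye(r)
            V[j] = np.linalg.solve(G, U.T @ (D * M[:, j]))
    return U, V

# Tiny demo: noisy rank-2 matrix with random bounded weights.
rng = np.random.default_rng(1)
A0 = rng.standard_normal((30, 2))
B0 = rng.standard_normal((20, 2))
M = A0 @ B0.T + 0.01 * rng.standard_normal((30, 20))
W = rng.uniform(0.5, 1.5, size=(30, 20))
U, V = weighted_altmin(M, W, r=2)
print(np.linalg.norm(M - U @ V.T) / np.linalg.norm(M))
```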
Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations
We show that the gradient descent algorithm provides an implicit
regularization effect in the learning of over-parameterized matrix
factorization models and one-hidden-layer neural networks with quadratic
activations. Concretely, we show that given $\tilde{O}(dr^2)$ random linear
measurements of a rank-$r$ positive semidefinite matrix $X^\star$, we can
recover $X^\star$ by parameterizing it by $UU^\top$ with $U \in \mathbb{R}^{d \times d}$ and minimizing the squared loss, even if $r \ll d$. We prove
that starting from a small initialization, gradient descent approximately recovers
$X^\star$ in a polynomial number of iterations. The results
solve the conjecture of Gunasekar et al.'17 under the restricted isometry
property. The technique can be applied to analyzing neural networks with
one-hidden-layer quadratic activations with some technical modifications.
Comment: COLT 2018 best paper; fixed minor missing steps in the previous
version
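A minimal sketch of the over-parameterized matrix-sensing setup: gradient descent on $U \in \mathbb{R}^{d \times d}$ from a small random initialization, minimizing the squared measurement loss. Step size, iteration count, and initialization scale are illustrative choices, not the values prescribed by the analysis.

```python
import numpy as np

def matrix_sensing_gd(As, y, d, steps=500, lr=0.01, init_scale=1e-3, seed=0):
    """Gradient descent on the over-parameterized factorization X = U U^T,
    minimizing the average squared measurement loss from a small random
    initialization. Hyperparameters here are illustrative only."""
    rng = np.random.default_rng(seed)
    U = init_scale * rng.standard_normal((d, d))              # over-parameterized, small init
    m = len(As)
    for _ in range(steps):
        X = U @ U.T
        resid = np.array([np.sum(A * X) for A in As]) - y     # <A_i, U U^T> - y_i
        grad = (2.0 / m) * sum(r * (A + A.T) for r, A in zip(resid, As)) @ U
        U -= lr * grad
    return U @ U.T

# Synthetic instance: rank-1 PSD ground truth, Gaussian measurement matrices.
rng = np.random.default_rng(1)
d, m = 10, 200
b = rng.standard_normal((d, 1))
X_star = b @ b.T
As = [rng.standard_normal((d, d)) for _ in range(m)]
y = np.array([np.sum(A * X_star) for A in As])
X_hat = matrix_sensing_gd(As, y, d)
print(np.linalg.norm(X_hat - X_star) / np.linalg.norm(X_star))
```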