Completing Any Low-rank Matrix, Provably
Matrix completion, i.e., the exact and provable recovery of a low-rank matrix
from a small subset of its elements, is currently only known to be possible if
the matrix satisfies a restrictive structural constraint---known as {\em
incoherence}---on its row and column spaces. In these cases, the subset of
elements is sampled uniformly at random.
In this paper, we show that {\em any} rank-$r$ $n$-by-$n$ matrix can be
exactly recovered from as few as $O(nr \log^2 n)$ randomly chosen elements,
provided this random choice is made according to a {\em specific biased
distribution}: the probability of any element being sampled should be
proportional to the sum of the leverage scores of the corresponding row and
column. Perhaps equally important, we show that this specific form of sampling
is nearly necessary, in a natural precise sense; this implies that other,
perhaps more intuitive, sampling schemes fail.
We further establish three ways to use the above result for the setting when
leverage scores are not known \textit{a priori}: (a) a sampling strategy for
the case when only one of the row or column spaces is incoherent, (b) a
two-phase sampling procedure for general matrices that first samples to
estimate leverage scores followed by sampling for exact recovery, and (c) an
analysis showing the advantages of weighted nuclear/trace-norm minimization
over the vanilla un-weighted formulation for the case of non-uniform sampling.
Comment: Added a new necessary condition (Theorem 6) and a result on completion
of row coherent matrices (Corollary 4). Partial results appeared in the
International Conference on Machine Learning 2014, under the title 'Coherent
Matrix Completion'. (34 pages, 4 figures)
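For intuition, here is a minimal NumPy sketch (not the authors' code; the function name, toy matrix, and sample budget m are illustrative) of the biased sampling rule described above: each entry is drawn with probability proportional to the sum of the leverage scores of its row and column.

```python
import numpy as np

def leverage_score_mask(M, r, m, seed=0):
    """Draw m entries of M with probability proportional to the sum of the
    row and column leverage scores of its rank-r SVD factors (a sketch of the
    biased sampling scheme; exact recovery would then solve a nuclear-norm
    minimization restricted to the observed entries)."""
    rng = np.random.default_rng(seed)
    n1, n2 = M.shape
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    row_lev = np.sum(U[:, :r] ** 2, axis=1)      # leverage scores of the rows
    col_lev = np.sum(Vt[:r, :].T ** 2, axis=1)   # leverage scores of the columns
    probs = row_lev[:, None] + col_lev[None, :]  # p_ij proportional to mu_i + nu_j
    probs /= probs.sum()
    chosen = rng.choice(n1 * n2, size=m, replace=False, p=probs.ravel())
    mask = np.zeros(n1 * n2, dtype=bool)
    mask[chosen] = True
    return mask.reshape(n1, n2)                  # indicator of observed entries

# toy usage: a rank-2 test matrix with 40% of its entries observed
M = np.outer(np.arange(1.0, 51.0), np.ones(40)) + np.outer(np.ones(50), np.arange(40.0))
mask = leverage_score_mask(M, r=2, m=800)
```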
Static and Dynamic Robust PCA and Matrix Completion: A Review
Principal Components Analysis (PCA) is one of the most widely used dimension
reduction techniques. Robust PCA (RPCA) refers to the problem of PCA when the
data may be corrupted by outliers. Recent work by Cand{\`e}s, Wright, Li, and
Ma defined RPCA as a problem of decomposing a given data matrix into the sum of
a low-rank matrix (true data) and a sparse matrix (outliers). The column space
of the low-rank matrix then gives the PCA solution. This simple definition has
led to a large amount of interesting new work on provably correct, fast, and
practical solutions to RPCA. More recently, the dynamic (time-varying) version
of the RPCA problem has been studied and a series of provably correct, fast,
and memory efficient tracking solutions have been proposed. Dynamic RPCA (or
robust subspace tracking) is the problem of tracking data lying in a (slowly)
changing subspace while being robust to sparse outliers. This article provides
an exhaustive review of the last decade of literature on RPCA and its dynamic
counterpart (robust subspace tracking), along with describing their theoretical
guarantees, discussing the pros and cons of various approaches, and providing
empirical comparisons of performance and speed.
A brief overview of the (low-rank) matrix completion literature is also
provided (the focus is on works not discussed in other recent reviews). This
refers to the problem of completing a low-rank matrix when only a subset of its
entries are observed. It can be interpreted as a simpler special case of RPCA
in which the indices of the outlier corrupted entries are known.
Comment: To appear in Proceedings of the IEEE, Special Issue on Rethinking PCA
for Modern Datasets. arXiv admin note: text overlap with arXiv:1711.0949
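Written out, the convex program referenced above (often called principal component pursuit; $\lambda$ is a trade-off parameter, commonly taken as $1/\sqrt{\max(n_1,n_2)}$ for an $n_1 \times n_2$ data matrix $M$) is

  \min_{L,S} \; \|L\|_{*} + \lambda \|S\|_{1} \quad \text{subject to} \quad L + S = M,

where $\|L\|_{*}$ denotes the nuclear norm (sum of singular values) of the low-rank part and $\|S\|_{1}$ the entrywise $\ell_1$ norm of the sparse outlier part.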
Provably Correct Algorithms for Matrix Column Subset Selection with Selectively Sampled Data
We consider the problem of matrix column subset selection, which selects a
subset of columns from an input matrix such that the input can be well
approximated by the span of the selected columns. Column subset selection has
been applied to numerous real-world data applications such as population
genetics summarization, electronic circuits testing and recommendation systems.
In many applications the complete data matrix is unavailable and one needs to
select representative columns by inspecting only a small portion of the input
matrix. In this paper we propose the first provably correct column subset
selection algorithms for partially observed data matrices. Our proposed
algorithms exhibit different merits and limitations in terms of statistical
accuracy, computational efficiency, sample complexity and sampling schemes,
which together provide a nice exploration of the tradeoff between these desired
properties for column subset selection. The proposed methods employ the idea of
feedback driven sampling and are inspired by several sampling schemes
previously introduced for low-rank matrix approximation tasks (Drineas et al.,
2008; Frieze et al., 2004; Deshpande and Vempala, 2006; Krishnamurthy and
Singh, 2014). Our analysis shows that, under the assumption that the input data
matrix has incoherent rows but possibly coherent columns, all algorithms
provably converge to the best low-rank approximation of the original data as
the number of selected columns increases. Furthermore, two of the proposed
algorithms enjoy a relative error bound, which is preferred for column subset
selection and matrix approximation purposes. We also demonstrate through both
theoretical and empirical analysis the power of feedback driven sampling
compared to uniform random sampling on input matrices with highly correlated
columns.
Comment: 42 pages. Accepted to Journal of Machine Learning Research.
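As a rough illustration of the feedback-driven idea (a simplification that assumes access to the full residual, whereas the paper's algorithms work from selectively sampled entries; the function name is made up), columns can be chosen sequentially with probability proportional to their squared norm in the residual left after projecting out previously selected columns.

```python
import numpy as np

def residual_column_selection(A, k, seed=0):
    """Adaptive (feedback-driven) column sampling: at each round, draw a column
    with probability proportional to its squared norm in the current residual,
    then deflate the residual by that column's direction."""
    rng = np.random.default_rng(seed)
    R = np.array(A, dtype=float)
    selected = []
    for _ in range(k):
        scores = np.sum(R ** 2, axis=0)
        total = scores.sum()
        if total == 0.0:                       # residual already zero: A is captured
            break
        j = int(rng.choice(A.shape[1], p=scores / total))
        selected.append(j)
        q = R[:, j] / np.linalg.norm(R[:, j])  # unit vector of the chosen column
        R -= np.outer(q, q @ R)                # remove its span from the residual
    return selected
```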
Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks
Modern neural networks are typically trained in an over-parameterized regime
where the parameters of the model far exceed the size of the training data.
Such neural networks in principle have the capacity to (over)fit any set of
labels including pure noise. Despite this, somewhat paradoxically, neural
network models trained via first-order methods continue to predict well on yet
unseen test data. This paper takes a step towards demystifying this phenomenon.
Under a rich dataset model, we show that gradient descent is provably robust to
noise/corruption on a constant fraction of the labels despite
overparameterization. In particular, we prove that: (i) in the first few
iterations, where the updates are still in the vicinity of the initialization,
gradient descent fits only the correct labels, essentially ignoring the noisy
ones; and (ii) to start overfitting the noisy labels, the network must stray
rather far from the initialization, which can only occur after many more
iterations. Together, these results show that gradient descent with early
stopping is provably robust to label noise and shed light on the empirical
robustness of deep networks as well as commonly adopted heuristics to prevent
overfitting.
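A toy illustration (not the paper's setup or analysis, just the heuristic it supports): run gradient descent on noisily labeled data and stop when the loss on a small clean held-out set stops improving. All names and the least-squares model below are illustrative.

```python
import numpy as np

def gd_early_stopping(grad, loss, w0, train, val, lr=0.01, iters=2000, patience=25):
    """Gradient descent that returns the iterate with the best held-out loss,
    stopping once that loss has not improved for `patience` steps."""
    w, best_w, best_val, wait = w0.copy(), w0.copy(), np.inf, 0
    for _ in range(iters):
        w = w - lr * grad(w, train)        # step on the (noisy) training objective
        v = loss(w, val)                   # clean held-out loss as a proxy for test error
        if v < best_val:
            best_val, best_w, wait = v, w.copy(), 0
        else:
            wait += 1
            if wait >= patience:           # stop before the noisy labels get fit
                break
    return best_w

# toy usage: linear regression with 20% of training labels corrupted
rng = np.random.default_rng(0)
X, w_true = rng.normal(size=(200, 50)), rng.normal(size=50)
y = X @ w_true
y[:40] += rng.normal(scale=10.0, size=40)
Xv = rng.normal(size=(50, 50)); yv = Xv @ w_true
grad = lambda w, d: d[0].T @ (d[0] @ w - d[1]) / len(d[1])
mse = lambda w, d: float(np.mean((d[0] @ w - d[1]) ** 2))
w_hat = gd_early_stopping(grad, mse, np.zeros(50), (X, y), (Xv, yv))
```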
Complete Dictionary Recovery over the Sphere II: Recovery by Riemannian Trust-region Method
We consider the problem of recovering a complete (i.e., square and
invertible) matrix $A_0$, from $Y = A_0 X_0$ with $Y \in \mathbb{R}^{n \times p}$,
provided $X_0$ is
sufficiently sparse. This recovery problem is central to theoretical
understanding of dictionary learning, which seeks a sparse representation for a
collection of input signals and finds numerous applications in modern signal
processing and machine learning. We give the first efficient algorithm that
provably recovers $A_0$ when $X_0$ has $O(n)$ nonzeros per
column, under a suitable probability model for $X_0$.
Our algorithmic pipeline centers around solving a certain nonconvex
optimization problem with a spherical constraint, and hence is naturally
phrased in the language of manifold optimization. In a companion paper
(arXiv:1511.03607), we showed that, with high probability, our nonconvex
formulation has no "spurious" local minimizers and that around any saddle point
the objective function has negative directional curvature. In this paper, we take
advantage of the particular geometric structure, and describe a Riemannian
trust region algorithm that provably converges to a local minimizer from
arbitrary initializations. Such minimizers give excellent approximations to
rows of $X_0$. The rows are then recovered by linear programming
rounding and deflation.
Comment: The second of two papers based on the report arXiv:1504.06785.
Accepted by IEEE Transactions on Information Theory; revised according to the
reviewers' comments.
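For a feel of the manifold setup (a generic first-order sketch, not the Riemannian trust-region method the paper actually analyzes; grad_f is a user-supplied Euclidean gradient): keep the iterate on the unit sphere by stepping along the projected gradient and retracting.

```python
import numpy as np

def sphere_gradient_descent(grad_f, q0, lr=0.05, iters=500):
    """Riemannian gradient descent over the unit sphere {q : ||q||_2 = 1}."""
    q = q0 / np.linalg.norm(q0)
    for _ in range(iters):
        g = grad_f(q)
        g_tan = g - (q @ g) * q        # project the gradient onto the tangent space at q
        q = q - lr * g_tan             # take a step in the tangent direction
        q = q / np.linalg.norm(q)      # retract back onto the sphere
    return q

# toy usage: minimizing f(q) = -q^T A q over the sphere recovers the top eigenvector of A
A = np.diag([3.0, 1.0, 0.5])
q_hat = sphere_gradient_descent(lambda q: -2.0 * A @ q, np.ones(3))
```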
Rank/Norm Regularization with Closed-Form Solutions: Application to Subspace Clustering
When data is sampled from an unknown subspace, principal component analysis
(PCA) provides an effective way to estimate the subspace and hence reduce the
dimension of the data. At the heart of PCA is the Eckart-Young-Mirsky theorem,
which characterizes the best rank k approximation of a matrix. In this paper,
we prove a generalization of the Eckart-Young-Mirsky theorem under all
unitarily invariant norms. Using this result, we obtain closed-form solutions
for a set of rank/norm regularized problems, and derive closed-form solutions
for a general class of subspace clustering problems (where data is modelled by
unions of unknown subspaces). From these results we obtain new theoretical
insights and promising experimental results.
Comment: 11 pages, 1 figure, appeared in UAI 2011. One footnote corrected and
appendix added.
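The Eckart-Young-Mirsky statement the paper generalizes can be checked in a few lines of NumPy (for the Frobenius/spectral norms; the paper extends the optimality to every unitarily invariant norm):

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation of A: truncate the SVD to the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

A = np.random.default_rng(0).normal(size=(30, 20))
A2 = best_rank_k(A, 2)
print(np.linalg.matrix_rank(A2), np.linalg.norm(A - A2))  # rank 2, minimal error over rank-2 matrices
```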
Log-Normal Matrix Completion for Large Scale Link Prediction
The ubiquitous proliferation of online social networks has led to the
widescale emergence of relational graphs expressing unique patterns in link
formation and descriptive user node features. Matrix Factorization and
Completion have become popular methods for Link Prediction due to the low rank
nature of mutual node friendship information, and the availability of parallel
computer architectures for rapid matrix processing. Current Link Prediction
literature has demonstrated vast performance improvement through the
utilization of sparsity in addition to the low rank matrix assumption. However,
the majority of research has introduced sparsity through the limited L1 or
Frobenius norms, instead of considering the more detailed distributions which
led to the graph formation and relationship evolution. In particular, social
networks have been found to express either Pareto or, as more recently discovered,
Log-Normal distributions. Employing the convexity-inducing Lovasz Extension, we
demonstrate how incorporating specific degree distribution information can lead
to large scale improvements in Matrix Completion based Link prediction. We
introduce Log-Normal Matrix Completion (LNMC), and solve the complex
optimization problem by employing Alternating Direction Method of Multipliers.
Using data from three popular social networks, our experiments yield up to 5%
AUC increase over top-performing non-structured sparsity based methods.
Comment: 6 pages.
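For context on the optimization involved, here is a plain nuclear-norm completion sketch via iterative singular-value shrinkage (the paper's LNMC objective additionally carries the Lovasz-extension degree-distribution term and is solved with ADMM; neither is reproduced here, and tau/step are illustrative):

```python
import numpy as np

def nuclear_norm_complete(M_obs, mask, tau=5.0, step=1.0, iters=300):
    """Proximal gradient for  min_X  tau*||X||_*  +  0.5*||mask*(X - M_obs)||_F^2."""
    X = np.zeros_like(M_obs, dtype=float)
    for _ in range(iters):
        G = X - step * mask * (X - M_obs)                # gradient step on observed entries
        U, s, Vt = np.linalg.svd(G, full_matrices=False)
        X = (U * np.maximum(s - step * tau, 0.0)) @ Vt   # soft-threshold the singular values
    return X
```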
Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?
Many modern learning tasks involve fitting nonlinear models to data which are
trained in an overparameterized regime where the parameters of the model exceed
the size of the training dataset. Due to this overparameterization, the
training loss may have infinitely many global minima and it is critical to
understand the properties of the solutions found by first-order optimization
schemes such as (stochastic) gradient descent starting from different
initializations. In this paper we demonstrate that when the loss has certain
properties over a minimally small neighborhood of the initial point, first
order methods such as (stochastic) gradient descent have a few intriguing
properties: (1) the iterates converge at a geometric rate to a global optimum
even when the loss is nonconvex, (2) among all global optima of the loss the
iterates converge to one with a near minimal distance to the initial point, (3)
the iterates take a near direct route from the initial point to this global
optimum. As part of our proof technique, we introduce a new potential function
which captures the precise tradeoff between the loss function and the distance
to the initial point as the iterations progress. For Stochastic Gradient
Descent (SGD), we develop novel martingale techniques that guarantee SGD never
leaves a small neighborhood of the initialization, even with rather large
learning rates. We demonstrate the utility of our general theory for a variety
of problem domains spanning low-rank matrix recovery to neural network
training. Underlying our analysis are novel insights that may have implications
for training and generalization of more sophisticated learning problems
including those involving deep neural network architectures.
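Property (2) is easy to see in the simplest overparameterized model, underdetermined least squares (not the paper's nonlinear setting, but it exhibits the same behavior): gradient descent converges to the global minimizer closest to the initialization, i.e., the projection of w0 onto the solution set.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                                   # fewer equations than unknowns
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
w0 = rng.normal(size=d)

w = w0.copy()
for _ in range(5000):
    w = w - 0.005 * A.T @ (A @ w - b)            # gradient descent on 0.5*||Aw - b||^2

w_near = w0 + np.linalg.pinv(A) @ (b - A @ w0)   # projection of w0 onto {w : Aw = b}
print(np.linalg.norm(w - w_near))                # ~0: GD picked the optimum nearest to w0
```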
Recursive Sampling for the Nystr\"om Method
We give the first algorithm for kernel Nystr\"om approximation that runs in
*linear time in the number of training points* and is provably accurate for all
kernel matrices, without dependence on regularity or incoherence conditions.
The algorithm projects the kernel onto a set of $s$ landmark points sampled by
their *ridge leverage scores*, requiring just $O(ns)$ kernel evaluations and
$O(ns^2)$ additional runtime. While leverage score sampling has long been known
to give strong theoretical guarantees for Nystr\"om approximation, by employing
a fast recursive sampling scheme, our algorithm is the first to make the
approach scalable. Empirically we show that it finds more accurate, lower rank
kernel approximations in less time than popular techniques such as uniformly
sampled Nystr\"om approximation and the random Fourier features method.Comment: To appear, NIPS 201
Finding a sparse vector in a subspace: Linear sparsity using alternating directions
Is it possible to find the sparsest vector (direction) in a generic subspace
$\mathcal{S} \subseteq \mathbb{R}^p$ with $\dim(\mathcal{S}) = n < p$?
This problem can be considered a homogeneous variant of the sparse recovery
problem, and finds connections to sparse dictionary learning, sparse PCA, and
many other problems in signal processing and machine learning. In this paper,
we focus on a **planted sparse model** for the subspace: the target sparse
vector is embedded in an otherwise random subspace. Simple convex heuristics
for this planted recovery problem provably break down when the fraction of
nonzero entries in the target sparse vector substantially exceeds
$O(1/\sqrt{n})$. In contrast, we exhibit a relatively simple nonconvex approach
based on alternating directions, which provably succeeds even when the fraction
of nonzero entries is $\Omega(1)$. To the best of our knowledge, this is the
first practical algorithm to achieve linear scaling under the planted sparse
model. Empirically, our proposed algorithm also succeeds in more challenging
data models, e.g., sparse dictionary learning.
Comment: Accepted by IEEE Trans. Information Theory. The paper has been
revised according to the reviewers' comments. The proofs have been streamlined.
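A compact sketch of the alternating-directions idea (a simplification, with an illustrative threshold; Q is assumed to be a p x n matrix with orthonormal columns spanning the subspace): alternately soft-threshold the candidate vector in the ambient space and renormalize its coefficient vector in the subspace.

```python
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_direction(Q, lam=None, iters=200, seed=0):
    """Alternating directions: returns a unit coefficient vector q such that
    Q @ q is (approximately) the sparsest direction in the column span of Q."""
    p, n = Q.shape
    lam = 1.0 / np.sqrt(p) if lam is None else lam
    rng = np.random.default_rng(seed)
    q = rng.normal(size=n)
    q /= np.linalg.norm(q)
    for _ in range(iters):
        x = soft_threshold(Q @ q, lam)     # promote sparsity in the ambient vector
        z = Q.T @ x                        # project back onto the subspace coordinates
        nz = np.linalg.norm(z)
        if nz == 0.0:                      # threshold killed everything; stop early
            break
        q = z / nz                         # renormalize the coefficient vector
    return q
```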