Iterative Least Trimmed Squares for Mixed Linear Regression
Given a linear regression setting, Iterative Least Trimmed Squares (ILTS)
involves alternating between (a) selecting the subset of samples with lowest
current loss, and (b) re-fitting the linear model only on that subset. Both
steps are very fast and simple. In this paper we analyze ILTS in the setting of
mixed linear regression with corruptions (MLR-C). We first establish
deterministic conditions (on the features etc.) under which the ILTS iterate
converges linearly to the closest mixture component. We also provide a global
algorithm that uses ILTS as a subroutine, to fully solve mixed linear
regressions with corruptions. We then evaluate it for the widely studied
setting of isotropic Gaussian features, and establish that we match or better
existing results in terms of sample complexity. Finally, we provide an ODE
analysis for a gradient-descent variant of ILTS that has optimal time
complexity.
Our results provide initial theoretical evidence that iteratively fitting to
the best subset of samples -- a potentially widely applicable idea -- can
provably provide state-of-the-art performance in bad-training-data settings.
Comment: Accepted by NeurIPS 2019.
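As a concrete illustration of the two alternating steps, here is a minimal
NumPy sketch of the trimming-and-refitting loop; the keep fraction, iteration
count, and least-squares initialization are illustrative assumptions rather
than the paper's exact procedure.

    import numpy as np

    def ilts(X, y, keep_frac=0.8, n_iters=50):
        """Iterative Least Trimmed Squares: alternate between (a) selecting
        the samples with lowest current loss and (b) re-fitting on them."""
        n = X.shape[0]
        k = int(keep_frac * n)                         # samples kept each round
        theta = np.linalg.lstsq(X, y, rcond=None)[0]   # plain least-squares start
        for _ in range(n_iters):
            residuals = (X @ theta - y) ** 2           # (a) current per-sample loss
            keep = np.argsort(residuals)[:k]           # lowest-loss subset
            theta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]  # (b) re-fit
        return theta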
Alternating Minimization Converges Super-Linearly for Mixed Linear Regression
We address the problem of solving mixed random linear equations. We have
unlabeled observations coming from multiple linear regressions, and each
observation corresponds to exactly one of the regression models. The goal is to
learn the linear regressors from the observations. Classically, Alternating
Minimization (AM) (which is a variant of Expectation Maximization (EM)) is used
to solve this problem. AM iteratively alternates between the estimation of
labels and solving the regression problems with the estimated labels.
Empirically, it is observed that, for a large variety of non-convex problems
including mixed linear regression, AM converges at a much faster rate compared
to gradient-based algorithms. However, the existing theory suggests a similar
rate of convergence for AM and gradient-based methods, failing to capture this
empirical behavior. In this paper, we close this gap between theory and
practice for the special case of a mixture of linear regressions. We show
that, provided it is initialized properly, AM enjoys a \emph{super-linear} rate of
convergence in certain parameter regimes. To the best of our knowledge, this is
the first work that theoretically establishes such a rate for AM. Hence, if we
want to recover the unknown regressors up to an error of $\epsilon$ (in $\ell_2$
norm), AM only takes $\mathcal{O}(\log \log (1/\epsilon))$ iterations.
Furthermore, we compare AM with a gradient-based heuristic algorithm
empirically and show that AM dominates in iteration complexity as well as
wall-clock time.
Comment: Accepted for publication at AISTATS 2020.
Recovery of sparse linear classifiers from mixture of responses
In the problem of learning a mixture of linear classifiers, the aim is to
learn a collection of hyperplanes from a sequence of binary responses. Each
response is the result of querying with a vector and indicates which side of a
randomly chosen hyperplane from the collection the query vector lies on.
This model provides a rich representation of heterogeneous data with
categorical labels and has only been studied in some special settings. We study
the hitherto unaddressed question of upper bounds on the query complexity of
recovering all the hyperplanes, especially for the case when the hyperplanes
are sparse.
This setting is a natural generalization of the extreme quantization problem
known as 1-bit compressed sensing. Suppose we have a set of $\ell$ unknown
$k$-sparse vectors. We can query the set with another vector $\mathbf{a}$,
to obtain the sign of the inner product of $\mathbf{a}$ and a randomly
chosen vector from the $\ell$-set. How many queries are sufficient to identify
all the $\ell$ unknown vectors? This question is significantly more challenging
than both the basic 1-bit compressed sensing problem (i.e., the $\ell = 1$ case) and
the analogous regression problem (where the value instead of the sign is
provided). We provide rigorous query complexity results (with efficient
algorithms) for this problem.
Comment: 31 pages, 2 figures (to appear at NeurIPS 2020).
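To make the query model concrete, the following sketch simulates the sign
oracle for a small set of sparse vectors; the dimensions, sparsity, and random
draws are illustrative assumptions, and this shows only the measurement model,
not the paper's recovery algorithms.

    import numpy as np

    def make_oracle(V, seed=None):
        """Query oracle: V stores the unknown sparse vectors as rows.  Each
        query a returns the sign of <a, v> for a uniformly random row v."""
        rng = np.random.default_rng(seed)
        def query(a):
            v = V[rng.integers(V.shape[0])]   # vector chosen afresh per query
            return int(np.sign(a @ v))
        return query

    # Example: two unknown 3-sparse vectors in 10 dimensions.
    rng = np.random.default_rng(0)
    V = np.zeros((2, 10))
    for row in V:
        support = rng.choice(10, size=3, replace=False)
        row[support] = rng.standard_normal(3)
    oracle = make_oracle(V, seed=1)
    responses = [oracle(rng.standard_normal(10)) for _ in range(5)]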
Learning Polynomials of Few Relevant Dimensions
Polynomial regression is a basic primitive in learning and statistics. In its
most basic form the goal is to fit a degree-$d$ polynomial to a response
variable $y$ in terms of an $n$-dimensional input vector $x$. This is extremely
well-studied with many applications and has sample and runtime complexity
$n^{O(d)}$. Can one achieve better runtime if the intrinsic dimension of the
data is much smaller than the ambient dimension $n$? Concretely, we are given
samples $(x, y)$ where $y$ is a polynomial of degree at most $d$ in an unknown
$r$-dimensional projection (the relevant dimensions) of $x$. This can be seen
both as a generalization of phase retrieval and as a special case of learning
multi-index models where the link function is an unknown low-degree polynomial.
Note that without distributional assumptions, this is at least as hard as junta
learning.
In this work we consider the important case where the covariates are
Gaussian. We give an algorithm that learns the polynomial within accuracy
$\epsilon$, with sample complexity that is roughly linear in $n$ (up to
polylogarithmic factors and constants depending on $r$, $d$, and $\epsilon$)
and runtime polynomial in $n$. Prior to our work, no such results were known
even for the case of $r = 1$. We introduce a new
filtered PCA approach to get a warm start for the true subspace and use
geodesic SGD to boost to arbitrary accuracy; our techniques may be of
independent interest, especially for problems dealing with subspace recovery or
analyzing SGD on manifolds.
Comment: 64 pages.
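A minimal simulation of the data model above (Gaussian covariates, response
given by a low-degree polynomial of an unknown low-dimensional projection);
the quadratic link and the dimensions are illustrative choices, and the
filtered PCA and geodesic SGD steps themselves are not implemented here.

    import numpy as np

    rng = np.random.default_rng(0)
    n, r, num_samples = 50, 2, 1000
    U, _ = np.linalg.qr(rng.standard_normal((n, r)))  # unknown relevant subspace
    X = rng.standard_normal((num_samples, n))         # Gaussian covariates
    Z = X @ U                                         # hidden r-dim projection
    y = (Z ** 2).sum(axis=1) + 0.5 * Z[:, 0]          # degree-2 polynomial link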
Extreme Multi-label Classification from Aggregated Labels
Extreme multi-label classification (XMC) is the problem of finding the
relevant labels for an input, from a very large universe of possible labels. We
consider XMC in the setting where labels are available only for groups of
samples, but not for individual ones. Current XMC approaches are not built for
such multi-instance multi-label (MIML) training data, and MIML approaches do
not scale to XMC sizes. We develop a new and scalable algorithm to impute
individual-sample labels from the group labels; this can be paired with any
existing XMC method to solve the aggregated label problem. We characterize the
statistical properties of our algorithm under mild assumptions, and provide a
new end-to-end framework for MIML as an extension. Experiments on both
aggregated-label XMC and MIML tasks show the advantages over existing
approaches.
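To make the aggregated-label setting concrete, here is a sketch of the data
format together with the most naive imputation baseline (copying every group
label to every member sample); this copy rule is purely illustrative and is
not the statistical imputation algorithm developed in the paper.

    from typing import Dict, List, Set

    def naive_impute(groups: Dict[str, List[int]],
                     group_labels: Dict[str, Set[int]]) -> Dict[int, Set[int]]:
        """Baseline: every sample inherits all labels of every group it
        belongs to.  The imputed labels could then feed any XMC trainer."""
        per_sample: Dict[int, Set[int]] = {}
        for gid, members in groups.items():
            for sample_id in members:
                per_sample.setdefault(sample_id, set()).update(group_labels[gid])
        return per_sample

    # Sample 1 appears in both groups, so it inherits both groups' label sets.
    groups = {"g1": [0, 1], "g2": [1, 2]}
    group_labels = {"g1": {10, 11}, "g2": {12}}
    imputed = naive_impute(groups, group_labels)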
Robust Meta-learning for Mixed Linear Regression with Small Batches
A common challenge faced in practical supervised learning, such as medical
image processing and robotic interactions, is that there are plenty of tasks
but each task cannot afford to collect enough labeled examples to be learned in
isolation. However, by exploiting the similarities across those tasks, one can
hope to overcome such data scarcity. Under a canonical scenario where each task
is drawn from a mixture of $k$ linear regressions, we study a fundamental
question: can abundant small-data tasks compensate for the lack of big-data
tasks? Existing second-moment-based approaches show that such a trade-off is
efficiently achievable, with the help of medium-sized tasks with $\Omega(k^{1/2})$
examples each. However, this algorithm is brittle in two
important scenarios. The predictions can be arbitrarily bad (i) even with only
a few outliers in the dataset; or (ii) even if the medium-sized tasks are
slightly smaller, with $o(k^{1/2})$ examples each. We introduce a spectral
approach that is simultaneously robust under both scenarios. To this end, we
first design a novel outlier-robust principal component analysis algorithm that
achieves an optimal accuracy. This is followed by a sum-of-squares algorithm to
exploit the information from higher order moments. Together, this approach is
robust against outliers and achieves a graceful statistical trade-off; the lack
of $\Omega(k^{1/2})$-size tasks can be compensated for with smaller tasks,
which can now be as small as $\mathcal{O}(\log k)$.
Comment: 52 pages, 2 figures.
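For intuition, here is a small simulation of the small-batch task model
together with the second-moment statistic that the non-robust baseline relies
on; the dimensions, batch size, and noise level are arbitrary illustrative
choices, and the paper's outlier-robust PCA and sum-of-squares components are
not shown.

    import numpy as np

    rng = np.random.default_rng(0)
    d, k, n_tasks, batch = 20, 3, 5000, 2      # tiny batches: 2 examples per task
    betas = rng.standard_normal((k, d))        # the k unknown regressors

    # Each task draws one regressor at random and reveals only `batch` examples.
    # For isotropic Gaussian x, E[y1*y2*x1*x2^T] = beta beta^T within a task, so
    # averaging the statistic across tasks estimates the span of the regressors.
    M = np.zeros((d, d))
    for _ in range(n_tasks):
        beta = betas[rng.integers(k)]
        X = rng.standard_normal((batch, d))
        y = X @ beta + 0.01 * rng.standard_normal(batch)
        M += y[0] * y[1] * np.outer(X[0], X[1])
    M = (M + M.T) / (2 * n_tasks)              # symmetrize and average
    eigvals, eigvecs = np.linalg.eigh(M)
    subspace = eigvecs[:, -k:]                 # estimated span of the regressors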