6 research outputs found

    Iterative Least Trimmed Squares for Mixed Linear Regression

    Given a linear regression setting, Iterative Least Trimmed Squares (ILTS) involves alternating between (a) selecting the subset of samples with the lowest current loss, and (b) re-fitting the linear model only on that subset. Both steps are very fast and simple. In this paper we analyze ILTS in the setting of mixed linear regression with corruptions (MLR-C). We first establish deterministic conditions (on the features, etc.) under which the ILTS iterate converges linearly to the closest mixture component. We also provide a global algorithm that uses ILTS as a subroutine to fully solve mixed linear regressions with corruptions. We then evaluate it in the widely studied setting of isotropic Gaussian features, and establish that we match or improve upon existing results in terms of sample complexity. Finally, we provide an ODE analysis for a gradient-descent variant of ILTS that has optimal time complexity. Our results provide initial theoretical evidence that iteratively fitting to the best subset of samples -- a potentially widely applicable idea -- can provably provide state-of-the-art performance in settings with bad training data. Comment: Accepted by NeurIPS 2019.
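
    The two ILTS steps described in the abstract translate directly into a short iteration. The following is a minimal NumPy sketch for plain robust linear regression; the MLR-C analysis and the global algorithm built on top of ILTS are not reproduced here, and the retained fraction keep_frac and iteration count n_iters are assumed tuning parameters.

        import numpy as np

        def ilts(X, y, keep_frac=0.8, n_iters=50):
            """Iterative Least Trimmed Squares: alternate between (a) keeping the
            samples with the lowest current squared residuals and (b) re-fitting
            ordinary least squares on that subset only."""
            n = X.shape[0]
            k = int(keep_frac * n)                         # number of samples to keep
            theta, *_ = np.linalg.lstsq(X, y, rcond=None)  # initial fit on all samples
            for _ in range(n_iters):
                residuals = (y - X @ theta) ** 2           # (a) current per-sample loss
                subset = np.argsort(residuals)[:k]         # indices of the k smallest losses
                theta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)  # (b) re-fit
            return theta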

    Alternating Minimization Converges Super-Linearly for Mixed Linear Regression

    We address the problem of solving mixed random linear equations. We have unlabeled observations coming from multiple linear regressions, and each observation corresponds to exactly one of the regression models. The goal is to learn the linear regressors from the observations. Classically, Alternating Minimization (AM), which is a variant of Expectation Maximization (EM), is used to solve this problem. AM iteratively alternates between the estimation of labels and solving the regression problems with the estimated labels. Empirically, it is observed that, for a large variety of non-convex problems including mixed linear regression, AM converges at a much faster rate than gradient-based algorithms. However, the existing theory suggests similar rates of convergence for AM and gradient-based methods, failing to capture this empirical behavior. In this paper, we close this gap between theory and practice for the special case of a mixture of 2 linear regressions. We show that, provided it is initialized properly, AM enjoys a super-linear rate of convergence in certain parameter regimes. To the best of our knowledge, this is the first work that theoretically establishes such a rate for AM. Hence, if we want to recover the unknown regressors up to an error (in $\ell_2$ norm) of $\epsilon$, AM only takes $\mathcal{O}(\log \log (1/\epsilon))$ iterations. Furthermore, we compare AM with a gradient-based heuristic algorithm empirically and show that AM dominates in iteration complexity as well as wall-clock time. Comment: Accepted for publication at AISTATS 2020.
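
    As a concrete illustration of the AM iteration analyzed here, below is a minimal NumPy sketch for a mixture of 2 linear regressions: the label step assigns each sample to the regressor with the smaller residual, and the regression step re-solves least squares on each group. The initialization (assumed given as theta1, theta2) and the parameter regimes under which super-linear convergence holds are as in the paper, not in this sketch.

        import numpy as np

        def am_mixed_regression(X, y, theta1, theta2, n_iters=20):
            """Alternating Minimization for a mixture of 2 linear regressions:
            alternate between estimating labels and re-fitting each regressor."""
            for _ in range(n_iters):
                # Label step: assign each sample to the component with smaller residual.
                r1 = (y - X @ theta1) ** 2
                r2 = (y - X @ theta2) ** 2
                use_first = r1 <= r2
                # Regression step: least squares separately on each estimated group.
                if use_first.any():
                    theta1, *_ = np.linalg.lstsq(X[use_first], y[use_first], rcond=None)
                if (~use_first).any():
                    theta2, *_ = np.linalg.lstsq(X[~use_first], y[~use_first], rcond=None)
            return theta1, theta2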

    Recovery of sparse linear classifiers from mixture of responses

    In the problem of learning a mixture of linear classifiers, the aim is to learn a collection of hyperplanes from a sequence of binary responses. Each response is the result of querying with a vector and indicates which side of a randomly chosen hyperplane from the collection the query vector lies on. This model provides a rich representation of heterogeneous data with categorical labels and has only been studied in some special settings. We look at the hitherto unstudied problem of upper-bounding the query complexity of recovering all the hyperplanes, especially in the case when the hyperplanes are sparse. This setting is a natural generalization of the extreme quantization problem known as 1-bit compressed sensing. Suppose we have a set of $\ell$ unknown $k$-sparse vectors. We can query the set with another vector $\boldsymbol{a}$ to obtain the sign of the inner product of $\boldsymbol{a}$ and a randomly chosen vector from the $\ell$-set. How many queries are sufficient to identify all the $\ell$ unknown vectors? This question is significantly more challenging than both the basic 1-bit compressed sensing problem (i.e., the $\ell=1$ case) and the analogous regression problem (where the value instead of the sign is provided). We provide rigorous query complexity results (with efficient algorithms) for this problem. Comment: 31 pages, 2 figures (to appear at NeurIPS 2020).
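
    The query model can be simulated in a few lines. The sketch below implements only the problem setup from the abstract -- the sign of the inner product of a query $\boldsymbol{a}$ with a uniformly chosen vector from the $\ell$-set -- not the recovery algorithms from the paper; the Gaussian supports, random seed, and toy dimensions are arbitrary choices for illustration.

        import numpy as np

        rng = np.random.default_rng(0)

        def make_sparse_vectors(ell, n, k):
            """Generate ell unknown k-sparse vectors in dimension n."""
            V = np.zeros((ell, n))
            for i in range(ell):
                support = rng.choice(n, size=k, replace=False)
                V[i, support] = rng.standard_normal(k)
            return V

        def query(V, a):
            """Oracle: return the sign of <a, v> for a uniformly chosen row v of V."""
            v = V[rng.integers(V.shape[0])]
            return np.sign(a @ v)

        # Example: ell = 3 unknown 5-sparse vectors in dimension 100, one random query.
        V = make_sparse_vectors(3, 100, 5)
        print(query(V, rng.standard_normal(100)))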

    Learning Polynomials of Few Relevant Dimensions

    Polynomial regression is a basic primitive in learning and statistics. In its most basic form the goal is to fit a degree-$d$ polynomial to a response variable $y$ in terms of an $n$-dimensional input vector $x$. This is extremely well studied, with many applications, and has sample and runtime complexity $\Theta(n^d)$. Can one achieve better runtime if the intrinsic dimension of the data is much smaller than the ambient dimension $n$? Concretely, we are given samples $(x,y)$ where $y$ is a polynomial of degree at most $d$ in an unknown $r$-dimensional projection (the relevant dimensions) of $x$. This can be seen both as a generalization of phase retrieval and as a special case of learning multi-index models where the link function is an unknown low-degree polynomial. Note that without distributional assumptions, this is at least as hard as junta learning. In this work we consider the important case where the covariates are Gaussian. We give an algorithm that learns the polynomial within accuracy $\epsilon$ with sample complexity that is roughly $N = O_{r,d}(n \log^2(1/\epsilon) (\log n)^d)$ and runtime $O_{r,d}(N n^2)$. Prior to our work, no such results were known even for the case of $r=1$. We introduce a new filtered PCA approach to get a warm start for the true subspace and use geodesic SGD to boost to arbitrary accuracy; our techniques may be of independent interest, especially for problems dealing with subspace recovery or analyzing SGD on manifolds. Comment: 64 pages.
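
    The generative model being learned is easy to simulate. The sketch below draws Gaussian covariates and a response that is a low-degree polynomial of an unknown $r$-dimensional projection; the filtered-PCA warm start and geodesic SGD from the paper are not reproduced, and the specific degree-3 polynomial and dimensions are arbitrary choices for illustration.

        import numpy as np

        rng = np.random.default_rng(0)

        def sample_low_dim_polynomial(N, n=50, r=2):
            """Draw (x, y) pairs where x is Gaussian in R^n and y is a low-degree
            polynomial of an unknown r-dimensional projection U x."""
            U = np.linalg.qr(rng.standard_normal((n, r)))[0].T   # unknown r x n projection with orthonormal rows
            X = rng.standard_normal((N, n))                      # Gaussian covariates
            Z = X @ U.T                                          # relevant directions, shape (N, r)
            y = Z[:, 0] ** 3 - 2.0 * Z[:, 0] * Z[:, 1] + Z[:, 1] ** 2   # an example degree-3 link
            return X, y, U

        X, y, U = sample_low_dim_polynomial(N=1000)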

    Extreme Multi-label Classification from Aggregated Labels

    Extreme multi-label classification (XMC) is the problem of finding the relevant labels for an input from a very large universe of possible labels. We consider XMC in the setting where labels are available only for groups of samples, but not for individual ones. Current XMC approaches are not built for such multi-instance multi-label (MIML) training data, and MIML approaches do not scale to XMC sizes. We develop a new and scalable algorithm to impute individual-sample labels from the group labels; this can be paired with any existing XMC method to solve the aggregated label problem. We characterize the statistical properties of our algorithm under mild assumptions, and provide a new end-to-end framework for MIML as an extension. Experiments on both aggregated label XMC and MIML tasks show the advantages over existing approaches.
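
    To make the aggregated-label setting concrete, the toy sketch below builds training data in which labels are observed only as the union over a group of samples. The imputation step shown, which simply spreads each group's labels to every sample in the group, is a naive placeholder baseline for illustration and is not the imputation algorithm developed in the paper.

        def aggregate_labels(sample_labels, groups):
            """Collapse per-sample label sets into per-group label sets (the observed data)."""
            return [set().union(*(sample_labels[i] for i in g)) for g in groups]

        def naive_impute(group_labels, groups, n_samples):
            """Naive baseline: give every sample in a group all of that group's labels.
            (Placeholder for a real imputation method.)"""
            imputed = [set() for _ in range(n_samples)]
            for labels, g in zip(group_labels, groups):
                for i in g:
                    imputed[i] |= labels
            return imputed

        # Toy example: 4 samples in 2 groups, labels drawn from a large label universe.
        sample_labels = [{3}, {7, 11}, {42}, {7}]
        groups = [[0, 1], [2, 3]]
        observed = aggregate_labels(sample_labels, groups)   # [{3, 7, 11}, {7, 42}]
        print(naive_impute(observed, groups, n_samples=4))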

    Robust Meta-learning for Mixed Linear Regression with Small Batches

    A common challenge in practical supervised learning, such as medical image processing and robotic interactions, is that there are plenty of tasks but each task cannot afford to collect enough labeled examples to be learned in isolation. However, by exploiting the similarities across those tasks, one can hope to overcome such data scarcity. Under a canonical scenario where each task is drawn from a mixture of $k$ linear regressions, we study a fundamental question: can abundant small-data tasks compensate for the lack of big-data tasks? Existing second-moment-based approaches show that such a trade-off is efficiently achievable with the help of medium-sized tasks with $\Omega(k^{1/2})$ examples each. However, this algorithm is brittle in two important scenarios. The predictions can be arbitrarily bad (i) even with only a few outliers in the dataset, or (ii) even if the medium-sized tasks are slightly smaller, with $o(k^{1/2})$ examples each. We introduce a spectral approach that is simultaneously robust under both scenarios. To this end, we first design a novel outlier-robust principal component analysis algorithm that achieves optimal accuracy. This is followed by a sum-of-squares algorithm to exploit the information from higher-order moments. Together, this approach is robust against outliers and achieves a graceful statistical trade-off; the lack of $\Omega(k^{1/2})$-size tasks can be compensated for with smaller tasks, which can now be as small as $O(\log k)$. Comment: 52 pages, 2 figures.