6 research outputs found

    Iterative Least Trimmed Squares for Mixed Linear Regression

    Given a linear regression setting, Iterative Least Trimmed Squares (ILTS) involves alternating between (a) selecting the subset of samples with the lowest current loss, and (b) re-fitting the linear model only on that subset. Both steps are very fast and simple. In this paper we analyze ILTS in the setting of mixed linear regression with corruptions (MLR-C). We first establish deterministic conditions (on the features, etc.) under which the ILTS iterate converges linearly to the closest mixture component. We also provide a global algorithm that uses ILTS as a subroutine to fully solve mixed linear regressions with corruptions. We then evaluate it in the widely studied setting of isotropic Gaussian features, and establish that we match or improve upon existing results in terms of sample complexity. Finally, we provide an ODE analysis for a gradient-descent variant of ILTS that has optimal time complexity. Our results provide initial theoretical evidence that iteratively fitting to the best subset of samples -- a potentially widely applicable idea -- can provably provide state-of-the-art performance in settings with bad training data. Comment: Accepted by NeurIPS 2019.
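
    The two ILTS steps described in the abstract translate directly into a short iteration. The following is a minimal NumPy sketch for plain robust linear regression; the MLR-C analysis and the global algorithm built on top of ILTS are not reproduced here, and the retained fraction keep_frac and iteration count n_iters are assumed tuning parameters.

        import numpy as np

        def ilts(X, y, keep_frac=0.8, n_iters=50):
            """Iterative Least Trimmed Squares: alternate between (a) keeping the
            samples with the lowest current squared residuals and (b) re-fitting
            ordinary least squares on that subset only."""
            n = X.shape[0]
            k = int(keep_frac * n)                         # number of samples to keep
            theta, *_ = np.linalg.lstsq(X, y, rcond=None)  # initial fit on all samples
            for _ in range(n_iters):
                residuals = (y - X @ theta) ** 2           # (a) current per-sample loss
                subset = np.argsort(residuals)[:k]         # indices of the k smallest losses
                theta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)  # (b) re-fit
            return theta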

    Alternating Minimization Converges Super-Linearly for Mixed Linear Regression

    We address the problem of solving mixed random linear equations. We have unlabeled observations coming from multiple linear regressions, and each observation corresponds to exactly one of the regression models. The goal is to learn the linear regressors from the observations. Classically, Alternating Minimization (AM), which is a variant of Expectation Maximization (EM), is used to solve this problem. AM iteratively alternates between the estimation of labels and solving the regression problems with the estimated labels. Empirically, it is observed that, for a large variety of non-convex problems including mixed linear regression, AM converges at a much faster rate than gradient-based algorithms. However, the existing theory suggests similar rates of convergence for AM and gradient-based methods, failing to capture this empirical behavior. In this paper, we close this gap between theory and practice for the special case of a mixture of 2 linear regressions. We show that, provided it is initialized properly, AM enjoys a super-linear rate of convergence in certain parameter regimes. To the best of our knowledge, this is the first work that theoretically establishes such a rate for AM. Hence, if we want to recover the unknown regressors up to an error (in $\ell_2$ norm) of $\epsilon$, AM only takes $\mathcal{O}(\log \log (1/\epsilon))$ iterations. Furthermore, we compare AM with a gradient-based heuristic algorithm empirically and show that AM dominates in iteration complexity as well as wall-clock time. Comment: Accepted for publication at AISTATS 2020.
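
    As a concrete illustration of the AM iteration analyzed here, below is a minimal NumPy sketch for a mixture of 2 linear regressions: the label step assigns each sample to the regressor with the smaller residual, and the regression step re-solves least squares on each group. The initialization (assumed given as theta1, theta2) and the parameter regimes under which super-linear convergence holds are as in the paper, not in this sketch.

        import numpy as np

        def am_mixed_regression(X, y, theta1, theta2, n_iters=20):
            """Alternating Minimization for a mixture of 2 linear regressions:
            alternate between estimating labels and re-fitting each regressor."""
            for _ in range(n_iters):
                # Label step: assign each sample to the component with smaller residual.
                r1 = (y - X @ theta1) ** 2
                r2 = (y - X @ theta2) ** 2
                use_first = r1 <= r2
                # Regression step: least squares separately on each estimated group.
                if use_first.any():
                    theta1, *_ = np.linalg.lstsq(X[use_first], y[use_first], rcond=None)
                if (~use_first).any():
                    theta2, *_ = np.linalg.lstsq(X[~use_first], y[~use_first], rcond=None)
            return theta1, theta2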

    Recovery of sparse linear classifiers from mixture of responses

    In the problem of learning a mixture of linear classifiers, the aim is to learn a collection of hyperplanes from a sequence of binary responses. Each response is the result of querying with a vector and indicates which side of a randomly chosen hyperplane from the collection the query vector lies on. This model provides a rich representation of heterogeneous data with categorical labels and has only been studied in some special settings. We look at the hitherto unstudied problem of upper-bounding the query complexity of recovering all the hyperplanes, especially in the case when the hyperplanes are sparse. This setting is a natural generalization of the extreme quantization problem known as 1-bit compressed sensing. Suppose we have a set of $\ell$ unknown $k$-sparse vectors. We can query the set with another vector $\boldsymbol{a}$ to obtain the sign of the inner product of $\boldsymbol{a}$ and a randomly chosen vector from the $\ell$-set. How many queries are sufficient to identify all the $\ell$ unknown vectors? This question is significantly more challenging than both the basic 1-bit compressed sensing problem (i.e., the $\ell=1$ case) and the analogous regression problem (where the value instead of the sign is provided). We provide rigorous query complexity results (with efficient algorithms) for this problem. Comment: 31 pages, 2 figures (to appear at NeurIPS 2020).
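
    The query model can be simulated in a few lines. The sketch below implements only the problem setup from the abstract -- the sign of the inner product of a query $\boldsymbol{a}$ with a uniformly chosen vector from the $\ell$-set -- not the recovery algorithms from the paper; the Gaussian supports, random seed, and toy dimensions are arbitrary choices for illustration.

        import numpy as np

        rng = np.random.default_rng(0)

        def make_sparse_vectors(ell, n, k):
            """Generate ell unknown k-sparse vectors in dimension n."""
            V = np.zeros((ell, n))
            for i in range(ell):
                support = rng.choice(n, size=k, replace=False)
                V[i, support] = rng.standard_normal(k)
            return V

        def query(V, a):
            """Oracle: return the sign of <a, v> for a uniformly chosen row v of V."""
            v = V[rng.integers(V.shape[0])]
            return np.sign(a @ v)

        # Example: ell = 3 unknown 5-sparse vectors in dimension 100, one random query.
        V = make_sparse_vectors(3, 100, 5)
        print(query(V, rng.standard_normal(100)))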

    Learning Polynomials of Few Relevant Dimensions

    Polynomial regression is a basic primitive in learning and statistics. In its most basic form the goal is to fit a degree-$d$ polynomial to a response variable $y$ in terms of an $n$-dimensional input vector $x$. This is extremely well studied, with many applications, and has sample and runtime complexity $\Theta(n^d)$. Can one achieve better runtime if the intrinsic dimension of the data is much smaller than the ambient dimension $n$? Concretely, we are given samples $(x,y)$ where $y$ is a polynomial of degree at most $d$ in an unknown $r$-dimensional projection (the relevant dimensions) of $x$. This can be seen both as a generalization of phase retrieval and as a special case of learning multi-index models where the link function is an unknown low-degree polynomial. Note that without distributional assumptions, this is at least as hard as junta learning. In this work we consider the important case where the covariates are Gaussian. We give an algorithm that learns the polynomial within accuracy $\epsilon$ with sample complexity that is roughly $N = O_{r,d}(n \log^2(1/\epsilon) (\log n)^d)$ and runtime $O_{r,d}(N n^2)$. Prior to our work, no such results were known even for the case of $r=1$. We introduce a new filtered PCA approach to get a warm start for the true subspace and use geodesic SGD to boost to arbitrary accuracy; our techniques may be of independent interest, especially for problems dealing with subspace recovery or analyzing SGD on manifolds. Comment: 64 pages.
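
    The generative model being learned is easy to simulate. The sketch below draws Gaussian covariates and a response that is a low-degree polynomial of an unknown $r$-dimensional projection; the filtered-PCA warm start and geodesic SGD from the paper are not reproduced, and the specific degree-3 polynomial and dimensions are arbitrary choices for illustration.

        import numpy as np

        rng = np.random.default_rng(0)

        def sample_low_dim_polynomial(N, n=50, r=2):
            """Draw (x, y) pairs where x is Gaussian in R^n and y is a low-degree
            polynomial of an unknown r-dimensional projection U x."""
            U = np.linalg.qr(rng.standard_normal((n, r)))[0].T   # unknown r x n projection with orthonormal rows
            X = rng.standard_normal((N, n))                      # Gaussian covariates
            Z = X @ U.T                                          # relevant directions, shape (N, r)
            y = Z[:, 0] ** 3 - 2.0 * Z[:, 0] * Z[:, 1] + Z[:, 1] ** 2   # an example degree-3 link
            return X, y, U

        X, y, U = sample_low_dim_polynomial(N=1000)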

    Extreme Multi-label Classification from Aggregated Labels

    Extreme multi-label classification (XMC) is the problem of finding the relevant labels for an input from a very large universe of possible labels. We consider XMC in the setting where labels are available only for groups of samples, but not for individual ones. Current XMC approaches are not built for such multi-instance multi-label (MIML) training data, and MIML approaches do not scale to XMC sizes. We develop a new and scalable algorithm to impute individual-sample labels from the group labels; this can be paired with any existing XMC method to solve the aggregated label problem. We characterize the statistical properties of our algorithm under mild assumptions, and provide a new end-to-end framework for MIML as an extension. Experiments on both aggregated label XMC and MIML tasks show the advantages over existing approaches.
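
    To make the aggregated-label setting concrete, the toy sketch below builds training data in which labels are observed only as the union over a group of samples. The imputation step shown, which simply spreads each group's labels to every sample in the group, is a naive placeholder baseline for illustration and is not the imputation algorithm developed in the paper.

        def aggregate_labels(sample_labels, groups):
            """Collapse per-sample label sets into per-group label sets (the observed data)."""
            return [set().union(*(sample_labels[i] for i in g)) for g in groups]

        def naive_impute(group_labels, groups, n_samples):
            """Naive baseline: give every sample in a group all of that group's labels.
            (Placeholder for a real imputation method.)"""
            imputed = [set() for _ in range(n_samples)]
            for labels, g in zip(group_labels, groups):
                for i in g:
                    imputed[i] |= labels
            return imputed

        # Toy example: 4 samples in 2 groups, labels drawn from a large label universe.
        sample_labels = [{3}, {7, 11}, {42}, {7}]
        groups = [[0, 1], [2, 3]]
        observed = aggregate_labels(sample_labels, groups)   # [{3, 7, 11}, {7, 42}]
        print(naive_impute(observed, groups, n_samples=4))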

    Robust Meta-learning for Mixed Linear Regression with Small Batches

    A common challenge in practical supervised learning, such as medical image processing and robotic interactions, is that there are plenty of tasks but each task cannot afford to collect enough labeled examples to be learned in isolation. However, by exploiting the similarities across those tasks, one can hope to overcome such data scarcity. Under a canonical scenario where each task is drawn from a mixture of $k$ linear regressions, we study a fundamental question: can abundant small-data tasks compensate for the lack of big-data tasks? Existing second-moment-based approaches show that such a trade-off is efficiently achievable with the help of medium-sized tasks with $\Omega(k^{1/2})$ examples each. However, this algorithm is brittle in two important scenarios. The predictions can be arbitrarily bad (i) even with only a few outliers in the dataset, or (ii) even if the medium-sized tasks are slightly smaller, with $o(k^{1/2})$ examples each. We introduce a spectral approach that is simultaneously robust under both scenarios. To this end, we first design a novel outlier-robust principal component analysis algorithm that achieves optimal accuracy. This is followed by a sum-of-squares algorithm to exploit the information from higher-order moments. Together, this approach is robust against outliers and achieves a graceful statistical trade-off; the lack of $\Omega(k^{1/2})$-size tasks can be compensated for with smaller tasks, which can now be as small as $O(\log k)$. Comment: 52 pages, 2 figures.