17 research outputs found

    Convergence of Parameter Estimates for Regularized Mixed Linear Regression Models

    We consider Mixed Linear Regression (MLR), where training data have been generated from a mixture of distinct linear models (or clusters) and we seek to identify the corresponding coefficient vectors. We introduce a Mixed Integer Programming (MIP) formulation for MLR subject to regularization constraints on the coefficient vectors. We establish that as the number of training samples grows large, the MIP solution converges to the true coefficient vectors in the absence of noise. Subject to slightly stronger assumptions, we also establish that the MIP identifies the clusters from which the training samples were generated. In the special case where training data come from a single cluster, we establish that the corresponding MIP yields a solution that converges to the true coefficient vector even when training data are perturbed by (martingale difference) noise. We provide a counterexample indicating that in the presence of noise, the MIP may fail to produce the true coefficient vectors for more than one cluster. We also provide numerical results testing the MIP solutions in synthetic examples with noise.

    Learning Mixtures of Linear Regressions with Nearly Optimal Complexity

    Mixtures of Linear Regressions (MLR) is an important mixture model with many applications. In this model, each observation is generated from one of several unknown linear regression components, where the identity of the generating component is also unknown. Previous works either make strong assumptions on the data distribution or have high complexity. This paper proposes a fixed-parameter tractable algorithm for the problem under general conditions, which achieves global convergence with a sample complexity that scales nearly linearly in the dimension. In particular, unlike previous works that require the data to be standard Gaussian, the algorithm allows data from Gaussians with different covariances. When the condition number of the covariances and the number of components are fixed, the algorithm has nearly optimal sample complexity $N = \tilde{O}(d)$ as well as nearly optimal computational complexity $\tilde{O}(Nd)$, where $d$ is the dimension of the data space. To the best of our knowledge, this approach provides the first such recovery guarantee for this general setting. Comment: Fix some typesetting issue in v

    Estimating the Coefficients of a Mixture of Two Linear Regressions by Expectation Maximization

    We give convergence guarantees for estimating the coefficients of a symmetric mixture of two linear regressions by expectation maximization (EM). In particular, we show that the empirical EM iterates converge to the target parameter vector at the parametric rate, provided the algorithm is initialized in an unbounded cone. Specifically, if the initial guess has a sufficiently large cosine angle with the target parameter vector, a sample-splitting version of the EM algorithm converges to the true coefficient vector with high probability. Interestingly, our analysis borrows from tools used in the problem of estimating the centers of a symmetric mixture of two Gaussians by EM. We also show that the population EM operator for mixtures of two regressions is anti-contractive from the target parameter vector if the cosine angle between the input vector and the target parameter vector is too small, thereby establishing the necessity of our conic condition. Finally, we give empirical evidence supporting this theoretical observation, which suggests that the sample-based EM algorithm performs poorly when initial guesses are drawn accordingly. Our simulation study also suggests that the EM algorithm performs well even under model misspecification (i.e., when the covariate and error distributions violate the model assumptions).
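
    As a rough illustration of the EM iteration this abstract refers to, the sketch below runs the standard EM update for the symmetric model y = r<x, beta*> + noise with r uniform on {-1, +1}. It is not the paper's code: the function names (simulate, em_step, run_em), the restriction to two-dimensional covariates, the known noise level sigma, and the reuse of all samples in every iteration (rather than sample splitting) are assumptions made for brevity.

```python
import math
import random


def simulate(n, beta_star, sigma, seed=0):
    """Draw n samples from y = r * <x, beta_star> + N(0, sigma^2), r in {-1, +1}."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        r = rng.choice((-1.0, 1.0))
        signal = x[0] * beta_star[0] + x[1] * beta_star[1]
        xs.append(x)
        ys.append(r * signal + rng.gauss(0.0, sigma))
    return xs, ys


def em_step(xs, ys, beta, sigma):
    """One EM update (assumed setting: d = 2, known sigma)."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (x1, x2), y in zip(xs, ys):
        # E-step: tanh(y <x, beta> / sigma^2) equals 2 * P(r = +1 | x, y) - 1.
        w = math.tanh(y * (x1 * beta[0] + x2 * beta[1]) / sigma ** 2)
        a11 += x1 * x1
        a12 += x1 * x2
        a22 += x2 * x2
        b1 += w * y * x1
        b2 += w * y * x2
    # M-step: solve (sum x x^T) beta' = sum w y x with Cramer's rule.
    det = a11 * a22 - a12 * a12
    return [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]


def run_em(xs, ys, beta0, sigma, iters=50):
    """Iterate em_step from the initial guess beta0."""
    beta = list(beta0)
    for _ in range(iters):
        beta = em_step(xs, ys, beta, sigma)
    return beta
```

    Starting run_em from a guess that has a large cosine angle with beta* (for example [0.5, -0.5] when beta* = [1.0, -2.0]) is the regime the conic condition describes.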

    Learning with Bad Training Data via Iterative Trimmed Loss Minimization

    In this paper, we study a simple and generic framework to tackle the problem of learning model parameters when a fraction of the training samples are corrupted. We first make a simple observation: in a variety of such settings, the evolution of training accuracy (as a function of training epochs) is different for clean and bad samples. Based on this we propose to iteratively minimize the trimmed loss, by alternating between (a) selecting samples with lowest current loss, and (b) retraining a model on only these samples. We prove that this process recovers the ground truth (with linear convergence rate) in generalized linear models with standard statistical assumptions. Experimentally, we demonstrate its effectiveness in three settings: (a) deep image classifiers with errors only in labels, (b) generative adversarial networks with bad training images, and (c) deep image classifiers with adversarial (image, label) pairs (i.e., backdoor attacks). For the well-studied setting of random label noise, our algorithm achieves state-of-the-art performance without having access to any a priori guaranteed clean samples.
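
    The alternation between steps (a) and (b) can be sketched on a toy problem. The example below is an assumption-laden illustration, not the paper's implementation: it uses one-dimensional least squares through the origin in place of a generalized linear model or deep network, and the names make_data, fit_slope, and itlm, the true slope 2.0, and the uniform corruption model are all hypothetical.

```python
import random


def make_data(n, bad_frac, seed=0):
    """Clean samples follow y = 2x + small noise; bad samples get arbitrary labels."""
    rng = random.Random(seed)
    points = []
    for i in range(n):
        x = rng.gauss(0.0, 1.0)
        if i < int(bad_frac * n):
            y = rng.uniform(-10.0, 10.0)         # corrupted label (assumed model)
        else:
            y = 2.0 * x + rng.gauss(0.0, 0.1)    # clean label
        points.append((x, y))
    return points


def fit_slope(points):
    """Least-squares slope of a line through the origin."""
    return sum(x * y for x, y in points) / sum(x * x for x, y in points)


def itlm(points, keep_frac, rounds=10):
    """Iterative trimmed loss minimization on the toy regression problem."""
    theta = fit_slope(points)                    # initial fit on all samples
    k = int(keep_frac * len(points))
    for _ in range(rounds):
        # (a) keep the k samples with the lowest current squared loss
        kept = sorted(points, key=lambda p: (p[1] - theta * p[0]) ** 2)[:k]
        # (b) refit the model on only those samples
        theta = fit_slope(kept)
    return theta
```

    The keep fraction should not exceed the (assumed known) fraction of clean samples; here 0.7 is used against 80% clean data.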

    List-Decodable Linear Regression

    We give the first polynomial-time algorithm for robust regression in the list-decodable setting where an adversary can corrupt a greater than $1/2$ fraction of examples. For any $\alpha < 1$, our algorithm takes as input a sample $\{(x_i,y_i)\}_{i \leq n}$ of $n$ linear equations where $\alpha n$ of the equations satisfy $y_i = \langle x_i,\ell^*\rangle + \zeta$ for some small noise $\zeta$ and $(1-\alpha)n$ of the equations are arbitrarily chosen. It outputs a list $L$ of size $O(1/\alpha)$ (a fixed constant) that contains an $\ell$ that is close to $\ell^*$. Our algorithm succeeds whenever the inliers are chosen from a certifiably anti-concentrated distribution $D$. In particular, this gives a $(d/\alpha)^{O(1/\alpha^8)}$ time algorithm to find an $O(1/\alpha)$ size list when the inlier distribution is standard Gaussian. For discrete product distributions that are anti-concentrated only in regular directions, we give an algorithm that achieves a similar guarantee under the promise that $\ell^*$ has all coordinates of the same magnitude. To complement our result, we prove that the anti-concentration assumption on the inliers is information-theoretically necessary. Our algorithm is based on a new framework for list-decodable learning that strengthens the 'identifiability to algorithms' paradigm based on the sum-of-squares method. In an independent and concurrent work, Raghavendra and Yau also used the Sum-of-Squares method to give a similar result for list-decodable regression. Comment: 28 Page

    Subspace Embedding and Linear Regression with Orlicz Norm

    We consider a generalization of the classic linear regression problem to the case when the loss is an Orlicz norm. An Orlicz norm is parameterized by a non-negative convex function $G:\mathbb{R}_+\rightarrow\mathbb{R}_+$ with $G(0)=0$: the Orlicz norm of a vector $x\in\mathbb{R}^n$ is defined as $\|x\|_G=\inf\left\{\alpha>0 \mid \sum_{i=1}^n G(|x_i|/\alpha)\leq 1\right\}$. We consider the cases where the function $G(\cdot)$ grows subquadratically. Our main result is based on a new oblivious embedding which embeds the column space of a given matrix $A\in\mathbb{R}^{n\times d}$ with Orlicz norm into a lower dimensional space with $\ell_2$ norm. Specifically, we show how to efficiently find an embedding matrix $S\in\mathbb{R}^{m\times n}$, $m<n$, such that $\forall x\in\mathbb{R}^{d}$, $\Omega(1/(d\log n)) \cdot \|Ax\|_G\leq \|SAx\|_2\leq O(d^2\log n) \cdot \|Ax\|_G$. By applying this subspace embedding technique, we show an approximation algorithm for the regression problem $\min_{x\in\mathbb{R}^d} \|Ax-b\|_G$, up to a $O(d\log^2 n)$ factor. As a further application of our techniques, we show how to also use them to improve on the algorithm for the $\ell_p$ low rank matrix approximation problem for $1\leq p<2$. Comment: ICML 201
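
    To make the Orlicz norm definition concrete, the following sketch evaluates the infimum over alpha numerically by bisection. It only illustrates the definition and is unrelated to the paper's embedding algorithm; the function name orlicz_norm and the tolerance are hypothetical, and the code assumes G is convex and non-negative with G(0) = 0 and is not identically zero, so that the sum is non-increasing in alpha.

```python
def orlicz_norm(x, G, tol=1e-10):
    """Return inf{alpha > 0 : sum_i G(|x_i| / alpha) <= 1} by bisection.

    Assumes G is convex, non-negative, G(0) = 0, and not identically zero.
    """
    a = [abs(v) for v in x]
    if max(a, default=0.0) == 0.0:
        return 0.0                      # the zero vector has norm 0

    def total(alpha):
        return sum(G(v / alpha) for v in a)

    lo, hi = 1.0, 1.0
    while total(hi) > 1.0:              # grow hi until the constraint holds
        hi *= 2.0
    while total(lo) <= 1.0:             # shrink lo until the constraint fails
        lo /= 2.0
    while hi - lo > tol * hi:           # bisect between infeasible lo and feasible hi
        mid = 0.5 * (lo + hi)
        if total(mid) <= 1.0:
            hi = mid
        else:
            lo = mid
    return hi
```

    With G(t) = t the definition reduces to the l1 norm and with G(t) = t^2 to the l2 norm, which gives a quick sanity check on the vector (3, 4).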

    Tensor Methods for Additive Index Models under Discordance and Heterogeneity

    Motivated by the sampling problems and heterogeneity issues common in high-dimensional big datasets, we consider a class of discordant additive index models. We propose method-of-moments-based procedures for estimating the indices of such discordant additive index models in both low and high-dimensional settings. Our estimators are based on factorizing certain moment tensors and are also applicable in the overcomplete setting, where the number of indices is more than the dimensionality of the datasets. Furthermore, we provide rates of convergence of our estimators in both high and low-dimensional settings. Establishing such results requires deriving tensor operator norm concentration inequalities that might be of independent interest. Finally, we provide simulation results supporting our theory. Our contributions extend the applicability of tensor methods to novel models in addition to making progress on understanding theoretical properties of such tensor methods.

    Learning Non-overlapping Convolutional Neural Networks with Multiple Kernels

    In this paper, we consider parameter recovery for non-overlapping convolutional neural networks (CNNs) with multiple kernels. We show that when the inputs follow a Gaussian distribution and the sample size is sufficiently large, the squared loss of such CNNs is locally strongly convex in a basin of attraction near the global optima for most popular activation functions, like ReLU, Leaky ReLU, Squared ReLU, Sigmoid and Tanh. The required sample complexity is proportional to the dimension of the input and polynomial in the number of kernels and a condition number of the parameters. We also show that tensor methods are able to initialize the parameters in the locally strongly convex region. Hence, for most smooth activations, gradient descent following tensor initialization is guaranteed to converge to the global optimum with time that is linear in input dimension, logarithmic in precision and polynomial in other factors. To the best of our knowledge, this is the first work that provides recovery guarantees for CNNs with multiple kernels under polynomial sample and computational complexities. Comment: arXiv admin note: text overlap with arXiv:1706.0317

    Global Convergence of EM Algorithm for Mixtures of Two Component Linear Regression

    The Expectation-Maximization algorithm is perhaps the most broadly used algorithm for inference of latent variable problems. A theoretical understanding of its performance, however, largely remains lacking. Recent results established that EM enjoys global convergence for Gaussian Mixture Models. For Mixed Linear Regression, however, only local convergence results have been established, and those only for the high SNR regime. We show here that EM converges for mixed linear regression with two components (it is known that it may fail to converge for three or more), and moreover that this convergence holds for random initialization. Our analysis reveals that EM exhibits very different behavior in Mixed Linear Regression from its behavior in Gaussian Mixture Models, and hence our proofs require the development of several new ideas. Comment: To appear in the proceedings of the Conference on Learning Theory (COLT), 2019. This paper results from a merger of work from two groups who work on the problem at the same tim

    Sample Efficient Subspace-based Representations for Nonlinear Meta-Learning

    Constructing good representations is critical for learning complex tasks in a sample-efficient manner. In the context of meta-learning, representations can be constructed from common patterns of previously seen tasks so that a future task can be learned quickly. While recent works show the benefit of subspace-based representations, such results are limited to linear-regression tasks. This work explores a more general class of nonlinear tasks with applications including binary classification, generalized linear models, and neural nets. We prove that subspace-based representations can be learned in a sample-efficient manner and provably benefit future tasks in terms of sample complexity. Numerical results verify the theoretical predictions in classification and neural-network regression tasks. Comment: To appear in ICASSP 21