Convergence of Parameter Estimates for Regularized Mixed Linear Regression Models
We consider {\em Mixed Linear Regression (MLR)}, where training data have
been generated from a mixture of distinct linear models (or clusters) and we
seek to identify the corresponding coefficient vectors. We introduce a {\em
Mixed Integer Programming (MIP)} formulation for MLR subject to regularization
constraints on the coefficient vectors. We establish that as the number of
training samples grows large, the MIP solution converges to the true
coefficient vectors in the absence of noise. Subject to slightly stronger
assumptions, we also establish that the MIP identifies the clusters from which
the training samples were generated. In the special case where training data
come from a single cluster, we establish that the corresponding MIP yields a
solution that converges to the true coefficient vector even when training data
are perturbed by (martingale difference) noise. We provide a counterexample
indicating that in the presence of noise, the MIP may fail to produce the true
coefficient vectors for more than one cluster. We also provide numerical
results testing the MIP solutions on noisy synthetic examples.
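For intuition, one natural MIP of this kind (a hedged sketch; the paper's exact formulation and regularization constraints may differ) assigns each sample $i$ a binary indicator $z_{ik}$ for cluster $k$ and constrains each coefficient vector to a regularization ball:

\[
\min_{\boldsymbol{\beta}, z} \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \bigl( y_i - \mathbf{x}_i^{\top} \boldsymbol{\beta}_k \bigr)^2
\quad \text{s.t.} \quad \sum_{k=1}^{K} z_{ik} = 1 \;\; \forall i, \qquad z_{ik} \in \{0, 1\}, \qquad \|\boldsymbol{\beta}_k\| \le R \;\; \forall k.
\]

The bilinear terms $z_{ik} ( y_i - \mathbf{x}_i^{\top} \boldsymbol{\beta}_k )^2$ can be linearized with standard big-$M$ constraints so that off-the-shelf MIP solvers apply.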
Learning Mixtures of Linear Regressions with Nearly Optimal Complexity
Mixtures of Linear Regressions (MLR) is an important mixture model with many
applications. In this model, each observation is generated from one of
several unknown linear regression components, and the identity of the
generating component is also unknown. Previous works either impose strong
assumptions on the data distribution or incur high complexity. This paper
proposes a fixed parameter tractable algorithm for the problem under general
conditions, which achieves global convergence and whose sample complexity
scales nearly linearly in the dimension. In particular, unlike previous works
that require the data to be drawn from the standard Gaussian, the algorithm
allows the data to come from Gaussians with different covariances. When the
condition number of the covariances and the number of components are fixed,
the algorithm has nearly optimal sample complexity as well as nearly optimal
computational complexity in the dimension of the data space. To the best of
our knowledge, this approach provides the first such recovery guarantee for
this general setting.
Estimating the Coefficients of a Mixture of Two Linear Regressions by Expectation Maximization
We give convergence guarantees for estimating the coefficients of a symmetric
mixture of two linear regressions by expectation maximization (EM). In
particular, we show that the empirical EM iterates converge to the target
parameter vector at the parametric rate, provided the algorithm is initialized
in an unbounded cone. Specifically, if the initial guess has a sufficiently
large cosine angle with the target parameter vector, a sample-splitting version
of the EM algorithm converges to the true coefficient vector with high
probability. Interestingly, our analysis borrows from tools used in the problem
of estimating the centers of a symmetric mixture of two Gaussians by EM. We
also show that the population EM operator for mixtures of two regressions is
anti-contractive from the target parameter vector if the cosine angle between
the input vector and the target parameter vector is too small, thereby
establishing the necessity of our conic condition. Finally, we give empirical
evidence supporting this theoretical observation, which suggests that the
sample-based EM algorithm performs poorly when initial guesses are drawn from
this anti-contractive region. Our simulation study also suggests that the EM algorithm performs
well even under model misspecification (i.e., when the covariate and error
distributions violate the model assumptions).
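For concreteness, here is a minimal numpy sketch of the sample EM update for the symmetric two-component model $y_i = s_i \langle x_i, \theta^* \rangle + \varepsilon_i$, $s_i \in \{\pm 1\}$ equiprobable (an illustrative reconstruction that assumes a known noise variance sigma2; it is not the paper's exact sample-splitting variant):

    import numpy as np

    def em_step(X, y, theta, sigma2):
        # E-step: for a balanced symmetric mixture, the difference of the two
        # posterior label probabilities is tanh(y_i <x_i, theta> / sigma^2).
        w = np.tanh(y * (X @ theta) / sigma2)
        # M-step: weighted least squares with pseudo-responses w_i * y_i.
        return np.linalg.solve(X.T @ X, X.T @ (w * y))

    def em(X, y, theta0, sigma2, iters=50):
        theta = theta0
        for _ in range(iters):
            theta = em_step(X, y, theta, sigma2)
        return theta

The conic initialization condition above corresponds to requiring that theta0 have a sufficiently large cosine angle with the target vector.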
Learning with Bad Training Data via Iterative Trimmed Loss Minimization
In this paper, we study a simple and generic framework to tackle the problem
of learning model parameters when a fraction of the training samples are
corrupted. We first make a simple observation: in a variety of such settings,
the evolution of training accuracy (as a function of training epochs) is
different for clean and bad samples. Based on this, we propose to iteratively
minimize the trimmed loss, by alternating between (a) selecting samples with
lowest current loss, and (b) retraining a model on only these samples. We prove
that this process recovers the ground truth (with linear convergence rate) in
generalized linear models with standard statistical assumptions.
Experimentally, we demonstrate its effectiveness in three settings: (a) deep
image classifiers with errors only in labels, (b) generative adversarial
networks with bad training images, and (c) deep image classifiers with
adversarial (image, label) pairs (i.e., backdoor attacks). For the well-studied
setting of random label noise, our algorithm achieves state-of-the-art
performance without having access to any a priori guaranteed clean samples.
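The alternating procedure is short enough to state directly; here is a generic sketch (illustrative only: fit, per_sample_loss, and the kept fraction alpha are placeholders for the model at hand, not names from the paper):

    import numpy as np

    def iterative_trimmed_loss(X, y, fit, per_sample_loss, alpha=0.8, rounds=10):
        # Alternate between (b) retraining on the currently kept samples and
        # (a) keeping the alpha-fraction of samples with the lowest loss.
        n = len(y)
        k = int(alpha * n)
        idx = np.arange(n)  # start from all samples
        for _ in range(rounds):
            model = fit(X[idx], y[idx])
            losses = per_sample_loss(model, X, y)
            idx = np.argsort(losses)[:k]
        return model

For ordinary least squares, fit can be lambda A, b: np.linalg.lstsq(A, b, rcond=None)[0], with per_sample_loss returning the squared residuals (X @ model - y) ** 2.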
List-Decodable Linear Regression
We give the first polynomial-time algorithm for robust regression in the
list-decodable setting where an adversary can corrupt a greater than $1/2$
fraction of examples.
For any $\alpha < 1$, our algorithm takes as input a sample of $n$ linear
equations where $\alpha n$ of the equations satisfy $y_i = \langle x_i, \ell^* \rangle + \zeta_i$
for some small noise $\zeta_i$ and the remaining $(1 - \alpha) n$ of the
equations are {\em arbitrarily} chosen. It outputs a list $L$ of size
$O(1/\alpha)$ - a fixed constant for any fixed $\alpha$ - that contains an
$\ell$ that is close to $\ell^*$.
Our algorithm succeeds whenever the inliers are chosen from a
\emph{certifiably} anti-concentrated distribution $D$. In particular, this
gives a $d^{\mathrm{poly}(1/\alpha)}$ time algorithm to find an $O(1/\alpha)$
size list when the inlier distribution is standard Gaussian. For discrete
product distributions that are anti-concentrated only in \emph{regular}
directions, we give an algorithm that achieves a similar guarantee under the
promise that $\ell^*$ has all coordinates of the same magnitude. To complement
our result, we prove that the anti-concentration assumption on the inliers is
information-theoretically necessary.
Our algorithm is based on a new framework for list-decodable learning that
strengthens the `identifiability to algorithms' paradigm based on the
sum-of-squares method.
In an independent and concurrent work, Raghavendra and Yau also used the
Sum-of-Squares method to give a similar result for list-decodable regression.
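A short counting argument shows why a list, rather than a single estimate, is unavoidable once the inlier fraction drops below $1/2$: the adversary can present a dataset that is a disjoint union of $m = \lfloor 1/\alpha \rfloor$ internally consistent regression instances,

\[
\{(x_i, y_i)\}_{i=1}^{n} = \bigsqcup_{j=1}^{m} S_j, \qquad |S_j| = \alpha n, \qquad y_i = \langle x_i, \ell_j \rangle \;\; \text{for } i \in S_j,
\]

so each $S_j$ is a legitimate inlier set, and any correct algorithm must output a list containing every $\ell_j$, i.e., of size at least $\lfloor 1/\alpha \rfloor$.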
Subspace Embedding and Linear Regression with Orlicz Norm
We consider a generalization of the classic linear regression problem to the
case when the loss is an Orlicz norm. An Orlicz norm is parameterized by a
non-negative convex function $G : [0, \infty) \to [0, \infty)$ with $G(0) = 0$:
the Orlicz norm of a vector $x \in \mathbb{R}^n$ is defined as
$\|x\|_G = \inf \{ \alpha > 0 \mid \sum_{i=1}^{n} G(|x_i|/\alpha) \le 1 \}$.
We consider the cases where the function $G$ grows subquadratically. Our main
result is based on a new oblivious embedding which embeds the column space of
a given $n \times d$ matrix $A$ with Orlicz norm into a lower dimensional
space with $\ell_2$ norm. Specifically, we show how to efficiently find an
embedding matrix $S$ such that $\|SAx\|_2$ approximates $\|Ax\|_G$ up to
bounded distortion for every $x$. By applying this subspace embedding
technique, we show an approximation algorithm for the regression problem
$\min_x \|Ax - b\|_G$, up to a factor polynomial in $d$ and polylogarithmic in
$n$. As a further application of our techniques, we show how to also use them
to improve on the algorithm for the $\ell_p$ low rank matrix approximation
problem for $1 \le p < 2$.
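Since $\sum_i G(|x_i|/\alpha)$ is non-increasing in $\alpha$, the Orlicz norm defined above can be evaluated numerically by bisection. A small sketch, with a subquadratic Huber-like $G$ chosen purely for illustration:

    import numpy as np

    def huber_G(t, delta=1.0):
        # Non-negative convex G with G(0) = 0: quadratic near zero,
        # linear (hence subquadratic) growth in the tails.
        return np.where(t <= delta, 0.5 * t ** 2, delta * (t - 0.5 * delta))

    def orlicz_norm(x, G=huber_G, tol=1e-10):
        # ||x||_G = inf{ a > 0 : sum_i G(|x_i| / a) <= 1 }, found by bisection
        # using the monotonicity of the constraint sum in a.
        x = np.abs(np.asarray(x, dtype=float))
        if not x.any():
            return 0.0
        lo, hi = 1e-12, 1.0
        while G(x / hi).sum() > 1.0:  # grow hi until feasible
            hi *= 2.0
        while hi - lo > tol * hi:
            mid = 0.5 * (lo + hi)
            if G(x / mid).sum() <= 1.0:
                hi = mid
            else:
                lo = mid
        return hi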
Tensor Methods for Additive Index Models under Discordance and Heterogeneity
Motivated by the sampling problems and heterogeneity issues common in
high-dimensional big datasets, we consider a class of discordant additive index
models. We propose method of moments based procedures for estimating the
indices of such discordant additive index models in both low and
high-dimensional settings. Our estimators are based on factorizing certain
moment tensors and are also applicable in the overcomplete setting, where the
number of indices is more than the dimensionality of the datasets. Furthermore,
we provide rates of convergence of our estimator in both high and
low-dimensional settings. Establishing such results requires deriving tensor
operator norm concentration inequalities that might be of independent interest.
Finally, we provide simulation results supporting our theory. Our contributions
extend the applicability of tensor methods for novel models in addition to
making progress on understanding the theoretical properties of such tensor methods.
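To make the method-of-moments idea concrete, here is a hedged numpy sketch for the simplest Gaussian-design case. It is illustrative rather than the paper's estimator: it uses the second-order Stein identity $\mathbb{E}[y (x x^{\top} - I)] = \sum_k c_k \beta_k \beta_k^{\top}$, valid for smooth additive index models $y = \sum_k f_k(\beta_k^{\top} x)$ with $x \sim N(0, I)$, and assumes orthonormal indices with distinct nonzero coefficients $c_k$ so they can be read off an eigendecomposition:

    import numpy as np

    def estimate_indices(X, y, K):
        # Empirical second-order Stein moment M = mean_i y_i (x_i x_i^T - I).
        n, d = X.shape
        M = (X * y[:, None]).T @ X / n - y.mean() * np.eye(d)
        M = (M + M.T) / 2  # symmetrize against sampling noise
        vals, vecs = np.linalg.eigh(M)
        order = np.argsort(np.abs(vals))[::-1]  # largest |eigenvalue| first
        return vecs[:, order[:K]]  # columns: estimated index directions

The paper's tensor factorization plays the role of this eigendecomposition in the general (non-orthogonal, overcomplete) setting.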
Learning Non-overlapping Convolutional Neural Networks with Multiple Kernels
In this paper, we consider parameter recovery for non-overlapping
convolutional neural networks (CNNs) with multiple kernels. We show that when
the inputs follow a Gaussian distribution and the sample size is sufficiently
large, the squared loss of such CNNs is locally strongly convex in
a basin of attraction near the global optima for most popular activation
functions, like ReLU, Leaky ReLU, Squared ReLU, Sigmoid and Tanh. The required
sample complexity is proportional to the dimension of the input and polynomial
in the number of kernels and a condition number of the parameters. We also show
that tensor methods are able to initialize the parameters into the locally strongly
convex region. Hence, for most smooth activations, gradient descent following
tensor initialization is guaranteed to converge to the global optimum in time
that is linear in input dimension, logarithmic in precision and polynomial in
other factors. To the best of our knowledge, this is the first work that
provides recovery guarantees for CNNs with multiple kernels under polynomial
sample and computational complexities.
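As a reference for the model being analyzed, a minimal numpy sketch of a non-overlapping CNN with multiple kernels (one illustrative reading: the input is cut into disjoint patches, each patch is filtered by every kernel, and the activations are averaged; the paper's exact output layer may differ):

    import numpy as np

    def nonoverlap_cnn(x, W, act=lambda t: np.maximum(t, 0.0)):
        # x: input of length num_patches * patch_len; W: kernels of shape
        # (num_kernels, patch_len); act: activation (ReLU here).
        k, p = W.shape
        patches = x.reshape(-1, p)        # disjoint (non-overlapping) patches
        return act(patches @ W.T).mean()  # average over patches and kernels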
Global Convergence of EM Algorithm for Mixtures of Two Component Linear Regression
The Expectation-Maximization algorithm is perhaps the most broadly used
algorithm for inference in latent variable problems. A theoretical
understanding of its performance, however, largely remains lacking. Recent
results established that EM enjoys global convergence for Gaussian Mixture
Models. For Mixed Linear Regression, however, only local convergence results
have been established, and those only for the high SNR regime. We show here
that EM converges for mixed linear regression with two components (it is known
that it may fail to converge for three or more), and moreover that this
convergence holds for random initialization. Our analysis reveals that EM
exhibits very different behavior in Mixed Linear Regression from its behavior
in Gaussian Mixture Models, and hence our proofs require the development of
several new ideas.

Comment: To appear in the proceedings of the Conference on Learning Theory
(COLT), 2019. This paper results from a merger of work from two groups who
worked on the problem at the same time.
Sample Efficient Subspace-based Representations for Nonlinear Meta-Learning
Constructing good representations is critical for learning complex tasks in a
sample efficient manner. In the context of meta-learning, representations can
be constructed from common patterns of previously seen tasks so that a future
task can be learned quickly. While recent works show the benefit of
subspace-based representations, such results are limited to linear-regression
tasks. This work explores a more general class of nonlinear tasks with
applications ranging from binary classification to generalized linear models and
neural nets. We prove that subspace-based representations can be learned in a
sample-efficient manner and provably benefit future tasks in terms of sample
complexity. Numerical results verify the theoretical predictions in
classification and neural-network regression tasks.

Comment: To appear in ICASSP 2021.
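As a generic illustration of subspace-based representations (a hedged sketch, not the paper's estimator): stack rough per-task parameter estimates, extract their top singular subspace, and solve a new task inside that subspace, reducing the number of free parameters from the ambient dimension d to the subspace dimension r:

    import numpy as np

    def learn_subspace(task_estimates, r):
        # task_estimates: (T, d) array of rough per-task parameter estimates.
        # Returns a (d, r) orthonormal basis of their top-r singular subspace.
        _, _, Vt = np.linalg.svd(task_estimates, full_matrices=False)
        return Vt[:r].T

    def solve_new_task(X, y, U):
        # Least squares restricted to the subspace: fit y ~ (X U) c, return U c.
        c, *_ = np.linalg.lstsq(X @ U, y, rcond=None)
        return U @ c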