High-Dimensional Robust Mean Estimation via Gradient Descent
We study the problem of high-dimensional robust mean estimation in the
presence of a constant fraction of adversarial outliers. A recent line of work
has provided sophisticated polynomial-time algorithms for this problem with
dimension-independent error guarantees for a range of natural distribution
families.
In this work, we show that a natural non-convex formulation of the problem
can be solved directly by gradient descent. Our approach leverages a novel
structural lemma, roughly showing that any approximate stationary point of our
non-convex objective gives a near-optimal solution to the underlying robust
estimation task. Our work establishes an intriguing connection between
algorithmic high-dimensional robust statistics and non-convex optimization,
which may have broader applications to other robust estimation tasks.
Comment: Under submission to ICML'2
Robust Estimation via Robust Gradient Estimation
We provide a new computationally-efficient class of estimators for risk
minimization. We show that these estimators are robust for general statistical
models: in the classical Huber epsilon-contamination model and in heavy-tailed
settings. Our workhorse is a novel robust variant of gradient descent, and we
provide conditions under which our gradient descent variant provides accurate
estimators in a general convex risk minimization problem. We provide specific
consequences of our theory for linear regression, logistic regression and for
estimation of the canonical parameters in an exponential family. These results
provide some of the first computationally tractable and provably robust
estimators for these canonical statistical models. Finally, we study the
empirical performance of our proposed methods on synthetic and real datasets,
and find that our methods convincingly outperform a variety of baselines.
Comment: 48 pages, 5 figures
Defending Against Saddle Point Attack in Byzantine-Robust Distributed Learning
We study robust distributed learning that involves minimizing a non-convex
loss function with saddle points. We consider the Byzantine setting where some
worker machines have abnormal or even arbitrary and adversarial behavior. In
this setting, the Byzantine machines may create fake local minima near a saddle
point that is far away from any true local minimum, even when robust gradient
estimators are used. We develop ByzantinePGD, a robust first-order algorithm
that can provably escape saddle points and fake local minima, and converge to
an approximate true local minimizer with low iteration complexity. As a
by-product, we give a simpler algorithm and analysis for escaping saddle points
in the usual non-Byzantine setting. We further discuss three robust gradient
estimators that can be used in ByzantinePGD, including median, trimmed mean,
and iterative filtering. We characterize their performance in concrete
statistical settings, and argue for their near-optimality in low and high
dimensional regimes.
Comment: ICML 201
Securing Distributed Gradient Descent in High Dimensional Statistical Learning
We consider unreliable distributed learning systems wherein the training data
is kept confidential by external workers, and the learner has to interact
closely with those workers to train a model. In particular, we assume that
there exists a system adversary that can adaptively compromise some workers;
the compromised workers deviate from their local designed specifications by
sending out arbitrarily malicious messages.
We assume in each communication round, up to out of the workers
suffer Byzantine faults. Each worker keeps a local sample of size and the
total sample size is . We propose a secured variant of the gradient
descent method that can tolerate up to a constant fraction of Byzantine
workers, i.e., . Moreover, we show the statistical estimation error
of the iterates converges in rounds to , where is the model dimension. As long as , our
proposed algorithm achieves the optimal error rate . Our results
are obtained under some technical assumptions. Specifically, we assume
strongly-convex population risk. Nevertheless, the empirical risk (sample
version) is allowed to be non-convex. The core of our method is to robustly
aggregate the gradients computed by the workers based on the filtering
procedure proposed by Steinhardt et al. On the technical front, deviating from
the existing literature on robustly estimating a finite-dimensional mean
vector, we establish a uniform concentration of the sample covariance
matrix of gradients, and show that the aggregated gradient, as a function of
model parameter, converges uniformly to the true gradient function. To get a
near-optimal uniform concentration bound, we develop a new matrix concentration
inequality, which might be of independent interest.
Statistical consistency and asymptotic normality for high-dimensional robust M-estimators
We study theoretical properties of regularized robust M-estimators,
applicable when data are drawn from a sparse high-dimensional linear model and
contaminated by heavy-tailed distributions and/or outliers in the additive
errors and covariates. We first establish a form of local statistical
consistency for the penalized regression estimators under fairly mild
conditions on the error distribution: When the derivative of the loss function
is bounded and satisfies a local restricted curvature condition, all stationary
points within a constant radius of the true regression vector converge at the
minimax rate enjoyed by the Lasso with sub-Gaussian errors. When an appropriate
nonconvex regularizer is used in place of an l_1-penalty, we show that such
stationary points are in fact unique and equal to the local oracle solution
with the correct support---hence, results on asymptotic normality in the
low-dimensional case carry over immediately to the high-dimensional setting.
This has important implications for the efficiency of regularized nonconvex
M-estimators when the errors are heavy-tailed. Our analysis of the local
curvature of the loss function also has useful consequences for optimization
when the robust regression function and/or regularizer is nonconvex and the
objective function possesses stationary points outside the local region. We
show that as long as a composite gradient descent algorithm is initialized
within a constant radius of the true regression vector, successive iterates
will converge at a linear rate to a stationary point within the local region.
Furthermore, the global optimum of a convex regularized robust regression
function may be used to obtain a suitable initialization. The result is a novel
two-step procedure that uses a convex M-estimator to achieve consistency and a
nonconvex M-estimator to increase efficiency.
Comment: 56 pages, 8 figures
Harnessing Structures in Big Data via Guaranteed Low-Rank Matrix Estimation
Low-rank modeling plays a pivotal role in signal processing and machine
learning, with applications ranging from collaborative filtering, video
surveillance, medical imaging, to dimensionality reduction and adaptive
filtering. Many modern high-dimensional data and interactions thereof can be
modeled as lying approximately in a low-dimensional subspace or manifold,
possibly with additional structures, and properly exploiting this structure leads to
significant reductions in the cost of sensing, computation and storage. In recent
years, there has been substantial progress in understanding how to exploit low-rank
structures using computationally efficient procedures in a provable manner,
including both convex and nonconvex approaches. On one side, convex relaxations
such as nuclear norm minimization often lead to statistically optimal
procedures for estimating low-rank matrices, where first-order methods are
developed to address the computational challenges; on the other side, there is
emerging evidence that properly designed nonconvex procedures, such as
projected gradient descent, often provide globally optimal solutions with a
much lower computational cost in many problems. This survey article will
provide a unified overview of these recent advances on low-rank matrix
estimation from incomplete measurements. Attention is paid to rigorous
characterization of the performance of these algorithms, and to problems where
the low-rank matrix has additional structural properties that require new
algorithmic designs and theoretical analysis.
Comment: To appear in IEEE Signal Processing Magazine
Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates
In large-scale distributed learning, security issues have become increasingly
important. Particularly in a decentralized environment, some computing units
may behave abnormally, or even exhibit Byzantine failures---arbitrary and
potentially adversarial behavior. In this paper, we develop distributed
learning algorithms that are provably robust against such failures, with a
focus on achieving optimal statistical performance. A main result of this work
is a sharp analysis of two robust distributed gradient descent algorithms based
on median and trimmed mean operations, respectively. We prove statistical error
rates for three kinds of population loss functions: strongly convex,
non-strongly convex, and smooth non-convex. In particular, these algorithms are
shown to achieve order-optimal statistical error rates for strongly convex
losses. To achieve better communication efficiency, we further propose a
median-based distributed algorithm that is provably robust, and uses only one
communication round. For strongly convex quadratic loss, we show that this
algorithm achieves the same optimal error rate as the robust distributed
gradient descent algorithms.
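The two aggregation rules named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `beta` trimming parameter, the fixed step size, and the iteration count are all assumptions made for illustration. Each worker is modeled as a function returning a gradient, and Byzantine workers may return anything.

```python
import numpy as np

def coordinate_median(grads):
    """Coordinate-wise median of worker gradients; grads has shape (m, d)."""
    return np.median(grads, axis=0)

def trimmed_mean(grads, beta):
    """Coordinate-wise trimmed mean.

    For each coordinate, drop the floor(beta * m) smallest and the
    floor(beta * m) largest values, then average what remains.
    """
    m = grads.shape[0]
    k = int(np.floor(beta * m))
    if 2 * k >= m:
        raise ValueError("beta is too large: no values would remain")
    sorted_grads = np.sort(grads, axis=0)  # sorts each coordinate independently
    return sorted_grads[k:m - k].mean(axis=0)

def robust_gd(worker_grad_fns, w0, aggregate, step=0.1, iters=100):
    """Gradient descent where worker gradients are combined by `aggregate`.

    worker_grad_fns: list of callables w -> gradient (hypothetical interface).
    aggregate: callable mapping an (m, d) array to a length-d vector.
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        grads = np.stack([g(w) for g in worker_grad_fns])
        w = w - step * aggregate(grads)
    return w
```

For example, `robust_gd(fns, w0, coordinate_median)` uses the median rule, and `robust_gd(fns, w0, lambda g: trimmed_mean(g, 0.25))` uses the trimmed mean. The trimmed mean only helps when `beta` is at least the fraction of faulty workers.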
Implicit Regularization via Hadamard Product Over-Parametrization in High-Dimensional Linear Regression
We consider Hadamard product parametrization as a change-of-variable
(over-parametrization) technique for solving least squares problems in the
context of linear regression. Despite the non-convexity and exponentially many
saddle points induced by the change-of-variable, we show that under certain
conditions, this over-parametrization leads to implicit regularization: if we
directly apply gradient descent to the residual sum of squares with
sufficiently small initial values, then under a proper early stopping rule, the
iterates converge to a nearly sparse rate-optimal solution with relatively
better accuracy than explicit regularized approaches. In particular, the
resulting estimator does not suffer from extra bias due to explicit penalties,
and can achieve the parametric root-n rate (independent of the dimension)
under proper conditions on the signal-to-noise ratio. We perform simulations to
compare our methods with high dimensional linear regression with explicit
regularizations. Our results illustrate advantages of using implicit
regularization via gradient descent after over-parametrization in sparse vector
estimation.
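The idea in this abstract can be sketched as follows, under stated assumptions. The coefficient vector is written as a Hadamard product `g * l`, and plain gradient descent is run on the residual sum of squares. The initialization (`g` set to a small constant `alpha`, `l` set to zero), the step size, and the use of a fixed iteration count as a stand-in for an early stopping rule are illustrative choices, not details taken from the abstract.

```python
import numpy as np

def hadamard_gd(X, y, alpha=1e-3, step=0.1, iters=1000):
    """Gradient descent on (1 / 2n) * ||y - X (g * l)||^2 over (g, l).

    alpha: small initial scale (assumed value).
    iters: fixed iteration budget, standing in for an early stopping rule.
    Returns the implied coefficient vector beta = g * l.
    """
    n, d = X.shape
    g = np.full(d, alpha)  # small, nonzero start
    l = np.zeros(d)        # zero start lets coordinates pick up either sign
    for _ in range(iters):
        residual = X @ (g * l) - y
        grad_beta = X.T @ residual / n  # gradient with respect to beta = g * l
        # Chain rule: d/dg = l * grad_beta and d/dl = g * grad_beta.
        # Both updates use the values from the previous iterate.
        g, l = g - step * l * grad_beta, l - step * g * grad_beta
    return g * l
```

No explicit penalty appears anywhere in the loop. In this sketch, any sparsity in the output comes only from the small initialization and the limited number of steps.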
The Landscape of Empirical Risk for Non-convex Losses
Most high-dimensional estimation and prediction methods propose to minimize a
cost function (empirical risk) that is written as a sum of losses associated to
each data point. In this paper we focus on the case of non-convex losses, which
is practically important but still poorly understood. Classical empirical
process theory implies uniform convergence of the empirical risk to the
population risk. While uniform convergence implies consistency of the resulting
M-estimator, it does not ensure that the latter can be computed efficiently.
In order to capture the complexity of computing M-estimators, we propose to
study the landscape of the empirical risk, namely its stationary points and
their properties. We establish uniform convergence of the gradient and Hessian
of the empirical risk to their population counterparts, as soon as the number
of samples becomes larger than the number of unknown parameters (modulo
logarithmic factors). Consequently, good properties of the population risk can
be carried to the empirical risk, and we can establish one-to-one
correspondence of their stationary points. We demonstrate that in several
problems such as non-convex binary classification, robust regression, and
Gaussian mixture model, this result implies a complete characterization of the
landscape of the empirical risk, and of the convergence properties of descent
algorithms.
We extend our analysis to the very high-dimensional setting in which the
number of parameters exceeds the number of samples, and provide a
characterization of the empirical risk landscape under a nearly
information-theoretically minimal condition. Namely, if the number of samples
exceeds the sparsity of the unknown parameters vector (modulo logarithmic
factors), then a suitable uniform convergence result takes place. We apply this
result to non-convex binary classification and robust regression in very
high-dimension.
Comment: This version presents a general framework, and applies it to several
statistical learning problems
Sever: A Robust Meta-Algorithm for Stochastic Optimization
In high dimensions, most machine learning methods are brittle to even a small
fraction of structured outliers. To address this, we introduce a new
meta-algorithm that can take in a base learner such as least squares or
stochastic gradient descent, and harden the learner to be resistant to
outliers. Our method, Sever, possesses strong theoretical guarantees yet is
also highly scalable -- beyond running the base learner itself, it only
requires computing the top singular vector of a certain matrix. We
apply Sever on a drug design dataset and a spam classification dataset, and
find that in both cases it has substantially greater robustness than several
baselines. On the spam dataset, with corruptions, we achieved
test error, compared to for the baselines, and error on
the uncorrupted dataset. Similarly, on the drug design dataset, with
corruptions, we achieved mean-squared error test error, compared to
- for the baselines, and error on the uncorrupted dataset.
Comment: To appear in ICML 201
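The abstract describes wrapping a base learner and computing the top singular vector of a certain matrix. A rough sketch of that kind of filter is given below, with least squares as the base learner. This is a simplification and not the Sever algorithm itself: taking the matrix to be the centered per-sample gradients, scoring points by their squared projection, dropping a fixed fraction per round, and the names `outlier_scores`, `filtered_least_squares`, `rounds` and `drop_frac` are all assumptions made for illustration.

```python
import numpy as np

def outlier_scores(grads):
    """Squared projection of centered gradients onto their top singular vector."""
    centered = grads - grads.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_direction = vt[0]  # top right singular vector
    return (centered @ top_direction) ** 2

def filtered_least_squares(X, y, rounds=2, drop_frac=0.1):
    """Alternate between fitting least squares and dropping high-score points.

    Returns the final weights and the indices of the points that were kept.
    """
    active = np.arange(len(y))
    for _ in range(rounds):
        w, *_ = np.linalg.lstsq(X[active], y[active], rcond=None)
        residuals = X[active] @ w - y[active]
        grads = residuals[:, None] * X[active]  # per-sample squared-loss gradients
        n_drop = int(drop_frac * len(active))
        if n_drop == 0:
            break
        keep = np.argsort(outlier_scores(grads))[: len(active) - n_drop]
        active = active[keep]
    w, *_ = np.linalg.lstsq(X[active], y[active], rcond=None)
    return w, active
```

Dropping a fixed fraction each round is the simplest possible removal rule. It discards some clean points along with the outliers, which is acceptable only when the sample is large relative to the dimension.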