
    Fast, Sample-Efficient, Affine-Invariant Private Mean and Covariance Estimation for Subgaussian Distributions

    We present a fast, differentially private algorithm for high-dimensional covariance-aware mean estimation with nearly optimal sample complexity. Only exponential-time estimators were previously known to achieve this guarantee. Given $n$ samples from a (sub-)Gaussian distribution with unknown mean $\mu$ and covariance $\Sigma$, our $(\varepsilon,\delta)$-differentially private estimator produces $\tilde{\mu}$ such that $\|\mu - \tilde{\mu}\|_{\Sigma} \leq \alpha$ as long as $n \gtrsim \tfrac{d}{\alpha^2} + \tfrac{d \sqrt{\log 1/\delta}}{\alpha \varepsilon} + \tfrac{d \log 1/\delta}{\varepsilon}$. The Mahalanobis error metric $\|\mu - \hat{\mu}\|_{\Sigma}$ measures the distance between $\hat{\mu}$ and $\mu$ relative to $\Sigma$; it characterizes the error of the sample mean. Our algorithm runs in time $\tilde{O}(nd^{\omega - 1} + nd/\varepsilon)$, where $\omega < 2.38$ is the matrix multiplication exponent. We adapt an exponential-time approach of Brown, Gaboardi, Smith, Ullman, and Zakynthinou (2021), giving efficient variants of stable mean and covariance estimation subroutines that also improve the sample complexity to the nearly optimal bound above. Our stable covariance estimator can be turned into a private covariance estimator for unrestricted subgaussian distributions. With $n \gtrsim d^{3/2}$ samples, our estimate is accurate in spectral norm. This is the first such algorithm using $n = o(d^2)$ samples, answering an open question posed by Alabi et al. (2022). With $n \gtrsim d^2$ samples, our estimate is accurate in Frobenius norm. This leads to a fast, nearly optimal algorithm for private learning of unrestricted Gaussian distributions in TV distance. Duchi, Haque, and Kuditipudi (2023) obtained similar results independently and concurrently. Comment: 44 pages. New version fixes typos and includes additional exposition and discussion of related work.
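
    The Mahalanobis error above is $\|\mu - \hat{\mu}\|_{\Sigma} = \sqrt{(\mu - \hat{\mu})^\top \Sigma^{-1} (\mu - \hat{\mu})}$ and is easy to evaluate directly. The sketch below is a minimal NumPy illustration of the metric itself (the function name mahalanobis_error and the toy data are ours), not of the private estimator described in the abstract.

```python
import numpy as np

def mahalanobis_error(mu, mu_hat, Sigma):
    """Evaluate ||mu - mu_hat||_Sigma = sqrt((mu - mu_hat)^T Sigma^{-1} (mu - mu_hat)).

    Illustrative only: this computes the error metric from the abstract,
    not the private estimator itself. Sigma is assumed positive definite.
    """
    diff = mu - mu_hat
    # Solve Sigma z = diff rather than forming Sigma^{-1} explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(Sigma, diff)))

# Example: the (non-private) sample mean's error relative to the true covariance,
# which is typically on the order of sqrt(d/n).
rng = np.random.default_rng(0)
d, n = 5, 1000
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)          # a positive-definite covariance
mu = rng.standard_normal(d)
X = rng.multivariate_normal(mu, Sigma, size=n)
print(mahalanobis_error(mu, X.mean(axis=0), Sigma))
```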

    Privately Estimating a Gaussian: Efficient, Robust and Optimal

    In this work, we give efficient algorithms for privately estimating a Gaussian distribution in both the pure and approximate differential privacy (DP) models, with optimal dependence on the dimension in the sample complexity. In the pure DP setting, we give an efficient algorithm that estimates an unknown $d$-dimensional Gaussian distribution up to an arbitrarily tiny total variation error using $\widetilde{O}(d^2 \log \kappa)$ samples while tolerating a constant fraction of adversarial outliers. Here, $\kappa$ is the condition number of the target covariance matrix. The sample bound matches the best non-private estimators in the dependence on the dimension (up to a polylogarithmic factor). We prove a new lower bound on differentially private covariance estimation to show that the dependence on the condition number $\kappa$ in the above sample bound is also tight. Prior to our work, only identifiability results (yielding inefficient super-polynomial-time algorithms) were known for the problem. In the approximate DP setting, we give an efficient algorithm to estimate an unknown Gaussian distribution up to an arbitrarily tiny total variation error using $\widetilde{O}(d^2)$ samples while tolerating a constant fraction of adversarial outliers. Prior to our work, all efficient approximate DP algorithms incurred a super-quadratic sample cost or were not outlier-robust. For the special case of mean estimation, our algorithm achieves the optimal sample complexity of $\widetilde{O}(d)$, improving on an $\widetilde{O}(d^{1.5})$ bound from prior work. Our pure DP algorithm relies on a recursive private preconditioning subroutine that utilizes recent work on private mean estimation [Hopkins et al., 2022]. Our approximate DP algorithms are based on a substantial upgrade of the method of stabilizing convex relaxations introduced in [Kothari et al., 2022].
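
    For context on what an $(\varepsilon,\delta)$-DP covariance-type estimate involves, the following sketch applies the standard Gaussian mechanism to a clipped empirical second-moment matrix. It is a naive baseline under an assumed clipping radius $R$ (the names and parameters are ours), not the outlier-robust, condition-number-optimal algorithm of this paper, and it does not achieve the guarantees stated above.

```python
import numpy as np

def dp_second_moment(X, R, eps, delta, rng=None):
    """Naive (eps, delta)-DP estimate of the second-moment matrix E[x x^T].

    Simplified Gaussian-mechanism baseline, not the paper's algorithm:
    rows are clipped to norm R, so replacing one sample changes the averaged
    outer product by at most 2 R^2 / n in Frobenius norm (the L2 sensitivity).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xc = X * np.minimum(1.0, R / np.maximum(norms, 1e-12))  # clip rows to radius R
    S = Xc.T @ Xc / n                                        # clipped empirical second moment
    sensitivity = 2.0 * R**2 / n
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    noisy = S + sigma * rng.standard_normal((d, d))          # Gaussian mechanism, entrywise
    return (noisy + noisy.T) / 2.0                           # symmetrize (post-processing)
```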

    Robust classification via MOM minimization

    We present an extension of Vapnik's classical empirical risk minimizer (ERM) in which the empirical risk is replaced by a median-of-means (MOM) estimator; the resulting estimators are called MOM minimizers. While ERM is sensitive to corruption of the dataset for many classical loss functions used in classification, we show that MOM minimizers behave well in theory, in the sense that they achieve Vapnik's (slow) rates of convergence under weak assumptions: the data are only required to have a finite second moment, and some outliers may also have corrupted the dataset. We propose algorithms inspired by MOM minimizers, which can be analyzed using arguments quite similar to those used for Stochastic Block Gradient descent. As a proof of concept, we show how to modify a proof of consistency for a descent algorithm to prove consistency of its MOM version. Since MOM algorithms perform a smart subsampling, our procedure can also substantially reduce computation time and memory resources when applied to nonlinear algorithms. These empirical performances are illustrated on both simulated and real datasets.
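
    As a concrete illustration of the median-of-means idea underlying these estimators, the sketch below computes a MOM mean estimate: shuffle the sample, split it into blocks, average within each block, and take the median of the block means. This is a minimal sketch of the MOM principle only, not the MOM risk minimizer analyzed in the paper; the block count is an assumed parameter.

```python
import numpy as np

def median_of_means(x, n_blocks, rng=None):
    """Median-of-means estimate of the mean of x (coordinate-wise for vector data).

    Minimal sketch of the MOM principle, not the MOM risk minimizer analyzed
    in the paper: shuffle the sample, split it into n_blocks blocks, average
    each block, and return the median of the block means.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    blocks = np.array_split(x[rng.permutation(len(x))], n_blocks)
    block_means = np.array([b.mean(axis=0) for b in blocks])
    return np.median(block_means, axis=0)

# Heavy-tailed data with a few gross outliers: the plain sample mean is
# dragged far from 0, while the median-of-means estimate stays close to it.
rng = np.random.default_rng(1)
x = rng.standard_t(df=2, size=10_000)
x[:20] += 1e4
print(x.mean(), median_of_means(x, n_blocks=50, rng=rng))
```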

    Large-Scale Nonparametric and Semiparametric Inference for Large, Complex, and Noisy Datasets

    Massive Data bring new opportunities and challenges to data scientists and statisticians. On one hand, Massive Data hold great promise for discovering subtle population patterns and heterogeneities that are not detectable with small-scale data. On the other hand, the size and dimensionality of Massive Data introduce unique statistical challenges and consequences for model misspecification. Some important factors are as follows. Complexity: since Massive Data are often aggregated from multiple sources, they often exhibit heavy-tailed behavior with nontrivial tail dependence. Noise: Massive Data usually contain various types of measurement error, outliers, and missing values. Dependence: in many data types, such as financial time series, functional magnetic resonance imaging (fMRI), and time-course microarray data, the samples are dependent, with relatively weak signals. These challenges are difficult to address and require new computational and statistical tools. More specifically, to handle these challenges, it is necessary to develop statistical methods that are robust to data complexity, noise, and dependence. Our work aims to make headway in resolving these issues. Notably, we give a unified framework for analyzing high-dimensional, complex, noisy datasets having temporal/spatial dependence. The proposed methods enjoy good theoretical properties. Their empirical usefulness is also verified in large-scale neuroimaging and financial data analysis.

    Online Robust Mean Estimation

    We study the problem of high-dimensional robust mean estimation in an online setting. Specifically, we consider a scenario where $n$ sensors are measuring some common, ongoing phenomenon. At each time step $t = 1, 2, \ldots, T$, the $i$-th sensor reports its readings $x^{(i)}_t$ for that time step. The algorithm must then commit to its estimate $\mu_t$ for the true mean value of the process at time $t$. We assume that most of the sensors observe independent samples from some common distribution $X$, but an $\epsilon$-fraction of them may instead behave maliciously. The algorithm wishes to compute a good approximation $\mu$ to the true mean $\mu^\ast := \mathbf{E}[X]$. We note that if the algorithm is allowed to wait until time $T$ to report its estimate, this reduces to the well-studied problem of robust mean estimation. However, the requirement that our algorithm produce partial estimates as the data come in substantially complicates the situation. We prove two main results about online robust mean estimation in this model. First, if the uncorrupted samples satisfy the standard condition of $(\epsilon,\delta)$-stability, we give an efficient online algorithm that outputs estimates $\mu_t$, $t \in [T]$, such that with high probability it holds that $\|\mu - \mu^\ast\|_2 = O(\delta \log(T))$, where $\mu = (\mu_t)_{t \in [T]}$. We note that this error bound is nearly competitive with the best offline algorithms, which would achieve $\ell_2$-error of $O(\delta)$. Our second main result shows that with additional assumptions on the input (most notably that $X$ is a product distribution) there are inefficient algorithms whose error does not depend on $T$ at all. Comment: To appear in SODA202
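
    To make the online interface concrete, here is a toy protocol sketch in which the algorithm commits, at each time step, to a coordinate-wise trimmed mean of that round's sensor readings. It only illustrates the setting (an estimate $\mu_t$ must be produced before future data arrive); it is a stand-in of our own, not the stability-based algorithm with the $O(\delta \log T)$ guarantee, and the trimming fraction and data are assumptions.

```python
import numpy as np

def trimmed_mean(v, eps):
    """Coordinate-wise trimmed mean: drop the eps-fraction smallest and largest readings."""
    v = np.sort(np.asarray(v, dtype=float), axis=0)
    k = int(np.ceil(eps * v.shape[0]))
    return v[k: v.shape[0] - k].mean(axis=0) if v.shape[0] > 2 * k else v.mean(axis=0)

def online_estimates(readings, eps):
    """Toy online protocol: at each step t, commit to an estimate mu_t using only
    the readings available at that step (here, just the current round's readings).

    This illustrates the interface only; the paper's algorithm exploits
    (eps, delta)-stability of the uncorrupted samples, which this does not.
    """
    return np.array([trimmed_mean(x_t, eps) for x_t in readings])

# n sensors report d-dimensional readings for T rounds; the same eps-fraction
# of sensors behaves maliciously in every round.
rng = np.random.default_rng(2)
n, d, T, eps = 100, 3, 5, 0.1
readings = [rng.standard_normal((n, d)) for _ in range(T)]
for x_t in readings:
    x_t[: int(eps * n)] = 100.0      # corrupted sensors report a large constant
print(online_estimates(readings, eps).round(2))   # all estimates stay near 0
```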

    Robust Methods for High-Dimensional Linear Learning

    We propose statistically robust and computationally efficient linear learning methods in the high-dimensional batch setting, where the number of features $d$ may exceed the sample size $n$. We employ, in a generic learning setting, two algorithms depending on whether the considered loss function is gradient-Lipschitz or not. We then instantiate our framework on several applications, including vanilla sparse, group-sparse, and low-rank matrix recovery. This leads, for each application, to efficient and robust learning algorithms that reach near-optimal estimation rates under heavy-tailed distributions and in the presence of outliers. For vanilla $s$-sparsity, we are able to reach the $s \log(d)/n$ rate under heavy tails and $\eta$-corruption, at a computational cost comparable to that of non-robust analogs. We provide an efficient implementation of our algorithms in an open-source Python library called linlearn, by means of which we carry out numerical experiments that confirm our theoretical findings, together with a comparison to other recent approaches proposed in the literature. Comment: accepted version
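
    As a rough illustration of how a pipeline of this kind can be instantiated for vanilla $s$-sparsity, the sketch below runs proximal gradient descent on the least-squares loss with an $\ell_1$ penalty, replacing the full-sample gradient with a coordinate-wise median-of-means gradient over blocks. This is a simplified stand-in with assumed defaults (step size, block count, function names), not the linlearn implementation or the exact estimators analyzed in the paper.

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t * ||.||_1, which promotes sparsity."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def mom_gradient(X, y, w, n_blocks, rng):
    """Coordinate-wise median over blocks of per-block least-squares gradients.

    Simplified stand-in for a robust gradient estimator; the block count and
    the coordinate-wise median are assumptions of this sketch.
    """
    idx = rng.permutation(len(y))
    grads = []
    for block in np.array_split(idx, n_blocks):
        Xb, yb = X[block], y[block]
        grads.append(Xb.T @ (Xb @ w - yb) / len(block))
    return np.median(np.array(grads), axis=0)

def robust_sparse_regression(X, y, lam=0.1, step=0.01, n_blocks=20,
                             n_iter=500, seed=0):
    """Proximal gradient descent with a median-of-means gradient (illustrative only)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = mom_gradient(X, y, w, n_blocks, rng)
        w = soft_threshold(w - step * grad, step * lam)
    return w
```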