
    Fast, Sample-Efficient, Affine-Invariant Private Mean and Covariance Estimation for Subgaussian Distributions

    We present a fast, differentially private algorithm for high-dimensional covariance-aware mean estimation with nearly optimal sample complexity. Only exponential-time estimators were previously known to achieve this guarantee. Given $n$ samples from a (sub-)Gaussian distribution with unknown mean $\mu$ and covariance $\Sigma$, our $(\varepsilon,\delta)$-differentially private estimator produces $\tilde{\mu}$ such that $\|\mu - \tilde{\mu}\|_{\Sigma} \leq \alpha$ as long as $n \gtrsim \tfrac{d}{\alpha^2} + \tfrac{d \sqrt{\log 1/\delta}}{\alpha \varepsilon} + \tfrac{d \log 1/\delta}{\varepsilon}$. The Mahalanobis error metric $\|\mu - \hat{\mu}\|_{\Sigma}$ measures the distance between $\hat{\mu}$ and $\mu$ relative to $\Sigma$; it characterizes the error of the sample mean. Our algorithm runs in time $\tilde{O}(nd^{\omega - 1} + nd/\varepsilon)$, where $\omega < 2.38$ is the matrix multiplication exponent. We adapt an exponential-time approach of Brown, Gaboardi, Smith, Ullman, and Zakynthinou (2021), giving efficient variants of stable mean and covariance estimation subroutines that also improve the sample complexity to the nearly optimal bound above. Our stable covariance estimator can be turned into a private covariance estimator for unrestricted subgaussian distributions. With $n \gtrsim d^{3/2}$ samples, our estimate is accurate in spectral norm. This is the first such algorithm using $n = o(d^2)$ samples, answering an open question posed by Alabi et al. (2022). With $n \gtrsim d^2$ samples, our estimate is accurate in Frobenius norm. This leads to a fast, nearly optimal algorithm for private learning of unrestricted Gaussian distributions in TV distance. Duchi, Haque, and Kuditipudi (2023) obtained similar results independently and concurrently. Comment: 44 pages. New version fixes typos and includes additional exposition and discussion of related work.
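
    The Mahalanobis error above is $\|\mu - \hat{\mu}\|_{\Sigma} = \sqrt{(\mu - \hat{\mu})^\top \Sigma^{-1} (\mu - \hat{\mu})}$ and is easy to evaluate directly. The sketch below is a minimal NumPy illustration of the metric itself (the function name mahalanobis_error and the toy data are ours), not of the private estimator described in the abstract.

```python
import numpy as np

def mahalanobis_error(mu, mu_hat, Sigma):
    """Evaluate ||mu - mu_hat||_Sigma = sqrt((mu - mu_hat)^T Sigma^{-1} (mu - mu_hat)).

    Illustrative only: this computes the error metric from the abstract,
    not the private estimator itself. Sigma is assumed positive definite.
    """
    diff = mu - mu_hat
    # Solve Sigma z = diff rather than forming Sigma^{-1} explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(Sigma, diff)))

# Example: the (non-private) sample mean's error relative to the true covariance,
# which is typically on the order of sqrt(d/n).
rng = np.random.default_rng(0)
d, n = 5, 1000
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)          # a positive-definite covariance
mu = rng.standard_normal(d)
X = rng.multivariate_normal(mu, Sigma, size=n)
print(mahalanobis_error(mu, X.mean(axis=0), Sigma))
```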

    Privately Estimating a Gaussian: Efficient, Robust and Optimal

    In this work, we give efficient algorithms for privately estimating a Gaussian distribution in both the pure and approximate differential privacy (DP) models, with optimal dependence on the dimension in the sample complexity. In the pure DP setting, we give an efficient algorithm that estimates an unknown $d$-dimensional Gaussian distribution up to an arbitrarily tiny total variation error using $\widetilde{O}(d^2 \log \kappa)$ samples while tolerating a constant fraction of adversarial outliers. Here, $\kappa$ is the condition number of the target covariance matrix. The sample bound matches the best non-private estimators in the dependence on the dimension (up to a polylogarithmic factor). We prove a new lower bound on differentially private covariance estimation to show that the dependence on the condition number $\kappa$ in the above sample bound is also tight. Prior to our work, only identifiability results (yielding inefficient super-polynomial-time algorithms) were known for the problem. In the approximate DP setting, we give an efficient algorithm to estimate an unknown Gaussian distribution up to an arbitrarily tiny total variation error using $\widetilde{O}(d^2)$ samples while tolerating a constant fraction of adversarial outliers. Prior to our work, all efficient approximate DP algorithms incurred a super-quadratic sample cost or were not outlier-robust. For the special case of mean estimation, our algorithm achieves the optimal sample complexity of $\widetilde{O}(d)$, improving on an $\widetilde{O}(d^{1.5})$ bound from prior work. Our pure DP algorithm relies on a recursive private preconditioning subroutine that utilizes recent work on private mean estimation [Hopkins et al., 2022]. Our approximate DP algorithms are based on a substantial upgrade of the method of stabilizing convex relaxations introduced in [Kothari et al., 2022].
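
    For context on what an $(\varepsilon,\delta)$-DP covariance-type estimate involves, the following sketch applies the standard Gaussian mechanism to a clipped empirical second-moment matrix. It is a naive baseline under an assumed clipping radius $R$ (the names and parameters are ours), not the outlier-robust, condition-number-optimal algorithm of this paper, and it does not achieve the guarantees stated above.

```python
import numpy as np

def dp_second_moment(X, R, eps, delta, rng=None):
    """Naive (eps, delta)-DP estimate of the second-moment matrix E[x x^T].

    Simplified Gaussian-mechanism baseline, not the paper's algorithm:
    rows are clipped to norm R, so replacing one sample changes the averaged
    outer product by at most 2 R^2 / n in Frobenius norm (the L2 sensitivity).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xc = X * np.minimum(1.0, R / np.maximum(norms, 1e-12))  # clip rows to radius R
    S = Xc.T @ Xc / n                                        # clipped empirical second moment
    sensitivity = 2.0 * R**2 / n
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    noisy = S + sigma * rng.standard_normal((d, d))          # Gaussian mechanism, entrywise
    return (noisy + noisy.T) / 2.0                           # symmetrize (post-processing)
```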

    Robust classification via MOM minimization

    We present an extension of Vapnik's classical empirical risk minimizer (ERM) in which the empirical risk is replaced by a median-of-means (MOM) estimator; the resulting estimators are called MOM minimizers. While ERM is sensitive to corruption of the dataset for many classical loss functions used in classification, we show that MOM minimizers behave well in theory, in the sense that they achieve Vapnik's (slow) rates of convergence under weak assumptions: the data are only required to have a finite second moment, and some outliers may also have corrupted the dataset. We propose algorithms inspired by MOM minimizers, which can be analyzed using arguments quite similar to those used for Stochastic Block Gradient descent. As a proof of concept, we show how to modify a proof of consistency for a descent algorithm to prove consistency of its MOM version. Since MOM algorithms perform a smart subsampling, our procedure can also substantially reduce computation time and memory resources when applied to nonlinear algorithms. These empirical performances are illustrated on both simulated and real datasets.
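
    As a concrete illustration of the median-of-means idea underlying these estimators, the sketch below computes a MOM mean estimate: shuffle the sample, split it into blocks, average within each block, and take the median of the block means. This is a minimal sketch of the MOM principle only, not the MOM risk minimizer analyzed in the paper; the block count is an assumed parameter.

```python
import numpy as np

def median_of_means(x, n_blocks, rng=None):
    """Median-of-means estimate of the mean of x (coordinate-wise for vector data).

    Minimal sketch of the MOM principle, not the MOM risk minimizer analyzed
    in the paper: shuffle the sample, split it into n_blocks blocks, average
    each block, and return the median of the block means.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    blocks = np.array_split(x[rng.permutation(len(x))], n_blocks)
    block_means = np.array([b.mean(axis=0) for b in blocks])
    return np.median(block_means, axis=0)

# Heavy-tailed data with a few gross outliers: the plain sample mean is
# dragged far from 0, while the median-of-means estimate stays close to it.
rng = np.random.default_rng(1)
x = rng.standard_t(df=2, size=10_000)
x[:20] += 1e4
print(x.mean(), median_of_means(x, n_blocks=50, rng=rng))
```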

    Large-Scale Nonparametric and Semiparametric Inference for Large, Complex, and Noisy Datasets

    Massive Data bring new opportunities and challenges to data scientists and statisticians. On one hand, Massive Data hold great promise for discovering subtle population patterns and heterogeneities that are not detectable with small-scale data. On the other hand, the size and dimensionality of Massive Data introduce unique statistical challenges and consequences for model misspecification. Some important factors are as follows. Complexity: since Massive Data are often aggregated from multiple sources, they often exhibit heavy-tailed behavior with nontrivial tail dependence. Noise: Massive Data usually contain various types of measurement error, outliers, and missing values. Dependence: in many data types, such as financial time series, functional magnetic resonance imaging (fMRI), and time-course microarray data, the samples are dependent, with relatively weak signals. These challenges are difficult to address and require new computational and statistical tools. More specifically, to handle these challenges, it is necessary to develop statistical methods that are robust to data complexity, noise, and dependence. Our work aims to make headway in resolving these issues. Notably, we give a unified framework for analyzing high-dimensional, complex, noisy datasets having temporal/spatial dependence. The proposed methods enjoy good theoretical properties. Their empirical usefulness is also verified in large-scale neuroimaging and financial data analysis.

    Online Robust Mean Estimation

    We study the problem of high-dimensional robust mean estimation in an online setting. Specifically, we consider a scenario where $n$ sensors are measuring some common, ongoing phenomenon. At each time step $t = 1, 2, \ldots, T$, the $i$-th sensor reports its readings $x^{(i)}_t$ for that time step. The algorithm must then commit to its estimate $\mu_t$ for the true mean value of the process at time $t$. We assume that most of the sensors observe independent samples from some common distribution $X$, but an $\epsilon$-fraction of them may instead behave maliciously. The algorithm wishes to compute a good approximation $\mu$ to the true mean $\mu^\ast := \mathbf{E}[X]$. We note that if the algorithm is allowed to wait until time $T$ to report its estimate, this reduces to the well-studied problem of robust mean estimation. However, the requirement that our algorithm produce partial estimates as the data come in substantially complicates the situation. We prove two main results about online robust mean estimation in this model. First, if the uncorrupted samples satisfy the standard condition of $(\epsilon,\delta)$-stability, we give an efficient online algorithm that outputs estimates $\mu_t$, $t \in [T]$, such that with high probability it holds that $\|\mu - \mu^\ast\|_2 = O(\delta \log(T))$, where $\mu = (\mu_t)_{t \in [T]}$. We note that this error bound is nearly competitive with the best offline algorithms, which would achieve $\ell_2$-error of $O(\delta)$. Our second main result shows that with additional assumptions on the input (most notably that $X$ is a product distribution) there are inefficient algorithms whose error does not depend on $T$ at all. Comment: To appear in SODA202
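
    To make the online interface concrete, here is a toy protocol sketch in which the algorithm commits, at each time step, to a coordinate-wise trimmed mean of that round's sensor readings. It only illustrates the setting (an estimate $\mu_t$ must be produced before future data arrive); it is a stand-in of our own, not the stability-based algorithm with the $O(\delta \log T)$ guarantee, and the trimming fraction and data are assumptions.

```python
import numpy as np

def trimmed_mean(v, eps):
    """Coordinate-wise trimmed mean: drop the eps-fraction smallest and largest readings."""
    v = np.sort(np.asarray(v, dtype=float), axis=0)
    k = int(np.ceil(eps * v.shape[0]))
    return v[k: v.shape[0] - k].mean(axis=0) if v.shape[0] > 2 * k else v.mean(axis=0)

def online_estimates(readings, eps):
    """Toy online protocol: at each step t, commit to an estimate mu_t using only
    the readings available at that step (here, just the current round's readings).

    This illustrates the interface only; the paper's algorithm exploits
    (eps, delta)-stability of the uncorrupted samples, which this does not.
    """
    return np.array([trimmed_mean(x_t, eps) for x_t in readings])

# n sensors report d-dimensional readings for T rounds; the same eps-fraction
# of sensors behaves maliciously in every round.
rng = np.random.default_rng(2)
n, d, T, eps = 100, 3, 5, 0.1
readings = [rng.standard_normal((n, d)) for _ in range(T)]
for x_t in readings:
    x_t[: int(eps * n)] = 100.0      # corrupted sensors report a large constant
print(online_estimates(readings, eps).round(2))   # all estimates stay near 0
```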

    Robust Methods for High-Dimensional Linear Learning

    We propose statistically robust and computationally efficient linear learning methods in the high-dimensional batch setting, where the number of features $d$ may exceed the sample size $n$. We employ, in a generic learning setting, two algorithms depending on whether the considered loss function is gradient-Lipschitz or not. We then instantiate our framework on several applications, including vanilla sparse, group-sparse, and low-rank matrix recovery. This leads, for each application, to efficient and robust learning algorithms that reach near-optimal estimation rates under heavy-tailed distributions and in the presence of outliers. For vanilla $s$-sparsity, we are able to reach the $s \log(d)/n$ rate under heavy tails and $\eta$-corruption, at a computational cost comparable to that of non-robust analogs. We provide an efficient implementation of our algorithms in an open-source Python library called linlearn, by means of which we carry out numerical experiments that confirm our theoretical findings, together with a comparison to other recent approaches proposed in the literature. Comment: accepted version
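
    As a rough illustration of how a pipeline of this kind can be instantiated for vanilla $s$-sparsity, the sketch below runs proximal gradient descent on the least-squares loss with an $\ell_1$ penalty, replacing the full-sample gradient with a coordinate-wise median-of-means gradient over blocks. This is a simplified stand-in with assumed defaults (step size, block count, function names), not the linlearn implementation or the exact estimators analyzed in the paper.

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t * ||.||_1, which promotes sparsity."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def mom_gradient(X, y, w, n_blocks, rng):
    """Coordinate-wise median over blocks of per-block least-squares gradients.

    Simplified stand-in for a robust gradient estimator; the block count and
    the coordinate-wise median are assumptions of this sketch.
    """
    idx = rng.permutation(len(y))
    grads = []
    for block in np.array_split(idx, n_blocks):
        Xb, yb = X[block], y[block]
        grads.append(Xb.T @ (Xb @ w - yb) / len(block))
    return np.median(np.array(grads), axis=0)

def robust_sparse_regression(X, y, lam=0.1, step=0.01, n_blocks=20,
                             n_iter=500, seed=0):
    """Proximal gradient descent with a median-of-means gradient (illustrative only)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = mom_gradient(X, y, w, n_blocks, rng)
        w = soft_threshold(w - step * grad, step * lam)
    return w
```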