
    A Knowledge Transfer Framework for Differentially Private Sparse Learning

    We study the problem of estimating high-dimensional models with underlying sparse structure while preserving the privacy of each training example. We develop a differentially private high-dimensional sparse learning framework using the idea of knowledge transfer. More specifically, we propose to distill the knowledge from a "teacher" estimator trained on a private dataset by creating a new dataset from auxiliary features, and then to train a differentially private "student" estimator using this new dataset. In addition, we establish the linear convergence rate as well as the utility guarantee for our proposed method. For sparse linear regression and sparse logistic regression, our method achieves improved utility guarantees compared with the best known results (Kifer et al., 2012; Wang and Gu, 2019). We further demonstrate the superiority of our framework through both synthetic and real-world data experiments. Comment: 24 pages, 2 figures, 3 tables
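    To make the teacher-student recipe concrete, here is a minimal sketch under stated assumptions: Lasso stands in for the sparse estimator and simple Gaussian output perturbation stands in for a calibrated DP mechanism, neither of which is the paper's exact construction.

```python
# Hedged sketch of knowledge transfer for private sparse learning.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Private data: n examples, p features, sparse ground truth.
n, p, n_aux = 200, 50, 400
beta_true = np.zeros(p)
beta_true[:5] = 1.0
X_priv = rng.standard_normal((n, p))
y_priv = X_priv @ beta_true + 0.1 * rng.standard_normal(n)

# 1) Train a non-private "teacher" on the private data.
teacher = Lasso(alpha=0.05).fit(X_priv, y_priv)

# 2) Label auxiliary (public, unlabeled) features with the teacher,
#    creating a new dataset that carries the teacher's knowledge.
X_aux = rng.standard_normal((n_aux, p))
y_aux = teacher.predict(X_aux)

# 3) Train the "student" on the distilled dataset, then privatize it.
#    Gaussian output perturbation is a placeholder here; the noise
#    scale sigma would be calibrated to (epsilon, delta) in practice.
student = Lasso(alpha=0.05).fit(X_aux, y_aux)
sigma = 0.01  # placeholder, NOT a calibrated DP noise level
beta_private = student.coef_ + sigma * rng.standard_normal(p)
```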

    Minimax Optimality In High-Dimensional Classification, Clustering, And Privacy

    The age of "Big Data" features large volumes of massive, high-dimensional datasets, leading to the fast emergence of new algorithms as well as new concerns such as privacy and fairness. To compare different algorithms with (or without) these new constraints, minimax decision theory provides a principled framework for quantifying the optimality of algorithms and investigating the fundamental difficulty of statistical problems. Under the framework of minimax theory, this thesis addresses the following four problems:
    1. The first part develops an optimality theory for linear discriminant analysis in the high-dimensional setting. In addition, we consider classification with incomplete data under the missing completely at random (MCAR) model.
    2. The second part studies high-dimensional sparse Quadratic Discriminant Analysis (QDA) and aims to establish the optimal convergence rates.
    3. The third part studies the optimality of high-dimensional clustering in the unsupervised setting under the Gaussian mixture model. We propose an EM-based procedure that attains the optimal rate of convergence for the excess mis-clustering error.
    4. The fourth part investigates minimax optimality under the privacy constraint for mean estimation and linear regression, in both the classical low-dimensional and modern high-dimensional settings.
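    As a concrete illustration of the fourth part, the sketch below shows the simplest private estimation task it covers: epsilon-differentially private mean estimation. The clipping bound B and the Laplace mechanism are standard textbook choices assumed here for illustration, not details drawn from the thesis.

```python
# Hedged sketch: epsilon-DP mean estimation via clip-and-noise.
import numpy as np

def private_mean(x, epsilon, B=1.0, rng=None):
    """Clip each sample to [-B, B], then add Laplace noise scaled to
    the sensitivity of the clipped mean, which is 2B / (n * epsilon)."""
    rng = rng or np.random.default_rng()
    x_clipped = np.clip(x, -B, B)
    sensitivity = 2 * B / len(x)
    noise = rng.laplace(scale=sensitivity / epsilon)
    return x_clipped.mean() + noise

x = np.random.default_rng(1).normal(0.2, 0.5, size=1000)
print(private_mean(x, epsilon=1.0))  # close to 0.2, with DP noise
```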

    Compressed Regression

    Recent research has studied the role of sparsity in high-dimensional regression and signal reconstruction, establishing theoretical limits for recovering sparse models from sparse data. This line of work shows that $\ell_1$-regularized least squares regression can accurately estimate a sparse linear model from $n$ noisy examples in $p$ dimensions, even if $p$ is much larger than $n$. In this paper we study a variant of this problem where the original $n$ input variables are compressed by a random linear transformation to $m \ll n$ examples in $p$ dimensions, and establish conditions under which a sparse linear model can be successfully recovered from the compressed data. A primary motivation for this compression procedure is to anonymize the data and preserve privacy by revealing little information about the original data. We characterize the number of random projections that are required for $\ell_1$-regularized compressed regression to identify the nonzero coefficients in the true model with probability approaching one, a property called "sparsistence." In addition, we show that $\ell_1$-regularized compressed regression asymptotically predicts as well as an oracle linear model, a property called "persistence." Finally, we characterize the privacy properties of the compression procedure in information-theoretic terms, establishing upper bounds on the mutual information between the compressed and uncompressed data that decay to zero. Comment: 59 pages, 5 figures. Submitted for review
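    A minimal sketch of the procedure: compress the $n$ rows down to $m \ll n$ rows with a random Gaussian projection, then run $\ell_1$-regularized least squares on the compressed data. The specific dimensions and the use of sklearn's Lasso are illustrative assumptions.

```python
# Hedged sketch of compressed regression (project, then lasso).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, m = 500, 200, 120          # m << n compressed examples
beta_true = np.zeros(p)
beta_true[:10] = 1.0

X = rng.standard_normal((n, p))
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Random linear compression Phi: only (Z, w) is released, revealing
# little information about individual rows of (X, y).
Phi = rng.standard_normal((m, n)) / np.sqrt(m)
Z, w = Phi @ X, Phi @ y

# Sparsistence in action: with enough projections, the lasso on the
# compressed data recovers the support of beta_true.
fit = Lasso(alpha=0.05).fit(Z, w)
print(np.nonzero(fit.coef_)[0])  # indices of estimated nonzeros
```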

    Empowering differential networks using Bayesian analysis

    Differential networks (DNs) are important tools for modeling changes in conditional dependencies between multiple samples. A Bayesian approach for estimating DNs, motivated from the classical viewpoint, is introduced with a computationally efficient threshold selection for graphical model determination. The algorithm separately estimates the precision matrices of the DN using the Bayesian adaptive graphical lasso procedure. Synthetic experiments illustrate that the Bayesian DN performs exceptionally well in numerical accuracy and graphical structure determination in comparison to state-of-the-art methods. The proposed method is applied to South African COVID-19 data to investigate the change in DN structure between various phases of the pandemic.
    DATA AVAILABILITY: The spambase dataset is available from https://archive.ics.uci.edu/ml/datasets/spambase. The corresponding COVID-19 data are available from https://www.nicd.ac.za/diseases-a-z-index/diseaseindex-covid-19/surveillance-reports/ and https://ourworldindata.org/coronavirus/country/southafrica.
    SUPPORTING INFORMATION: S1 File (supplementary material) contains a block Gibbs sampler, as well as additional optimal-threshold, adjacency-heatmap, and graphical-network figures for dimensions p = 30 and p = 100. https://doi.org/10.1371/journal.pone.0261193.s001
    FUNDING: The National Research Foundation (NRF) of South Africa.
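    The core idea, estimating a precision matrix per sample group and comparing them, can be sketched as follows. The paper's Bayesian adaptive graphical lasso is replaced here by sklearn's frequentist GraphicalLasso purely for illustration, and the hard threshold is a placeholder, not the paper's selection procedure.

```python
# Hedged sketch of a differential network (DN) between two groups.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
p = 10
X1 = rng.standard_normal((300, p))   # group 1 (e.g. one pandemic phase)
X2 = rng.standard_normal((300, p))   # group 2 (another phase)
X2[:, 0] += 0.8 * X2[:, 1]           # group 2 gains one extra edge

# Estimate the two precision matrices separately (frequentist stand-in
# for the Bayesian adaptive graphical lasso used in the paper).
Theta1 = GraphicalLasso(alpha=0.1).fit(X1).precision_
Theta2 = GraphicalLasso(alpha=0.1).fit(X2).precision_

# The DN: entries whose conditional dependence changed between groups.
Delta = Theta2 - Theta1
edges = np.abs(Delta) > 0.2          # placeholder threshold
print(np.argwhere(np.triu(edges, k=1)))  # changed edges as (i, j) pairs
```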