A Knowledge Transfer Framework for Differentially Private Sparse Learning
We study the problem of estimating high-dimensional models with underlying
sparse structures while preserving the privacy of each training example. We
develop a differentially private high-dimensional sparse learning framework
using the idea of knowledge transfer. More specifically, we propose to distill
the knowledge from a "teacher" estimator trained on a private dataset, by
creating a new dataset from auxiliary features, and then train a differentially
private "student" estimator using this new dataset. In addition, we establish
the linear convergence rate as well as the utility guarantee for our proposed
method. For sparse linear regression and sparse logistic regression, our method
achieves improved utility guarantees compared with the best known results
(Kifer et al., 2012; Wang and Gu, 2019). We further demonstrate the superiority
of our framework through both synthetic and real-world data experiments.
Comment: 24 pages, 2 figures, 3 tables
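The teacher–student idea above can be sketched in a few lines. This is a minimal illustration, not the authors' exact algorithm: the sparse solver is plain ISTA for the lasso, the auxiliary features are assumed to be freely available synthetic data, and the privacy step is a crude output perturbation whose noise scale is not calibrated to any (epsilon, delta) budget.

```python
import numpy as np

rng = np.random.default_rng(0)

# Private training data for a sparse linear model (illustrative sizes).
n, p, s = 200, 50, 5
beta_true = np.zeros(p)
beta_true[:s] = 1.0
X_priv = rng.standard_normal((n, p))
y_priv = X_priv @ beta_true + 0.1 * rng.standard_normal(n)

def ista_lasso(X, y, lam, n_iter=500):
    """Plain ISTA for the lasso; stands in for the sparse teacher/student solvers."""
    L = np.linalg.norm(X, 2) ** 2 / len(y)   # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / len(y)
        z = beta - grad / L
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return beta

# 1) Teacher: fit a (non-private) sparse estimator on the private data.
beta_teacher = ista_lasso(X_priv, y_priv, lam=0.05)

# 2) Knowledge transfer: label auxiliary features with the teacher.
X_aux = rng.standard_normal((n, p))
y_aux = X_aux @ beta_teacher

# 3) Student: fit on the pseudo-labelled data, then privatise by output
#    perturbation (Gaussian noise; a stand-in for the paper's mechanism).
beta_student = ista_lasso(X_aux, y_aux, lam=0.05)
sigma = 0.01  # would be calibrated to the privacy budget in a real analysis
beta_private = beta_student + sigma * rng.standard_normal(p)
```

The point of the construction is that only the student ever needs a privacy mechanism: the teacher's fit touches the private data, but only its predictions on auxiliary features are passed on.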
Minimax Optimality In High-Dimensional Classification, Clustering, And Privacy
The age of “Big Data” features large volumes of massive, high-dimensional datasets, leading to the fast emergence of new algorithms as well as new concerns such as privacy and fairness. To compare different algorithms with (or without) these new constraints, minimax decision theory provides a principled framework to quantify the optimality of algorithms and investigate the fundamental difficulty of statistical problems. Under the framework of minimax theory, this thesis aims to address the following four problems:
1. The first part of this thesis aims to develop an optimality theory for linear discriminant analysis in the high-dimensional setting. In addition, we consider classification with incomplete data under the missing completely at random (MCAR) model.
2. In the second part, we study high-dimensional sparse Quadratic Discriminant Analysis (QDA) and aim to establish the optimal convergence rates.
3. In the third part, we study the optimality of high-dimensional clustering in the unsupervised setting under the Gaussian mixture model. We propose an EM-based procedure with the optimal rate of convergence for the excess mis-clustering error.
4. In the fourth part, we investigate the minimax optimality under the privacy constraint for mean estimation and linear regression models, under both the classical low-dimensional and modern high-dimensional settings.
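The EM-based clustering procedure in part 3 can be illustrated on the simplest instance of the model: a symmetric two-component Gaussian mixture with unit variance, where the E-step posterior has the closed form tanh(⟨x, θ⟩) and the M-step is a weighted mean. This toy sketch is an assumption about the setting, not the thesis' actual procedure or rates.

```python
import numpy as np

rng = np.random.default_rng(3)

# Symmetric two-component mixture: x_i = z_i * mu + noise, with z_i in {+1, -1}.
n, d = 1000, 2
mu = np.array([2.0, 0.0])
z = rng.choice([-1.0, 1.0], size=n)
X = z[:, None] * mu + rng.standard_normal((n, d))

# EM with unit variance and symmetric component means {+theta, -theta}:
# the E-step posterior mean E[z_i | x_i, theta] is tanh(<x_i, theta>),
# and the M-step is the correspondingly weighted sample mean.
theta = np.array([0.1, 0.1])               # small nonzero init to break symmetry
for _ in range(100):
    w = np.tanh(X @ theta)                 # E-step: soft assignments in [-1, 1]
    theta = (w[:, None] * X).mean(axis=0)  # M-step: update the mean estimate

# Cluster by the sign of the projection; mis-clustering error is measured
# up to the unavoidable global label swap.
pred = np.sign(X @ theta)
err = min(np.mean(pred != z), np.mean(pred == z))
```

With separation ‖mu‖ = 2, the error of the converged estimator is close to the Bayes-optimal value Φ(−‖mu‖) ≈ 2.3%; the excess over that benchmark is the quantity whose optimal rate the thesis studies.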
Compressed Regression
Recent research has studied the role of sparsity in high dimensional
regression and signal reconstruction, establishing theoretical limits for
recovering sparse models from sparse data. This line of work shows that
$\ell_1$-regularized least squares regression can accurately estimate a sparse
linear model from $n$ noisy examples in $p$ dimensions, even if $p$ is much
larger than $n$. In this paper we study a variant of this problem where the
original $n$ input variables are compressed by a random linear transformation
to $m \ll n$ examples in $p$ dimensions, and establish conditions under which a
sparse linear model can be successfully recovered from the compressed data. A
primary motivation for this compression procedure is to anonymize the data and
preserve privacy by revealing little information about the original data. We
characterize the number of random projections that are required for
$\ell_1$-regularized compressed regression to identify the nonzero coefficients
in the true model with probability approaching one, a property called
``sparsistence.'' In addition, we show that $\ell_1$-regularized compressed
regression asymptotically predicts as well as an oracle linear model, a
property called ``persistence.'' Finally, we characterize the privacy
properties of the compression procedure in information-theoretic terms,
establishing upper bounds on the mutual information between the compressed and
uncompressed data that decay to zero.
Comment: 59 pages, 5 figures, submitted for review
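The compression-then-lasso pipeline described above can be sketched directly. This is a toy illustration under assumed problem sizes, not the paper's analysis: both the design matrix and the response are premultiplied by the same random projection, and an ISTA lasso solver (a standard stand-in, not the paper's) is run on the compressed pair.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sparse ground truth: p variables, s nonzero (illustrative sizes).
n, p, s = 400, 100, 4
beta = np.zeros(p)
beta[:s] = 2.0
X = rng.standard_normal((n, p))
y = X @ beta + 0.1 * rng.standard_normal(n)

# Compression: an m x n random projection applied to BOTH X and y, so the
# data holder releases only (Phi @ X, Phi @ y), never the raw rows.
m = 150
Phi = rng.standard_normal((m, n)) / np.sqrt(m)
X_c, y_c = Phi @ X, Phi @ y

def ista_lasso(X, y, lam, n_iter=1000):
    """ISTA solver for l1-regularized least squares (the lasso)."""
    L = np.linalg.norm(X, 2) ** 2 / len(y)
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = X.T @ (X @ b - y) / len(y)
        z = b - g / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return b

# Fit on the compressed data and read off the recovered support
# ("sparsistence" asks when this support matches the truth).
beta_hat = ista_lasso(X_c, y_c, lam=0.05)
support = np.flatnonzero(np.abs(beta_hat) > 0.5)
```

Note that the recipient only ever sees the $m$ compressed rows; support recovery from them is exactly the sparsistence property the abstract refers to.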
Empowering differential networks using Bayesian analysis
Differential networks (DN) are important tools for modeling the changes in conditional
dependencies between multiple samples. A Bayesian approach for estimating DNs, from
the classical viewpoint, is introduced with a computationally efficient threshold selection for
graphical model determination. The algorithm separately estimates the precision matrices
of the DN using the Bayesian adaptive graphical lasso procedure. Synthetic experiments
illustrate that the Bayesian DN performs exceptionally well in numerical accuracy and graphical structure determination in comparison to state-of-the-art methods. The proposed method
is applied to South African COVID-19 data to investigate the change in DN structure
between various phases of the pandemic.
DATA AVAILABILITY STATEMENT: The data underlying the results presented in the study are available from https://archive.ics.uci.edu/ml/datasets/spambase for the spambase dataset. The corresponding COVID-19 data are available from https://www.nicd.ac.za/diseases-a-z-index/diseaseindex-covid-19/surveillance-reports/ and https://ourworldindata.org/coronavirus/country/southafrica.
SUPPORTING INFORMATION: S1 File. Supplementary material. Contains a block Gibbs sampler, as well as additional optimal threshold, adjacency heatmap, and graphical network figures for dimensions p = 30 and p = 100. https://doi.org/10.1371/journal.pone.0261193.s001
FUNDING: The National Research Foundation (NRF) of South Africa.
http://www.plosone.org
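The differential-network idea above — estimate a precision matrix per condition, difference them, and threshold to find changed conditional dependencies — can be sketched as follows. This is a deliberately simplified frequentist stand-in: the ridge-regularized inverse covariance below replaces the paper's Bayesian adaptive graphical lasso, and the threshold is an assumed constant rather than the paper's selection procedure.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 10

# Two conditions whose precision (inverse covariance) matrices differ only in
# the partial correlation on edge (0, 1): present in condition 1, absent in 2.
Theta1 = np.eye(p)
Theta1[0, 1] = Theta1[1, 0] = 0.4
Theta2 = np.eye(p)
X1 = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta1), size=2000)
X2 = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta2), size=2000)

def precision_estimate(X, eps=1e-2):
    """Ridge-regularized inverse sample covariance -- a crude stand-in
    for the paper's Bayesian adaptive graphical lasso estimator."""
    S = np.cov(X, rowvar=False)
    return np.linalg.inv(S + eps * np.eye(X.shape[1]))

# The differential network is the difference of the two precision estimates;
# thresholding its entries flags the conditional dependencies that changed.
Delta = precision_estimate(X1) - precision_estimate(X2)
changed = np.abs(Delta) > 0.2
np.fill_diagonal(changed, False)
```

Working with the difference directly is what makes the DN attractive: edges common to both conditions cancel, so only the change in structure, here the single (0, 1) edge, needs to be detected.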