
    Differential Private Stack Generalization with an Application to Diabetes Prediction

    To meet the standard of differential privacy, noise is usually added to the original data, which inevitably degrades the predictive performance of subsequent learning algorithms. In this paper, motivated by the success of ensemble learning in improving predictive performance, we propose to enhance privacy-preserving logistic regression by stacking. We show that this can be done either by sample-based or feature-based partitioning. However, we prove that for the same privacy budget, feature-based partitioning requires fewer samples than sample-based partitioning, and thus likely has better empirical performance. As transfer learning is difficult to integrate with a differential privacy guarantee, we further combine the proposed method with hypothesis transfer learning to address the problem of learning across different organizations. Finally, we not only demonstrate the effectiveness of our method on two benchmark data sets, i.e., MNIST and NEWS20, but also apply it to a real application of cross-organizational diabetes prediction on the RUIJIN data set, where privacy is of significant concern.
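
    A minimal sketch of the feature-based partitioning idea described above, assuming output perturbation (Laplace noise added to learned weights) as the private base learner and an even privacy-budget split; the noise scale, sensitivity constant, budget split, and meta-learner are illustrative assumptions rather than the authors' construction.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def noisy_logreg(X, y, epsilon, sensitivity=1.0, rng=None):
            # Output perturbation: fit, then add Laplace noise to the weight vector.
            # The sensitivity constant is a placeholder, not a derived bound.
            rng = rng if rng is not None else np.random.default_rng()
            clf = LogisticRegression(max_iter=1000).fit(X, y)
            w = np.append(clf.coef_.ravel(), clf.intercept_[0])
            return w + rng.laplace(scale=sensitivity / epsilon, size=w.shape)

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def stacked_private_model(X, y, n_blocks=4, epsilon_total=1.0, seed=0):
            # Split the feature set (not the samples) into disjoint blocks, train one
            # noisy base learner per block, then stack their scores with a noisy
            # meta-learner trained on the same labels.
            rng = np.random.default_rng(seed)
            blocks = np.array_split(rng.permutation(X.shape[1]), n_blocks)
            eps_each = epsilon_total / (n_blocks + 1)   # naive even budget split
            base_w = [noisy_logreg(X[:, b], y, eps_each, rng=rng) for b in blocks]

            def base_scores(Xin):
                return np.column_stack(
                    [sigmoid(Xin[:, b] @ w[:-1] + w[-1]) for b, w in zip(blocks, base_w)])

            meta_w = noisy_logreg(base_scores(X), y, eps_each, rng=rng)
            return lambda Xnew: sigmoid(base_scores(Xnew) @ meta_w[:-1] + meta_w[-1])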

    Learning Privately from Multiparty Data

    Learning a classifier from private data collected by multiple parties is an important problem that has many potential applications. How can we build an accurate and differentially private global classifier by combining locally-trained classifiers from different parties, without access to any party's private data? We propose to transfer the 'knowledge' of the local classifier ensemble by first creating labeled data from auxiliary unlabeled data, and then training a global $\epsilon$-differentially private classifier. We show that majority voting is too sensitive and therefore propose a new risk weighted by class probabilities estimated from the ensemble. Relative to a non-private solution, our private solution has a generalization error bounded by $O(\epsilon^{-2}M^{-2})$, where $M$ is the number of parties. This allows strong privacy without performance loss when $M$ is large, such as in crowdsensing applications. We demonstrate the performance of our method with realistic tasks of activity recognition, network intrusion detection, and malicious URL detection.
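
    A rough sketch of the multiparty transfer step, assuming each party exposes a locally trained scikit-learn-style classifier with predict_proba over a shared label set; noisy averaging of class probabilities is used here as a simplified stand-in for the paper's risk weighted by ensemble class probabilities, and the noise calibration is illustrative.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def transfer_labels(local_models, X_aux, epsilon, rng=None):
            # Average the parties' class-probability estimates on auxiliary unlabeled
            # data, perturb the average with Laplace noise, and take the arg-max as a
            # pseudo-label.  Each party moves the average by at most 1/M, hence the
            # 1/(M * epsilon) scale; this calibration is illustrative only.
            rng = rng if rng is not None else np.random.default_rng()
            M = len(local_models)
            avg = np.mean([m.predict_proba(X_aux) for m in local_models], axis=0)
            noisy = avg + rng.laplace(scale=1.0 / (M * epsilon), size=avg.shape)
            return noisy.argmax(axis=1)

        def private_global_classifier(local_models, X_aux, epsilon):
            # The global model sees only auxiliary features and noisy pseudo-labels,
            # never any party's private training data.
            y_pseudo = transfer_labels(local_models, X_aux, epsilon)
            return LogisticRegression(max_iter=1000).fit(X_aux, y_pseudo)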

    Model-Agnostic Private Learning via Stability

    We design differentially private learning algorithms that are agnostic to the learning model. Our algorithms are interactive in nature, i.e., instead of outputting a model based on the training data, they provide predictions for a set of $m$ feature vectors that arrive online. We show that, for the feature vectors on which an ensemble of models (trained on random disjoint subsets of a dataset) makes consistent predictions, there is almost no cost of privacy in generating accurate predictions for those feature vectors. To that end, we provide a novel coupling of the distance-to-instability framework with the sparse vector technique. We provide algorithms with formal privacy and utility guarantees for both binary/multi-class classification and soft-label classification. For binary classification in the standard (agnostic) PAC model, we show how to bootstrap from our privately generated predictions to construct a computationally efficient private learner that outputs a final accurate hypothesis. Our construction is, to the best of our knowledge, the first computationally efficient construction of a label-private learner. We prove sample complexity upper bounds for this setting. As in non-private sample complexity bounds, the only relevant property of the given concept class is its VC dimension. For soft-label classification, our techniques are based on exploiting the stability properties of traditional learning algorithms, like stochastic gradient descent (SGD). We provide a new technique to boost the average-case stability properties of learning algorithms to strong (worst-case) stability properties, and then exploit them to obtain private classification algorithms. In the process, we also show that a large class of SGD methods satisfy average-case stability properties, in contrast to the smaller class of SGD methods that are uniformly stable, as shown in prior work.
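
    The agreement check at the heart of the online prediction scheme can be sketched roughly as follows, with placeholder noise scales standing in for the calibrated distance-to-instability and sparse-vector parameters in the paper.

        import numpy as np

        def private_prediction(ensemble_votes, threshold, epsilon, rng=None):
            # ensemble_votes: labels predicted for one query point by models trained on
            # disjoint random subsets of the data.  If the noisy margin between the top
            # vote count and the runner-up clears a noisy threshold, release the
            # majority label; otherwise decline.  Noise scales here are placeholders.
            rng = rng if rng is not None else np.random.default_rng()
            labels, counts = np.unique(np.asarray(ensemble_votes), return_counts=True)
            if len(counts) == 1:
                margin = counts[0]
            else:
                top_two = np.sort(counts)[-2:]
                margin = top_two[1] - top_two[0]
            noisy_threshold = threshold + rng.laplace(scale=2.0 / epsilon)
            if margin + rng.laplace(scale=4.0 / epsilon) >= noisy_threshold:
                return labels[np.argmax(counts)]
            return None   # unstable query: the ensemble does not agree strongly enough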

    Information, Privacy and Stability in Adaptive Data Analysis

    Traditional statistical theory assumes that the analysis to be performed on a given data set is selected independently of the data themselves. This assumption breaks down when data are re-used across analyses and the analysis to be performed at a given stage depends on the results of earlier stages. Such dependency can arise when the same data are used by several scientific studies, or when a single analysis consists of multiple stages. How can we draw statistically valid conclusions when data are re-used? This is the focus of a recent and active line of work. At a high level, these results show that limiting the information revealed by earlier stages of analysis controls the bias introduced in later stages by adaptivity. Here we review some known results in this area and highlight the role of information-theoretic concepts, notably several one-shot notions of mutual information. Comment: 15 pages, first drafted February 2017. A version of this survey appears in the Information Theory Society Newsletter.
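
    A toy simulation of the phenomenon the survey reviews: selecting the most extreme of many exact answers computed on one sample overfits that sample, while selecting through noisy answers (a crude stand-in for limiting the information revealed) keeps the reported value close to its true population mean of zero. Sample size, number of queries, and noise scale are arbitrary demo choices.

        import numpy as np

        rng = np.random.default_rng(0)
        n, k = 500, 200                    # sample size, number of candidate queries
        data = rng.normal(size=(n, k))     # every query has true (population) mean 0
        means = data.mean(axis=0)

        exact_pick = means.max()                          # adaptive selection on exact answers
        noisy = means + rng.laplace(scale=0.2, size=k)    # release only noisy answers
        noisy_pick = means[np.argmax(noisy)]              # select on noise, report true empirical mean

        print(f"selected answer, exact release: {exact_pick:+.3f}")
        print(f"selected answer, noisy release: {noisy_pick:+.3f}")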

    Private Identity Testing for High-Dimensional Distributions

    In this work we present novel differentially private identity (goodness-of-fit) testers for natural and widely studied classes of multivariate product distributions: Gaussians in $\mathbb{R}^d$ with known covariance and product distributions over $\{\pm 1\}^d$. Our testers have improved sample complexity compared to those derived from previous techniques, and are the first testers whose sample complexity matches the order-optimal minimax sample complexity of $O(d^{1/2}/\alpha^2)$ in many parameter regimes. We construct two types of testers, exhibiting tradeoffs between sample complexity and computational complexity. Finally, we provide a two-way reduction between testing a subclass of multivariate product distributions and testing univariate distributions, and thereby obtain upper and lower bounds for testing this subclass of product distributions. Comment: Improved the bounds and the writing.
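
    A very rough sketch of what a private identity tester for N(0, I) with known covariance might look like: clip the samples, form a simple test statistic from the empirical mean, add Laplace noise scaled to a crude sensitivity bound, and threshold. The clipping bound, sensitivity bound, and threshold are placeholder choices and are far less sample-efficient than the testers in the paper.

        import numpy as np

        def private_gaussian_identity_test(X, alpha, epsilon, clip=5.0, rng=None):
            # Tests H0: X ~ N(0, I) against alternatives whose mean is alpha-far in L2.
            # Returns True when H0 is rejected.  All constants are illustrative.
            rng = rng if rng is not None else np.random.default_rng()
            n, d = X.shape
            Xc = np.clip(X, -clip, clip)
            stat = np.sum(Xc.mean(axis=0) ** 2)             # squared norm of clipped mean
            sensitivity = 4.0 * clip**2 * d / n             # crude worst-case bound
            noisy_stat = stat + rng.laplace(scale=sensitivity / epsilon)
            return bool(noisy_stat > d / n + alpha**2 / 2)  # d/n is roughly E[stat] under H0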

    Mitigating Bias in Adaptive Data Gathering via Differential Privacy

    Data that is gathered adaptively, for example via bandit algorithms, exhibits bias. This is true both when gathering simple numeric-valued data (the empirical means tracked by stochastic bandit algorithms are biased downwards) and when gathering more complicated data (running hypothesis tests on complex data gathered via contextual bandit algorithms leads to false discovery). In this paper, we show that this problem is mitigated if the data collection procedure is differentially private. This lets us both bound the bias of simple numeric-valued quantities (like the empirical means of stochastic bandit algorithms) and correct the p-values of hypothesis tests run on the adaptively gathered data. Moreover, there exist differentially private bandit algorithms with near-optimal regret bounds: we apply existing theorems in the simple stochastic case, and give a new analysis for linear contextual bandits. We complement our theoretical results with experiments validating our theory. Comment: Conference version appears in ICML 2018.
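
    A toy replication of the downward-bias effect, with noisy arm selection used as a crude stand-in for a differentially private bandit; arm means, horizon, and noise scale are arbitrary demo choices and the privacy accounting is not worked out here.

        import numpy as np

        rng = np.random.default_rng(1)

        def average_arm_bias(noise_scale, reps=200, T=300, explore=0.05, p=0.5):
            # Epsilon-greedy over two identical Bernoulli(p) arms.  Because an arm is
            # revisited mainly while its running mean looks high, the final empirical
            # means are biased below p; noisier selection weakens that effect.
            biases = []
            for _ in range(reps):
                counts, sums = np.zeros(2), np.zeros(2)
                for _ in range(T):
                    est = np.where(counts > 0, sums / np.maximum(counts, 1), np.inf)
                    if rng.random() < explore:
                        arm = int(rng.integers(2))
                    else:
                        arm = int(np.argmax(est + rng.laplace(scale=noise_scale, size=2)))
                    sums[arm] += rng.binomial(1, p)
                    counts[arm] += 1
                biases.append(sums / counts - p)
            return float(np.mean(biases))

        print("average bias, exact greedy selection:", average_arm_bias(noise_scale=0.0))
        print("average bias, noisy selection       :", average_arm_bias(noise_scale=0.3))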

    Boosting Model Performance through Differentially Private Model Aggregation

    A key factor in developing high-performing machine learning models is the availability of sufficiently large datasets. This work is motivated by applications arising in Software as a Service (SaaS) companies where there exist numerous similar yet disjoint datasets from multiple client companies. To overcome the challenge of insufficient data without explicitly aggregating the clients' datasets due to privacy concerns, one solution is to collect more data for each individual client; another is to privately aggregate information from models trained on each client's data. In this work, two approaches for private model aggregation are proposed that enable the transfer of knowledge from existing models trained on other companies' datasets to a new company with limited labeled data, while protecting each client company's underlying individual sensitive information. The two proposed approaches are based on state-of-the-art private learning algorithms: Differentially Private Permutation-based Stochastic Gradient Descent and Approximate Minima Perturbation. We empirically show that by leveraging differentially private techniques, we can enable private model aggregation and augment data utility while providing provable mathematical guarantees on privacy. The proposed methods thus provide significant business value for SaaS companies and their clients, specifically as a solution for the cold-start problem.
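
    One simple way to realize "privately aggregate information from models" for a linear model family, sketched below under the assumption that each incoming weight vector was already released by a differentially private training method (so combining them costs the source companies no extra privacy); the accuracy-weighted average used here is an illustrative choice, not the paper's DP-PSGD or Approximate Minima Perturbation based schemes.

        import numpy as np

        def aggregate_private_models(weight_vectors, X_new, y_new):
            # weight_vectors: list of (d+1,)-arrays [w..., b] from privately trained
            # linear classifiers; X_new, y_new: the new client's small labeled set with
            # labels in {0, 1}.  Returns a single combined weight vector, weighting each
            # source model by its accuracy on the new client's data.
            def accuracy(w):
                scores = X_new @ w[:-1] + w[-1]
                return float(np.mean((scores > 0).astype(int) == y_new))
            accs = np.array([accuracy(w) for w in weight_vectors])
            if accs.sum() == 0:
                weights = np.full(len(accs), 1.0 / len(accs))
            else:
                weights = accs / accs.sum()
            return np.average(np.stack(weight_vectors), axis=0, weights=weights)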

    Statistical Queries and Statistical Algorithms: Foundations and Applications

    We give a survey of the foundations of statistical queries and their many applications to other areas. We introduce the model, give the main definitions, and explore the fundamental theory of statistical queries and how it connects to various notions of learnability. We also give a detailed summary of some of the applications of statistical queries to other areas, including optimization, evolvability, and differential privacy. Comment: 21 pages.
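
    The statistical-query model, and one direction of its connection to differential privacy, can be sketched in a few lines: an SQ oracle answers expectations of bounded queries up to a tolerance, and implementing it with Laplace noise makes each answer differentially private. The constants below are the standard ones for the Laplace mechanism, but the snippet is an illustration rather than a statement from the survey.

        import numpy as np

        def sq_oracle(data, query, tau, rng=None):
            # Answers E_x[query(x)] over the dataset to within tolerance roughly tau by
            # adding Laplace noise of scale tau to the empirical mean.  If |query| <= B,
            # the empirical mean has sensitivity 2B/n, so each answer is also
            # (2B / (n * tau))-differentially private.
            rng = rng if rng is not None else np.random.default_rng()
            empirical_mean = float(np.mean([query(x) for x in data]))
            return empirical_mean + rng.laplace(scale=tau)

        # Example: estimate the probability that two coordinates share a sign.
        samples = np.random.default_rng(0).normal(size=(1000, 2))
        print(sq_oracle(samples, lambda x: float(x[0] * x[1] > 0), tau=0.05))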

    Privacy-preserving Active Learning on Sensitive Data for User Intent Classification

    Active learning holds the promise of significantly reducing data annotation costs while maintaining reasonable model performance. However, it requires sending data to annotators for labeling. This presents a possible privacy leak when the training set includes sensitive user data. In this paper, we describe an approach for carrying out privacy-preserving active learning with quantifiable guarantees. We evaluate our approach by showing the tradeoff between privacy, utility, and annotation budget on a binary classification task in an active learning setting. Comment: To appear at PAL: Privacy-Enhancing Artificial Intelligence and Language Technologies, part of the AAAI Spring Symposium Series (AAAI-SSS 2019).
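
    One way such quantifiable guarantees could be obtained for the "which examples get sent to annotators" step is to select query points with the exponential mechanism, using model uncertainty as the utility score. The sketch below assumes uncertainty scores bounded in [0, 1] and an even budget split per selection; it is an illustrative mechanism, not necessarily the one used in the paper.

        import numpy as np

        def private_query_selection(uncertainties, budget, epsilon, rng=None):
            # Exponential mechanism, applied `budget` times without replacement, with
            # per-selection budget epsilon / budget and utility sensitivity <= 1.
            rng = rng if rng is not None else np.random.default_rng()
            u = np.asarray(uncertainties, dtype=float)
            remaining = np.arange(len(u))
            chosen, eps_each = [], epsilon / budget
            for _ in range(budget):
                logits = eps_each * u[remaining] / 2.0
                probs = np.exp(logits - logits.max())
                probs /= probs.sum()
                pick = int(rng.choice(len(remaining), p=probs))
                chosen.append(int(remaining[pick]))
                remaining = np.delete(remaining, pick)
            return chosen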

    Differentially Private Distributed Learning for Language Modeling Tasks

    One of the big challenges in machine learning applications is that training data can be different from the real-world data faced by the algorithm. In language modeling, users' language (e.g., in private messaging) could change in a year and be completely different from what we observe in publicly available data. At the same time, public data can be used for obtaining general knowledge (i.e., a general model of English). We study approaches to distributed fine-tuning of a general model on users' private data with the additional requirements of maintaining quality on the general data and minimizing communication costs. We propose a novel technique that significantly improves prediction quality on users' language compared to a general model and outperforms gradient compression methods in terms of communication efficiency. The proposed procedure is fast and leads to an almost 70% perplexity reduction and an 8.7 percentage point improvement in keystroke saving rate on informal English texts. We also show that the range of tasks to which our approach is applicable is not limited to language modeling. Finally, we propose an experimental framework for evaluating differential privacy of distributed training of language models and show that our approach has good privacy guarantees.
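
    A rough sketch of one round of such distributed fine-tuning with a privacy-motivated aggregation step (clip each client's weight update, average, add Gaussian noise); the clipping norm, noise scale, and accounting are placeholder assumptions, and the paper's actual procedure may differ.

        import numpy as np

        def dp_federated_round(global_weights, client_updates, clip_norm, noise_std, rng=None):
            # client_updates: list of arrays (fine-tuned weights minus global weights).
            # Each update is clipped in L2 norm, the clipped updates are averaged, and
            # Gaussian noise is added before the average is applied to the global model.
            rng = rng if rng is not None else np.random.default_rng()
            clipped = []
            for delta in client_updates:
                norm = np.linalg.norm(delta)
                clipped.append(delta * min(1.0, clip_norm / (norm + 1e-12)))
            noisy_avg = np.mean(clipped, axis=0) + rng.normal(scale=noise_std,
                                                              size=global_weights.shape)
            return global_weights + noisy_avg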