    Privacy Preserving Distributed Data Mining

    Privacy preserving distributed data mining aims to design secure protocols that allow multiple parties to conduct collaborative data mining while protecting data privacy. My research focuses on the design and implementation of privacy preserving two-party protocols based on homomorphic encryption. I present new results in this area, including new secure protocols for basic operations and two fundamental privacy preserving data mining protocols. I propose a number of secure protocols for basic operations in an additive secret-sharing scheme based on homomorphic encryption. I derive a basic relationship between a secret number and its shares, from which I develop efficient protocols for secure comparison and for secure division with a public divisor. I also design a secure inverse square root protocol based on Newton's iterative method, and from it derive a solution to the secure square root problem. In addition, I propose a secure exponential protocol based on Taylor series expansions. All these protocols are implemented using secure multiplication and can be used to build privacy preserving distributed data mining protocols. In particular, I develop efficient privacy preserving protocols for two fundamental data mining tasks: multiple linear regression and EM clustering. Both protocols work for arbitrarily partitioned datasets. The two-party privacy preserving linear regression protocol is provably secure in the semi-honest model, and the EM clustering protocol discloses only the number of iterations. I provide a proof-of-concept implementation of these protocols in C++, based on the Paillier cryptosystem.
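
    Since the protocols above rest on Paillier's additive homomorphism, a minimal Python sketch of that property may help. This is a toy illustration with small parameters, not the thesis's C++ implementation, and all names and parameter choices here are illustrative.

        # Toy Paillier cryptosystem: E(a) * E(b) mod n^2 decrypts to a + b.
        # Illustration only; real deployments need >= 2048-bit moduli.
        from math import gcd
        import random

        def lcm(a, b):
            return a * b // gcd(a, b)

        def keygen(p=10007, q=10009):      # small primes for demonstration
            n = p * q
            lam = lcm(p - 1, q - 1)
            mu = pow(lam, -1, n)           # valid because we fix g = n + 1
            return n, (lam, mu, n)

        def encrypt(n, m):
            r = random.randrange(2, n)
            while gcd(r, n) != 1:
                r = random.randrange(2, n)
            # g^m = (1 + n)^m = 1 + m*n (mod n^2) for g = n + 1
            return (1 + m * n) * pow(r, n, n * n) % (n * n)

        def decrypt(sk, c):
            lam, mu, n = sk
            u = pow(c, lam, n * n)
            return (u - 1) // n * mu % n   # L(u) = (u - 1) / n, then * mu

        n, sk = keygen()
        c = encrypt(n, 1234) * encrypt(n, 5678) % (n * n)   # add under encryption
        assert decrypt(sk, c) == 1234 + 5678

    Secure comparison, division, and the iterative protocols are then layered on top of this additive structure together with a secure multiplication subprotocol.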

    sPLINK: a hybrid federated tool as a robust alternative to meta-analysis in genome-wide association studies

    Meta-analysis has been established as an effective approach to combining the summary statistics of several genome-wide association studies (GWAS). However, its accuracy can be attenuated in the presence of cross-study heterogeneity. We present sPLINK, a hybrid federated and user-friendly tool that performs privacy-aware GWAS on distributed datasets while preserving the accuracy of the results. sPLINK is robust against heterogeneous distributions of data across cohorts, whereas meta-analysis loses considerable accuracy in such scenarios. sPLINK achieves practical runtime and acceptable network usage for chi-square and linear/logistic regression tests.
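
    The core reason a hybrid federated approach can match pooled analysis is that cohorts can exchange aggregate sufficient statistics instead of raw genotypes. The sketch below illustrates this for a linear regression test in plain NumPy; the function names are hypothetical and this is not sPLINK's actual API or its privacy layer.

        # Each cohort computes X^T X and X^T y locally; only these aggregates
        # are combined, yet the result equals regression on the pooled data.
        import numpy as np

        def local_stats(X, y):             # runs inside each cohort
            return X.T @ X, X.T @ y

        def federated_fit(stats):          # runs at the aggregator
            XtX = sum(s[0] for s in stats)
            Xty = sum(s[1] for s in stats)
            return np.linalg.solve(XtX, Xty)

        rng = np.random.default_rng(0)
        beta = np.array([0.5, -1.0, 2.0])
        cohorts = []
        for size in (100, 250, 80):        # heterogeneous cohort sizes
            X = rng.normal(size=(size, 3))
            cohorts.append((X, X @ beta + rng.normal(scale=0.1, size=size)))

        beta_fed = federated_fit([local_stats(X, y) for X, y in cohorts])
        X_all = np.vstack([X for X, _ in cohorts])
        y_all = np.concatenate([y for _, y in cohorts])
        assert np.allclose(beta_fed, np.linalg.lstsq(X_all, y_all, rcond=None)[0])

    Because the federated estimate coincides with the pooled one, accuracy does not degrade under cross-cohort heterogeneity, unlike meta-analysis of per-cohort effect estimates.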

    PrivFL: Practical Privacy-preserving Federated Regressions on High-dimensional Data over Mobile Networks

    Federated Learning (FL) enables a large number of users, coordinated by a centralized server, to jointly learn a shared machine learning (ML) model while the data remain distributed across their devices. This approach lets the server and users train an ML model using gradient descent while keeping all the training data on users' devices. We consider training an ML model over a mobile network, where user dropout is a common phenomenon. Although federated learning is aimed at reducing data privacy risks, the privacy of the ML model itself has received far less attention. In this work, we present PrivFL, a privacy-preserving system for training (predictive) linear and logistic regression models and making oblivious predictions in the federated setting, while guaranteeing data and model privacy and ensuring robustness to users dropping out of the network. We design two privacy-preserving protocols for training linear and logistic regression models based on an additive homomorphic encryption (HE) scheme and an aggregation protocol. Exploiting the training algorithm of federated learning, the core of our training protocols is a secure multiparty global gradient computation on the data of alive users. We analyze the security of our training protocols against semi-honest adversaries. As long as the aggregation protocol is secure under the aggregation privacy game and the additive HE scheme is semantically secure, PrivFL guarantees the users' data privacy against the server, and the server's regression model privacy against the users. We demonstrate the performance of PrivFL on real-world datasets and show its applicability in federated learning systems. Comment: In Proceedings of the 2019 ACM SIGSAC Conference on Cloud Computing Security Workshop (CCSW'19).
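
    The heart of such systems is a secure aggregation step in which the server learns only the sum of the users' contributions. The toy below shows one standard mask-cancellation construction for secure aggregation; PrivFL's actual protocol additionally uses additive HE and tolerates dropouts, and every name here is illustrative rather than PrivFL's API.

        # Pairwise-mask secure aggregation: each pair of users derives a shared
        # mask from a common seed; one adds it, the other subtracts it, so all
        # masks cancel in the server's sum and only the aggregate is revealed.
        import random

        P = 2**31 - 1                      # public modulus

        def masked_update(uid, gradient, peer_seeds):
            masked = gradient % P
            for peer, seed in peer_seeds.items():
                mask = random.Random(seed).randrange(P)
                masked = (masked + mask) % P if uid < peer else (masked - mask) % P
            return masked

        users = [1, 2, 3]
        seeds = {frozenset(p): random.randrange(P) for p in [(1, 2), (1, 3), (2, 3)]}
        grads = {1: 10, 2: 20, 3: 33}

        uploads = [masked_update(u, grads[u],
                                 {v: seeds[frozenset((u, v))] for v in users if v != u})
                   for u in users]
        assert sum(uploads) % P == sum(grads.values()) % P   # masks cancel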

    Supporting Regularized Logistic Regression Privately and Efficiently

    As one of the most popular statistical and machine learning models, logistic regression with regularization has found wide adoption in biomedicine, the social sciences, information technology, and beyond. These domains often involve data from human subjects that are governed by strict privacy regulations. Increasing concerns over data privacy make it more and more difficult to coordinate and conduct large-scale collaborative studies, which typically rely on cross-institution data sharing and joint analysis. Our work focuses on safeguarding regularized logistic regression, a machine learning model that is widely used across disciplines yet has not been investigated from a data security and privacy perspective. We consider a common use scenario of multi-institution collaborative studies, such as research consortia or networks as widely seen in genetics, epidemiology, and the social sciences. To make our privacy-enhancing solution practical, we demonstrate a non-conventional and computationally efficient method that leverages distributed computing and strong cryptography to provide comprehensive protection over individual-level and summary data. Extensive empirical evaluation on several studies validates the privacy guarantees, efficiency, and scalability of our proposal. We also discuss the practical implications of our solution for large-scale studies and applications across disciplines, including genetic and biomedical studies, smart grids, and network analysis.
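
    For reference, the model being protected is ordinary L2-regularized logistic regression; a plain, unencrypted NumPy sketch of its training loop follows. The paper's contribution lies in evaluating these same updates under distributed computation and cryptography, which this sketch deliberately omits.

        # L2-regularized logistic regression via batch gradient descent.
        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def fit_logreg(X, y, lam=0.1, lr=0.1, iters=500):
            w = np.zeros(X.shape[1])
            for _ in range(iters):
                # gradient of the average log-loss plus the L2 penalty term
                grad = X.T @ (sigmoid(X @ w) - y) / len(y) + lam * w
                w -= lr * grad
            return w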

    Protecting privacy of users in brain-computer interface applications

    Machine learning (ML) is revolutionizing research and industry. Many ML applications rely on large amounts of personal data for training and inference. Among the most intimate data sources exploited is electroencephalogram (EEG) data, which is so rich in information that application developers can easily extract knowledge beyond the professed scope from unprotected EEG signals, including passwords, ATM PINs, and other sensitive data. The challenge we address is how to engage in meaningful ML with EEG data while protecting the privacy of users. Hence, we propose cryptographic protocols based on secure multiparty computation (SMC) to perform linear regression over EEG signals from many users in a fully privacy-preserving (PP) fashion, i.e., such that each individual's EEG signals are not revealed to anyone else. To illustrate the potential of our secure framework, we show how it allows estimating the drowsiness of drivers from their EEG signals, as would be possible in the unencrypted case, at a very reasonable computational cost. Our solution is the first application of commodity-based SMC to EEG data, as well as the largest documented experiment of secret sharing-based SMC in general, with 15 players involved in all the computations.
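
    The building block behind secret sharing-based SMC is simple to state: a value is split into additive shares, any strict subset of which is uniformly random. A minimal sketch follows; the modulus, player count, and quantization are illustrative, and the actual framework uses commodity-based SMC protocols rather than this bare construction.

        # Additive secret sharing over a public prime field: n - 1 random
        # shares plus one correction share; fewer than n shares reveal nothing.
        import random

        P = 2**61 - 1

        def share(secret, n_players):
            shares = [random.randrange(P) for _ in range(n_players - 1)]
            shares.append((secret - sum(shares)) % P)
            return shares

        def reconstruct(shares):
            return sum(shares) % P

        eeg_sample = 4242                  # e.g., one quantized EEG reading
        shares = share(eeg_sample, 15)     # 15 players, as in the experiments
        assert reconstruct(shares) == eeg_sample

    Linear operations such as the sums and inner products inside linear regression can then be performed directly on the shares, so no individual EEG signal is ever reconstructed.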

    Online Distributed Learning over Random Networks

    The recent deployment of multi-agent systems in a wide range of scenarios has enabled the solution of learning problems in a distributed fashion. In this context, agents are tasked with collecting local data and then cooperatively training a model, without directly sharing the data. While distributed learning offers the advantage of preserving agents' privacy, it also poses several challenges in designing and analyzing suitable algorithms. This work focuses on the following challenges motivated by practical implementation: (i) online learning, where the local data change over time; (ii) asynchronous agent computations; (iii) unreliable and limited communications; and (iv) inexact local computations. To tackle these challenges, we introduce the Distributed Operator Theoretical (DOT) version of the Alternating Direction Method of Multipliers (ADMM), which we call the DOT-ADMM Algorithm. We prove that it converges at a linear rate for a large class of convex learning problems (e.g., linear and logistic regression) toward a bounded neighborhood of the optimal time-varying solution, and we characterize how that neighborhood depends on (i)–(iv). We corroborate the theoretical analysis with numerical simulations comparing the DOT-ADMM Algorithm with state-of-the-art algorithms, showing that only the proposed algorithm exhibits robustness to (i)–(iv).
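
    As context for the algorithm, classical synchronous consensus ADMM for distributed least squares is sketched below in NumPy. DOT-ADMM generalizes this scheme through an operator-theoretic analysis to tolerate challenges (i)–(iv); the toy version here models none of them and is purely illustrative.

        # Consensus ADMM: agents alternate local solves, averaging, and dual
        # updates until the local estimates agree on a common model z.
        import numpy as np

        def consensus_admm(As, bs, rho=1.0, iters=300):
            n, d = len(As), As[0].shape[1]
            z = np.zeros(d)
            xs = [np.zeros(d) for _ in range(n)]
            us = [np.zeros(d) for _ in range(n)]
            for _ in range(iters):
                # local primal step: regularized least squares at each agent
                xs = [np.linalg.solve(A.T @ A + rho * np.eye(d),
                                      A.T @ b + rho * (z - u))
                      for A, b, u in zip(As, bs, us)]
                # consensus step: average the shifted local estimates
                z = np.mean([x + u for x, u in zip(xs, us)], axis=0)
                # dual step: accumulate each agent's disagreement with z
                us = [u + x - z for x, u in zip(xs, us)]
            return z

        rng = np.random.default_rng(1)
        w = np.array([1.0, -2.0, 0.5])
        As = [rng.normal(size=(40, 3)) for _ in range(4)]
        bs = [A @ w for A in As]
        print(np.linalg.norm(consensus_admm(As, bs) - w))  # small at convergence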