Privacy Preserving Data Mining For Horizontally Distributed Medical Data Analysis
To build reliable prediction models and identify useful patterns, it is increasingly common to assemble data sets from databases maintained by different sources, such as hospitals. Doing so, however, can divulge sensitive information about individuals, raising privacy concerns that in turn prevent parties from sharing information. Privacy Preserving Distributed Data Mining (PPDDM) addresses this issue without accessing actual data values, avoiding the disclosure of any information beyond the final result. In recent years, a number of state-of-the-art PPDDM approaches have been developed, most of them based on Secure Multiparty Computation (SMC). SMC, however, incurs expensive communication and requires sophisticated secure computation, and mining inevitably slows as the volume of aggregated data grows. In this work, a new framework named Privacy-Aware Non-linear SVM (PAN-SVM) is proposed to build a PPDDM model from multiple data sources. PAN-SVM employs the Secure Sum Protocol to protect privacy at the bottom layer, reduces communication and computation costs via Nyström matrix approximation and eigendecomposition at the middle layer, and speeds up the whole algorithm for large-scale datasets at the top layer. Based on the proposed PAN-SVM framework, a privacy-preserving multi-class classifier is built, and experimental results on several benchmark and microarray datasets show its ability to improve classification accuracy compared with a regular SVM. In addition, two privacy-preserving feature selection methods are proposed based on PAN-SVM and tested on benchmark and real-world data. PAN-SVM does not depend on a trusted third party; all participants collaborate equally.
Extensive experimental results show that PAN-SVM not only effectively solves the problem of collaborative privacy-preserving data mining by building non-linear classification rules, but also significantly improves the performance of the resulting classifiers.
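The Secure Sum Protocol at PAN-SVM's bottom layer can be illustrated with a minimal sketch of the classic ring-based secure sum; this is a generic illustration under simplifying assumptions (honest-but-curious parties, no collusion), not the paper's implementation:

```python
import random

def secure_sum(local_values, modulus=2**31 - 1):
    """Toy ring-based secure sum: the initiator blinds its value with a
    random offset, each party adds its own value to the blinded running
    total, and the initiator removes the offset at the end. No party
    ever sees another party's raw value, only a uniformly random total."""
    # Initiator (party 0) blinds its value with a random offset R.
    R = random.randrange(modulus)
    running = (R + local_values[0]) % modulus
    # Each remaining party adds its value; the running total it sees is
    # masked by R and therefore reveals nothing about earlier values.
    for v in local_values[1:]:
        running = (running + v) % modulus
    # The initiator removes R to recover the true sum.
    return (running - R) % modulus

print(secure_sum([10, 20, 30]))  # 60
```

Each value must be smaller than the modulus for the arithmetic to be exact; real deployments add defenses against colluding neighbors, which this sketch omits.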
A Data Mining Perspective in Privacy Preserving Data Mining Systems
Privacy Preserving Data Mining (PPDM) provides a framework for extracting and deriving information when data is distributed among multiple parties. Preserving data privacy while still using efficient data mining algorithms remains a major open issue in such systems. Most existing systems employ cryptographic key exchange and key computation, accomplished by means of a trusted server or third party. To eliminate the key-exchange and key-computation overheads, this paper discusses the Key Distribution-Less Privacy Preserving Data Mining (KDLPPDM) system. Its novelty is that no data is published; only association rules are published to achieve effective data mining results. The KDLPPDM system embodies a data mining algorithm for classification rule generation. The results discussed in this paper compare the KDLPPDM system with a conventional key-exchange-based system, and demonstrate the former's superior rule generation efficiency, overhead reduction, and classification efficiency.
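The core idea, publishing mined association rules rather than the underlying records, can be sketched as follows; this is a single-party, frequent-pairs-only toy (data and thresholds are illustrative), not the paper's algorithm:

```python
from itertools import combinations

def association_rules(transactions, min_support=0.5, min_conf=0.8):
    """Toy rule miner: a party mines association rules from its own
    records and publishes only the rules (lhs, rhs, confidence),
    never the raw transactions."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    for a, b in combinations(items, 2):
        pair = {a, b}
        if support(pair) >= min_support:
            # Consider both rule directions for the frequent pair.
            for lhs, rhs in ((a, b), (b, a)):
                conf = support(pair) / support({lhs})
                if conf >= min_conf:
                    rules.append((lhs, rhs, conf))
    return rules

baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"},
           {"bread"}, {"milk", "bread"}]
print(association_rules(baskets))  # [('milk', 'bread', 1.0)]
```

Only the rules leave the party's boundary, which is the privacy stance the KDLPPDM abstract describes; the abstract's key-distribution-free multi-party coordination is beyond this sketch.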
Privacy-Preserving Generalized Linear Models using Distributed Block Coordinate Descent
Combining data from varied sources has considerable potential for knowledge discovery: collaborating data parties can mine data in an expanded feature space, allowing them to explore a larger range of scientific questions. However, data sharing among different parties is highly restricted by legal conditions, ethical concerns, and/or data volume. Fueled by these concerns, the fields of cryptography and distributed learning have made great progress towards privacy-preserving and distributed data mining. However, practical implementations have been hampered by the limited scope or computational complexity of these methods. In this paper, we greatly extend the range of analyses available for vertically partitioned data, i.e., data collected by separate parties with different features on the same subjects. To this end, we present a novel approach for privacy-preserving generalized linear models, a fundamental and powerful framework underlying many prediction and classification procedures. We base our method on a distributed block coordinate descent algorithm to obtain parameter estimates, and we develop an extension to compute accurate standard errors without additional communication cost. We critically evaluate the information transfer for semi-honest collaborators and show that our protocol is secure against data reconstruction. Through both simulated and real-world examples we illustrate the functionality of our proposed algorithm. Without leaking information, our method performs as well on vertically partitioned data as existing methods on combined data, all within mere minutes of computation time. We conclude that our method is a viable approach for vertically partitioned data analysis with a wide range of real-world applications.
Comment: Fully reproducible code for all results and images can be found at https://github.com/vankesteren/privacy-preserving-glm, and the software package can be found at https://github.com/vankesteren/privre
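The central mechanism, block coordinate descent over vertically partitioned data in which parties exchange only their partial linear predictors, can be sketched for the simplest Gaussian/identity-link case; data, names, and iteration counts are hypothetical, and the authors' secure-communication and standard-error machinery is deliberately omitted:

```python
def fit_vertical_bcd(X_parts, y, n_iters=200):
    """Toy block coordinate descent for a linear model on vertically
    partitioned data: each party holds one feature column and its own
    coefficient, and shares only its partial linear predictor
    x_j * beta_j with the others, never the raw feature values."""
    n = len(y)
    betas = [0.0 for _ in X_parts]
    # partial[j][i] = X_parts[j][i] * betas[j]: the only exchanged values.
    partial = [[0.0] * n for _ in X_parts]
    for _ in range(n_iters):
        for j, xj in enumerate(X_parts):
            # Residual is computed from y and the shared partial
            # predictors alone (includes party j's own contribution).
            resid = [y[i] - sum(p[i] for p in partial) for i in range(n)]
            # Exact coordinate-wise least-squares update for beta_j.
            num = sum(xj[i] * resid[i] for i in range(n))
            den = sum(v * v for v in xj)
            betas[j] += num / den
            partial[j] = [xj[i] * betas[j] for i in range(n)]
    return betas

# y = 2*x1 + 3*x2 exactly, with x1 and x2 held by different parties.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [0.5, -1.0, 2.0, 1.5]
y = [2 * a + 3 * b for a, b in zip(x1, x2)]
b1, b2 = fit_vertical_bcd([x1, x2], y)
print(round(b1, 3), round(b2, 3))  # 2.0 3.0
```

Because the response is exactly linear in the two feature blocks, the cyclic updates converge to the joint least-squares solution; generalizing to other GLM families replaces the residual step with the appropriate working response.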
Privacy-preserving scoring of tree ensembles: a novel framework for AI in healthcare
Machine Learning (ML) techniques now impact a wide variety of domains. Highly regulated industries such as healthcare and finance have stringent compliance and data governance policies around data sharing. Advances in secure multiparty computation (SMC) for privacy-preserving machine learning (PPML) can help transform these regulated industries by allowing ML computations over encrypted data containing personally identifiable information (PII). Yet very little SMC-based PPML has been put into practice so far. In this paper we present the first framework for privacy-preserving classification of tree ensembles with an application in healthcare. We first describe the underlying cryptographic protocols that enable a healthcare organization to send encrypted data securely to an ML scoring service and obtain encrypted class labels, without the scoring service ever seeing that input in the clear. We then describe the deployment challenges we solved to integrate these protocols into a cloud-based, scalable risk-prediction platform with multiple ML models for healthcare AI. We include the system internals and evaluations of our deployment, which supports physicians in driving better clinical outcomes in an accurate, scalable, and provably secure manner. To the best of our knowledge, this is the first such applied framework with SMC-based privacy-preserving machine learning for healthcare.
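The plaintext functionality such a protocol protects, tree-ensemble scoring via threshold comparisons at internal nodes, can be sketched as follows; feature names, thresholds, and scores here are hypothetical, and the cryptographic layer that evaluates these comparisons on encrypted inputs is deliberately omitted:

```python
def score_tree(tree, x):
    """Walk one decision tree: internal nodes compare a feature against
    a threshold, leaves carry a score. In the SMC setting these
    comparisons are evaluated obliviously over encrypted inputs; this
    function shows only the cleartext computation being protected."""
    while "leaf" not in tree:
        branch = "left" if x[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

def score_ensemble(trees, x):
    # Ensemble score: sum of per-tree leaf scores; a class label would
    # follow from thresholding or an argmax over per-class scores.
    return sum(score_tree(t, x) for t in trees)

# Two toy decision stumps over made-up patient features.
trees = [
    {"feature": "age", "threshold": 50,
     "left": {"leaf": 0.1}, "right": {"leaf": 0.9}},
    {"feature": "bp", "threshold": 140,
     "left": {"leaf": 0.2}, "right": {"leaf": 0.8}},
]
risk = score_ensemble(trees, {"age": 63, "bp": 120})
print(risk)
```

In this toy run, age 63 takes the right branch (0.9) and blood pressure 120 takes the left branch (0.2), so the ensemble risk is their sum; the hard part the paper addresses is performing each `<` comparison and the final aggregation without ever decrypting the inputs.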