31 research outputs found

    Frequent Symptom Sets Identification from Uncertain Medical Data in Differentially Private Way

    Get PDF

    Frequent itemset mining in big data with effective single scan algorithms

    Get PDF
    © 2013 IEEE. This paper considers frequent itemsets mining in transactional databases. It introduces a new accurate single scan approach for frequent itemset mining (SSFIM), a heuristic as an alternative approach (EA-SSFIM), as well as a parallel implementation on Hadoop clusters (MR-SSFIM). EA-SSFIM and MR-SSFIM target sparse and big databases, respectively. The proposed approach (in all its variants) requires only one scan to extract the candidate itemsets, and it has the advantage to generate a fixed number of candidate itemsets independently from the value of the minimum support. This accelerates the scan process compared with existing approaches while dealing with sparse and big databases. Numerical results show that SSFIM outperforms the state-of-the-art FIM approaches while dealing with medium and large databases. Moreover, EA-SSFIM provides similar performance as SSFIM while considerably reducing the runtime for large databases. The results also reveal the superiority of MR-SSFIM compared with the existing HPC-based solutions for FIM using sparse and big databases

    Knowledge Discovery From Sensitive Data: Differentially Private Bayesian Rule Lists

    Get PDF
    The utility of machine learning is rising, coming from a growing wealth of data and problems that are becoming harder to solve analytically. With these changes there is also the need for interpretable machine learning in order for users to understand how a machine learning algorithm comes to a specific output. Bayesian Rule Lists, an interpretable machine learning algorithm, offers an advanced accuracy to interpretabilty trade off when compared to other interpretable machine learning algorithms. Additionally, with the amount of data collected today, there is a lot of potentially sensitive data that we can learn from such as medical and criminal records. However, to do so, we must guarantee a degree of privacy on the dataset; differential privacy has become the standard for this private data analysis. In this paper, we propose a differentially private algorithm for Bayesian Rule Lists. We first break down the original Bayesian Rule List algorithm into three main components: frequent itemset mining, rule list sampling, and point estimate computation. We then perform a literature review to understand these algorithms, and ways to privatize them. There after we computed the necessary sensitivities for all subroutines, and ran experiments on the resulting differentially private algorithm to gauge utility. Results show that the proposed algorithm is able to output rule lists with good accuracy and decent interpretability

    Privacy-preserving Frequent Itemset Mining for Sparse and Dense Data

    Get PDF
    Frequent itemset mining is a task that can in turn be used for other purposes such as associative rule mining. One problem is that the data may be sensitive, and its owner may refuse to give it for analysis in plaintext. There exist many privacy-preserving solutions for frequent itemset mining, but in any case enhancing the privacy inevitably spoils the efficiency. Leaking some less sensitive information such as data density might improve the efficiency. In this paper, we devise an approach that works better for sparse matrices and compare it to the related work that uses similar security requirements on similar secure multiparty computation platform

    Anonymizing large transaction data using MapReduce

    Get PDF
    Publishing transaction data is important to applications such as marketing research and biomedical studies. Privacy is a concern when publishing such data since they often contain person-specific sensitive information. To address this problem, different data anonymization methods have been proposed. These methods have focused on protecting the associated individuals from different types of privacy leaks as well as preserving utility of the original data. But all these methods are sequential and are designed to process data on a single machine, hence not scalable to large datasets. Recently, MapReduce has emerged as a highly scalable platform for data-intensive applications. In this work, we consider how MapReduce may be used to provide scalability in large transaction data anonymization. More specifically, we consider how setbased generalization methods such as RBAT (Rule-Based Anonymization of Transaction data) may be parallelized using MapReduce. Set-based generalization methods have some desirable features for transaction anonymization, but their highly iterative nature makes parallelization challenging. RBAT is a good representative of such methods. We propose a method for transaction data partitioning and representation. We also present two MapReduce-based parallelizations of RBAT. Our methods ensure scalability when the number of transaction records and domain of items are large. Our preliminary results show that a direct parallelization of RBAT by partitioning data alone can result in significant overhead, which can offset the gains from parallel processing. We propose MR-RBAT that generalizes our direct parallel method and allows to control parallelization overhead. Our experimental results show that MR-RBAT can scale linearly to large datasets and to the available resources while retaining good data utility

    Secure Protocols for Privacy-preserving Data Outsourcing, Integration, and Auditing

    Get PDF
    As the amount of data available from a wide range of domains has increased tremendously in recent years, the demand for data sharing and integration has also risen. The cloud computing paradigm provides great flexibility to data owners with respect to computation and storage capabilities, which makes it a suitable platform for them to share their data. Outsourcing person-specific data to the cloud, however, imposes serious concerns about the confidentiality of the outsourced data, the privacy of the individuals referenced in the data, as well as the confidentiality of the queries processed over the data. Data integration is another form of data sharing, where data owners jointly perform the integration process, and the resulting dataset is shared between them. Integrating related data from different sources enables individuals, businesses, organizations and government agencies to perform better data analysis, make better informed decisions, and provide better services. Designing distributed, secure, and privacy-preserving protocols for integrating person-specific data, however, poses several challenges, including how to prevent each party from inferring sensitive information about individuals during the execution of the protocol, how to guarantee an effective level of privacy on the released data while maintaining utility for data mining, and how to support public auditing such that anyone at any time can verify that the integration was executed correctly and no participants deviated from the protocol. In this thesis, we address the aforementioned concerns by presenting secure protocols for privacy-preserving data outsourcing, integration and auditing. First, we propose a secure cloud-based data outsourcing and query processing framework that simultaneously preserves the confidentiality of the data and the query requests, while providing differential privacy guarantees on the query results. Second, we propose a publicly verifiable protocol for integrating person-specific data from multiple data owners, while providing differential privacy guarantees and maintaining an effective level of utility on the released data for the purpose of data mining. Next, we propose a privacy-preserving multi-party protocol for high-dimensional data mashup with guaranteed LKC-privacy on the output data. Finally, we apply the theory to the real world problem of solvency in Bitcoin. More specifically, we propose a privacy-preserving and publicly verifiable cryptographic proof of solvency scheme for Bitcoin exchanges such that no information is revealed about the exchange's customer holdings, the value of the exchange's total holdings is kept secret, and multiple exchanges performing the same proof of solvency can contemporaneously prove they are not colluding
    corecore