
    Privacy Preserving Utility Mining: A Survey

    In the big data era, collected data usually contain rich information and hidden knowledge. Utility-oriented pattern mining and analytics have shown a powerful ability to explore these ubiquitous data, which may be collected from various fields and applications, such as market basket analysis, retail, click-stream analysis, medical analysis, and bioinformatics. However, analyzing data that carry sensitive private information raises privacy concerns. To achieve a better trade-off between utility maximization and privacy preservation, Privacy-Preserving Utility Mining (PPUM) has become a critical issue in recent years. In this paper, we provide a comprehensive overview of PPUM. We first present the background of utility mining, privacy-preserving data mining, and PPUM, then introduce the related preliminaries and problem formulation of PPUM, as well as some key evaluation criteria for PPUM. In particular, we present and discuss the current state-of-the-art PPUM algorithms, along with their advantages and deficiencies, in detail. Finally, we highlight and discuss some technical challenges and open directions for future research on PPUM. (Comment: 2018 IEEE International Conference on Big Data, 10 pages.)
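    The core quantity in utility-oriented pattern mining is the utility of an itemset, combining per-unit profit with purchase quantities. The following is a minimal sketch of that computation; the profit table and transactions are invented for illustration and do not come from any specific PPUM algorithm in the survey.

```python
# External utility: per-unit profit of each item (assumed values).
profit = {"a": 5, "b": 2, "c": 1}

# Internal utility: purchase quantities per transaction.
transactions = [
    {"a": 2, "b": 1},          # contributes 2*5 = 10 to the utility of {a}
    {"a": 1, "b": 3, "c": 4},  # contributes 1*5 = 5 to the utility of {a}
    {"b": 2, "c": 1},          # does not contain {a}
]

def itemset_utility(itemset, transactions, profit):
    """Sum quantity * profit over transactions containing the whole itemset."""
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * profit[item] for item in itemset)
    return total
```

    A PPUM algorithm would then perturb the database so that sensitive itemsets fall below a utility threshold while low-sensitivity patterns keep their utility.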

    Towards trajectory anonymization: a generalization-based approach

    Trajectory datasets are becoming popular due to the massive usage of GPS and location-based services. In this paper, we address privacy issues regarding the identification of individuals in static trajectory datasets. We first adapt the notion of k-anonymity to trajectories and propose a novel generalization-based approach for the anonymization of trajectories. We further show that releasing anonymized trajectories may still leak private information. We therefore propose a randomization-based reconstruction algorithm for releasing anonymized trajectory data, and also show how the underlying techniques can be adapted to other anonymity standards. Experimental results on real and synthetic trajectory datasets show the effectiveness of the proposed techniques.
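    The generalization idea can be sketched as coarsening each trajectory point until at least k individuals share every generalized trajectory. This is a simplified illustration, not the paper's algorithm: grid snapping stands in for the generalization step, and the cell size is an assumed parameter.

```python
from collections import Counter

def generalize(traj, cell=10.0):
    """Snap each (x, y) point of a trajectory to its grid cell."""
    return tuple((int(x // cell), int(y // cell)) for x, y in traj)

def is_k_anonymous(trajs, k, cell=10.0):
    """True if every generalized trajectory is shared by at least k records."""
    counts = Counter(generalize(t, cell) for t in trajs)
    return all(c >= k for c in counts.values())

trajs = [
    [(1.0, 2.0), (12.0, 3.0)],
    [(2.5, 1.0), (11.0, 4.0)],   # falls into the same cells as the first
    [(55.0, 60.0), (70.0, 80.0)],  # unique, so it breaks 2-anonymity
]
```

    In practice the cell size would be grown adaptively until the k-anonymity condition holds, trading location precision for privacy.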

    A data recipient centered de-identification method to retain statistical attributes

    Privacy has always been a great concern of patients and medical service providers. As a result of recent advances in information technology and the government’s push for the use of Electronic Health Record (EHR) systems, a large amount of medical data is collected and stored electronically. This data needs to be made available for analysis, but at the same time patient privacy has to be protected through de-identification. Although biomedical researchers often describe their research plans when they request anonymized data, most existing anonymization methods do not use this information when de-identifying the data. As a result, the anonymized data may not be useful for the planned research project. This paper proposes a data-recipient-centered approach that tailors the de-identification method based on input from the recipient of the data. We demonstrate our approach through an anonymization project for biomedical researchers with the specific goal of improving the utility of the anonymized data for the statistical models used in their research project. The selected algorithm improves on a privacy protection method called Condensation, by Aggarwal et al. Our methods were tested and validated on real cancer surveillance data provided by the Kentucky Cancer Registry.
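    The condensation idea referenced above can be sketched as follows: partition records into groups of at least k and release a per-group statistic in place of the individual values, so that aggregates (here, the mean) are preserved while single records are hidden. Grouping by sort order is an illustrative simplification, not the clustering step of Aggarwal et al.

```python
def condense(records, k):
    """Replace each group of k consecutive sorted values by its group mean."""
    records = sorted(records)
    groups = [records[i:i + k] for i in range(0, len(records), k)]
    # Merge a trailing undersized group into the previous one so every
    # group has at least k members.
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    return [sum(g) / len(g) for g in groups for _ in g]

ages = [23, 25, 31, 35, 44, 47]
```

    Because each value is replaced by its group mean, the overall sum and mean of the released column equal those of the original, which is exactly the kind of statistical attribute a data recipient's planned analysis may depend on.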

    Privacy Preserving Mining in Code Profiling Data

    Privacy-preserving data mining is an important issue today, since organizations and individuals generate sensitive data that they are unwilling to share even though it can be valuable for mining. Privacy-preserving mining allows such data to be mined usefully without harming its privacy. Privacy can be preserved by applying encryption to the database to be mined, so that the data remains secure. Code profiling is a field of software engineering in which data mining can be applied to discover knowledge that is useful for future software development. In this work, we apply privacy-preserving mining to code profiling data, such as the software metrics of various programs. The results of mining the actual and the encrypted data are compared for accuracy. We also analyze the results of privacy-preserving mining on code profiling data and report interesting findings.
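    One way mining over protected data can match mining over plaintext is to use a deterministic keyed transformation: equal plaintexts map to equal tokens, so frequency-based mining yields identical counts. The abstract does not specify the encryption scheme, so HMAC-SHA256 below is an assumed stand-in, and the metric values and key are invented.

```python
import hashlib
import hmac
from collections import Counter

KEY = b"demo-secret-key"  # assumed; held by the data owner, not the miner

def pseudonymize(value):
    """Deterministic keyed token for a categorical attribute value."""
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()

# Toy code-profiling attribute: a complexity label per module.
metrics = ["high_complexity", "low_complexity", "high_complexity"]

plain_counts = Counter(metrics)
enc_counts = Counter(pseudonymize(v) for v in metrics)
```

    The multiset of frequencies is identical on both sides, so accuracy of frequency-based mining is unaffected, while the miner never sees the raw metric labels.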

    Anonymization procedures for tabular data: an explanatory technical and legal synthesis

    In the European Union, Data Controllers and Data Processors who work with personal data have to comply with the General Data Protection Regulation and other applicable laws. This affects the storing and processing of personal data. But some data processing in data mining or statistical analyses does not require any personal reference to the data, so the personal context can be removed. For these use cases, to comply with applicable laws, any existing personal information has to be removed by applying so-called anonymization. However, anonymization should maintain data utility. The concept of anonymization is therefore a double-edged sword with an intrinsic trade-off: privacy enforcement vs. utility preservation. The former might not be entirely guaranteed when anonymized data are published as Open Data. In theory and practice, there exist diverse approaches to conducting and scoring anonymization. This explanatory synthesis discusses the technical perspectives on the anonymization of tabular data with a special emphasis on the European Union’s legal basis. The studied methods for conducting anonymization, and for scoring the anonymization procedure and the resulting anonymity, are explained in unifying terminology. The examined methods and scores cover both categorical and numerical data, and the scores involve data utility, information preservation, and privacy models. In practice-relevant examples, the methods and scores are experimentally tested on records from the UCI Machine Learning Repository’s “Census Income (Adult)” dataset.
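    The most common privacy model used to score tabular anonymization is k-anonymity: the size of the smallest equivalence class over the chosen quasi-identifier columns. A minimal sketch follows; the column names echo the Census Income (Adult) dataset mentioned above, but the rows are invented.

```python
from collections import Counter

def k_anonymity(rows, quasi_ids):
    """Smallest equivalence-class size over the quasi-identifier columns."""
    classes = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return min(classes.values())

rows = [
    {"age": "30-39", "sex": "Male",   "income": "<=50K"},
    {"age": "30-39", "sex": "Male",   "income": ">50K"},
    {"age": "40-49", "sex": "Female", "income": "<=50K"},
]
```

    A score of 1 means some record is unique on its quasi-identifiers and thus re-identifiable; generalizing `age` or suppressing the third row would raise the score, at a cost in utility.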

    Sanitizing and Minimizing Databases for Software Application Test Outsourcing

    Testing software applications that use nontrivial databases is increasingly outsourced to test centers in order to achieve lower cost and higher quality. Not only do data privacy laws prevent organizations from sharing this data with test centers because databases contain sensitive information, but the situation is also aggravated by big data: it is time-consuming and difficult to anonymize, distribute, and test with large databases. Deleting data randomly often leads to significantly worse test coverage and fewer uncovered faults, thereby reducing the quality of software applications. We propose a novel approach for Protecting and mInimizing databases for Software TestIng taSks (PISTIS) that both sanitizes and minimizes a database that comes along with an application. PISTIS uses a weight-based data clustering algorithm that partitions data in the database using information, obtained by program analysis, that describes how this data is used by the application. For each cluster, a centroid object is computed that represents the different persons or entities in the cluster, and we use associative rule mining to compute constraints that ensure the centroid objects are representative of the general population of the data in the cluster. Doing so also sanitizes the information, since these centroid objects replace the original data, making it difficult for attackers to infer sensitive information. Thus, we reduce a large database to a few centroid objects, and we show in experiments with two applications that test coverage stays within a close range of its original level.
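    The centroid-replacement step can be sketched as follows: each cluster of database rows is collapsed to a single representative, shrinking the database while masking individual records. The pre-formed clusters below stand in for the output of PISTIS's weight-based, program-analysis-driven clustering, which this sketch does not implement.

```python
def centroid(rows):
    """Component-wise mean of a cluster of numeric rows."""
    n = len(rows)
    return tuple(sum(r[i] for r in rows) / n for i in range(len(rows[0])))

# Two assumed clusters of (age, salary) "person" records.
clusters = [
    [(25, 50000), (27, 52000)],
    [(60, 90000), (62, 88000)],
]

# The minimized, sanitized database: one synthetic centroid per cluster.
minimized_db = [centroid(c) for c in clusters]
```

    The released centroids are synthetic rows that no longer correspond to any real individual, while constraints mined from the cluster (not shown here) keep them realistic enough to preserve test coverage.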