    Improved k-Anonymize and l-Diverse Approach for Privacy Preserving Big Data Publishing Using MPSEC Dataset

    Data exposure and privacy violations can occur when data is exchanged between organizations. Data anonymization is a promising way to limit these risks, and various k-anonymization and l-diversity methods have been widely used to preserve privacy. On larger datasets, however, existing anonymization algorithms suffer from high information loss and long running times. To overcome this problem, this paper proposes two new models, Improved k-Anonymization (IKA) and Improved l-Diversity (ILD). The IKA model takes a large k-value and uses a symmetric as well as an asymmetric anonymizing algorithm; it is accordingly divided into Improved Symmetric k-Anonymization (ISKA) and Improved Asymmetric k-Anonymization (IAKA). After the data is anonymized using IKA, the ILD model is applied to make the data more diverse and thereby increase privacy. This paper presents an implementation of the proposed IKA and ILD models, along with Apache Storm, on a real-time big candidate election dataset acquired from the Madhya Pradesh State Election Commission, India (MPSEC). The paper also compares the proposed models with existing algorithms, namely Fast clustering-based Anonymization for Data Streams (FADS), Fast Anonymization for Data Stream (FAST), Map Reduce Anonymization (MRA) and Scalable k-Anonymization (SKA). The experimental results show that IKA and ILD markedly reduce information loss and significantly improve running time over the existing approaches while maintaining the privacy-utility trade-off.
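
The abstract above does not spell out the IKA or ILD algorithms, so the following is only a minimal sketch of the two standard properties they build on: k-anonymity (every combination of quasi-identifier values is shared by at least k records) and distinct l-diversity (every such group contains at least l distinct sensitive values). The field names `age`, `district`, and `party` and the sample records are hypothetical and are not taken from the MPSEC dataset.

```python
from collections import defaultdict


def group_by_quasi_identifiers(records, quasi_ids):
    """Group records into equivalence classes by their quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_ids)
        groups[key].append(record)
    return groups


def is_k_anonymous(records, quasi_ids, k):
    """True if every equivalence class contains at least k records."""
    groups = group_by_quasi_identifiers(records, quasi_ids)
    return all(len(group) >= k for group in groups.values())


def is_l_diverse(records, quasi_ids, sensitive, l):
    """True if every equivalence class has at least l distinct sensitive values."""
    groups = group_by_quasi_identifiers(records, quasi_ids)
    return all(
        len({record[sensitive] for record in group}) >= l
        for group in groups.values()
    )


# Hypothetical, already-generalized records (ages reduced to ranges).
records = [
    {"age": "30-39", "district": "A", "party": "X"},
    {"age": "30-39", "district": "A", "party": "Y"},
    {"age": "40-49", "district": "B", "party": "X"},
    {"age": "40-49", "district": "B", "party": "X"},
]

quasi_ids = ["age", "district"]
print(is_k_anonymous(records, quasi_ids, k=2))
print(is_l_diverse(records, quasi_ids, sensitive="party", l=2))
```

In this sample, both groups have two records, but the second group carries only one distinct `party` value. This is the kind of gap that an l-diversity step applied after k-anonymization is meant to close.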

    Differentially Private Publication of Sparse Data

    The problem of privately releasing data is to provide a version of a dataset without revealing sensitive information about the individuals who contribute to the data. The model of differential privacy allows such private release while providing strong guarantees on the output. A basic mechanism achieves differential privacy by adding noise to the frequency counts in the contingency tables (or a subset of the count data cube) derived from the dataset. However, when the dataset is sparse in its underlying space, as is the case for most multi-attribute relations, the effect of adding noise is to vastly increase the size of the published data: it implicitly creates a huge number of dummy data points to mask the true data, making the result almost impossible to work with. We present techniques to overcome this roadblock and allow efficient private release of sparse data, while maintaining the guarantees of differential privacy. Our approach is to release a compact summary of the noisy data. Generating the noisy data and then summarizing it would still be very costly, so we show how to shortcut this step and instead generate the summary directly from the input data, without materializing the vast intermediate noisy data. We instantiate this outline for a variety of sampling and filtering methods, and show how to use the resulting summary for approximate, private query answering. Our experimental study shows that this is an effective, practical solution, with comparable and occasionally improved utility over the costly materialization approach.
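
To make the "shortcut" idea concrete, here is a simplified sketch of one possible filtering variant. It is not the authors' implementation. It assumes Laplace noise with scale 1/epsilon (count sensitivity 1), a one-sided threshold `theta >= 0`, and cells identified by integer indices in `range(domain_size)`; the function and parameter names are hypothetical. Non-zero cells get noise added directly. For the zero cells, each one would pass the filter independently with probability 0.5 * exp(-theta * epsilon), so the sketch jumps between passing cells with geometric skips and draws each passing value from the tail of the Laplace distribution above `theta`, without ever visiting the cells that fail.

```python
import math
import random


def laplace(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def sparse_noisy_summary(counts, domain_size, epsilon, theta, rng):
    """Thresholded noisy counts, built without materializing every cell.

    counts maps cell index -> true count; cells that are absent are zero.
    """
    if theta < 0:
        raise ValueError("theta must be non-negative for this sketch")
    scale = 1.0 / epsilon  # assumes each individual changes one count by 1
    summary = {}

    # Non-zero cells: add noise as usual and keep values above the threshold.
    for cell, count in counts.items():
        if count != 0:
            noisy = count + laplace(scale, rng)
            if noisy > theta:
                summary[cell] = noisy

    # Zero cells: P(Laplace(0, scale) > theta) = 0.5 * exp(-theta / scale).
    p = 0.5 * math.exp(-theta / scale)
    if p > 0.0:
        log_fail = math.log1p(-p)  # log of the per-cell failure probability
        i = -1
        while True:
            u = 1.0 - rng.random()  # uniform in (0, 1]
            i += 1 + int(math.log(u) / log_fail)  # geometric skip
            if i >= domain_size:
                break
            if counts.get(i, 0) != 0:
                continue  # already handled in the non-zero pass
            # Conditioned on exceeding theta, the noise is theta + Exp(scale).
            summary[i] = theta + rng.expovariate(1.0 / scale)
    return summary


rng = random.Random(0)
summary = sparse_noisy_summary({3: 1000, 500000: 2000}, 10**6, 1.0, 10.0, rng)
print(len(summary), "cells published out of", 10**6)
```

The expected work is proportional to the number of non-zero cells plus the number of zero cells that pass the filter, rather than to the full domain size. Choosing `theta` trades summary size against how much low-count signal is retained.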