    Anonymizing large transaction data using MapReduce

    Publishing transaction data is important to applications such as marketing research and biomedical studies. Privacy is a concern when publishing such data since they often contain person-specific sensitive information. To address this problem, different data anonymization methods have been proposed. These methods have focused on protecting the associated individuals from different types of privacy leaks as well as preserving the utility of the original data. But all these methods are sequential and designed to process data on a single machine, and hence do not scale to large datasets. Recently, MapReduce has emerged as a highly scalable platform for data-intensive applications. In this work, we consider how MapReduce may be used to provide scalability in large transaction data anonymization. More specifically, we consider how set-based generalization methods such as RBAT (Rule-Based Anonymization of Transaction data) may be parallelized using MapReduce. Set-based generalization methods have some desirable features for transaction anonymization, but their highly iterative nature makes parallelization challenging. RBAT is a good representative of such methods. We propose a method for transaction data partitioning and representation. We also present two MapReduce-based parallelizations of RBAT. Our methods ensure scalability when the number of transaction records and the domain of items are large. Our preliminary results show that a direct parallelization of RBAT by partitioning data alone can result in significant overhead, which can offset the gains from parallel processing. We propose MR-RBAT, which generalizes our direct parallel method and allows the parallelization overhead to be controlled. Our experimental results show that MR-RBAT can scale linearly to large datasets and to the available resources while retaining good data utility.
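
The idea of partitioning transaction data and combining partial results can be illustrated with a small single-process sketch. It is not RBAT or MR-RBAT, whose details the abstract does not give; it only shows the map/reduce pattern for counting item supports, a quantity that generalization methods typically need. The function names (`map_partition`, `reduce_counts`, `item_supports`) and the round-robin partitioning are assumptions made for illustration.

```python
from collections import Counter
from functools import reduce


def partition(transactions, n_parts):
    """Split the transaction list into n_parts round-robin partitions (assumed scheme)."""
    return [transactions[i::n_parts] for i in range(n_parts)]


def map_partition(transactions):
    """Map step: for one partition, count how many transactions contain each item."""
    counts = Counter()
    for items in transactions:
        counts.update(set(items))  # count each item once per transaction
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial support tables."""
    return left + right


def item_supports(transactions, n_parts=2):
    """Run the map step on every partition, then fold the partial results together."""
    partials = [map_partition(p) for p in partition(transactions, n_parts)]
    return reduce(reduce_counts, partials, Counter())


# Hypothetical toy data, not taken from the paper.
transactions = [
    ["bread", "milk"],
    ["bread", "beer"],
    ["milk", "beer", "bread"],
    ["milk"],
]
supports = item_supports(transactions, n_parts=2)
```

Because the reduce step is a plain sum, the result does not depend on how many partitions are used. An iterative method would repeat such a round many times, which is where the overhead mentioned in the abstract can come from.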

    Scalable TPTDS Data Anonymization over Cloud using MapReduce

    With the rapid advance of the big data era, large amounts of data are collected, mined and published, and data publishing has become a day-to-day routine activity. Cloud computing is the model best suited to supporting big data applications. A large number of cloud services need users to share microdata, such as electronic health records and financial transaction data, so that the data can be analyzed. But one of the major issues in moving toward the cloud is the threat to privacy. Data anonymization techniques are widely used to address privacy concerns. Anonymizing data sets using generalization to achieve k-anonymity is one such privacy-preserving technique. Currently, the scale of data in many cloud applications is increasing massively in accordance with the Big Data trend, making it difficult for commonly used software tools to capture, handle, manage and process such large-scale datasets. As a result, it is a challenge for existing approaches to anonymize large-scale data sets, because they do not scale. This paper presents a two-phase top-down specialization (TPTDS) approach to anonymizing large-scale datasets. The approach uses the MapReduce framework on the cloud so that it is highly scalable and efficient. We also introduce a scheduling mechanism called Optimized Balanced Scheduling (OBS) for applying the anonymization. Under OBS, each dataset has its own sensitive field, which is given priority; anonymization is then applied only to that sensitive field, according to the schedule. DOI: 10.17762/ijritcc2321-8169.15077
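
To make the notion of k-anonymity through generalization concrete, here is a minimal sketch. It is not the TPTDS or OBS procedure from the abstract; it only checks whether a table is k-anonymous over a chosen set of quasi-identifiers, before and after one generalization step. The records, the field names, and the helpers `generalize_age` and `is_k_anonymous` are hypothetical.

```python
from collections import Counter


def generalize_age(age, width):
    """Replace an exact age with a range of the given width, e.g. 23 -> '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(size >= k for size in groups.values())


# Hypothetical microdata, invented for this example.
records = [
    {"age": 23, "zip": "47901", "diagnosis": "flu"},
    {"age": 27, "zip": "47901", "diagnosis": "asthma"},
    {"age": 34, "zip": "47902", "diagnosis": "flu"},
    {"age": 38, "zip": "47902", "diagnosis": "diabetes"},
]

# Generalize the age attribute to 10-year ranges; other fields are left unchanged.
generalized = [{**r, "age": generalize_age(r["age"], 10)} for r in records]

raw_ok = is_k_anonymous(records, ["age", "zip"], k=2)
generalized_ok = is_k_anonymous(generalized, ["age", "zip"], k=2)
```

With exact ages every record is unique, so the raw table is not 2-anonymous; after generalizing to 10-year ranges, each (age range, zip) group holds two records. In a MapReduce setting the group counting would be the natural step to distribute.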

    Data Anonymization for Privacy Preservation in Big Data

    Cloud computing provides scalable IT infrastructure to support the processing of various big data applications in sectors such as healthcare and business. Data sets in such applications, mainly electronic health records, generally contain privacy-sensitive data. The most popular technique for data privacy preservation is anonymizing the data through generalization. We propose to examine the problem of proximity privacy breaches in big data anonymization and to identify a scalable solution to it. A scalable two-phase clustering approach, consisting of a clustering algorithm and a k-anonymity scheme with generalization and suppression, is intended to address this problem. The algorithms are designed with MapReduce to achieve high scalability through data-parallel execution in the cloud. Extensive experiments on real data sets substantiate that the method significantly improves the capability of defending against proximity privacy breaches, as well as the scalability and efficiency of anonymization, over existing methods. Anonymizing data sets through generalization to satisfy privacy requirements such as k-anonymity is a widely used category of privacy-preserving methods. Currently, the scale of data in many cloud applications is increasing tremendously in accordance with the Big Data trend, making it a challenge for commonly used tools to capture, manage, and process large-scale data within an acceptable time. Hence, it is a challenge for existing anonymization approaches to achieve privacy preservation for big data because of scalability issues.

    BIG DATA ANALYTICS - AN OVERVIEW

    Big Data Analytics has been gaining more attention recently, since researchers in business and academia are trying to effectively mine and use all possible knowledge from the vast amount of data generated and collected. Traditional data analysis methods stumble when faced with large amounts of data in a short period of time, which demands a paradigm shift in the storage, processing and analysis of Big Data. Because of its importance, many agencies, including the U.S. government, have in recent years released large funds for research in Big Data and related fields. This paper gives a concise summary of research progress in various areas related to big data processing and analysis and concludes with a discussion of research directions in the same areas.

    Local and global recoding methods for anonymizing set-valued data

    In this paper, we study the problem of protecting privacy in the publication of set-valued data. Consider a collection of supermarket transactions that contains detailed information about items bought together by individuals. Even after removing all personal characteristics of the buyer, which can serve as links to his identity, the publication of such data is still subject to privacy attacks from adversaries who have partial knowledge about the set. Unlike most previous works, we do not distinguish data as sensitive and non-sensitive, but we consider them both as potential quasi-identifiers and potential sensitive data, depending on the knowledge of the adversary. We define a new version of the k-anonymity guarantee, k^m-anonymity, to limit the effects of the data dimensionality, and we propose efficient algorithms to transform the database. Our anonymization model relies on generalization instead of suppression, which is the most common practice in related works on such data. We develop an algorithm that finds the optimal solution, but at a high cost that makes it inapplicable to large, realistic problems. Then, we propose a greedy heuristic, which performs generalizations in an Apriori, level-wise fashion. The heuristic scales much better and in most cases finds a solution close to the optimal. Finally, we investigate the application of techniques that partition the database and perform anonymization locally, aiming at reducing memory consumption and improving scalability. A thorough experimental evaluation with real datasets shows that a vertical partitioning approach achieves excellent results in practice. © 2010 Springer-Verlag.
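
The k^m-anonymity guarantee can be read as follows: an adversary who knows up to m items of a transaction should not be able to narrow it down to fewer than k published transactions. The sketch below is a brute-force check of that reading, not the optimal algorithm or the Apriori-based heuristic from the paper; the toy transactions, the item-to-category mapping, and the helpers `is_km_anonymous` and `generalize` are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def is_km_anonymous(transactions, k, m):
    """True if every itemset of size <= m that occurs in some transaction
    is contained in at least k transactions (brute-force check)."""
    support = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, m + 1):
            support.update(combinations(items, size))
    return all(count >= k for count in support.values())


def generalize(transactions, mapping):
    """Replace each item by its more general category where the mapping defines one."""
    return [{mapping.get(item, item) for item in t} for t in transactions]


# Hypothetical supermarket data, invented for this example.
transactions = [
    {"skim milk", "bread"},
    {"whole milk", "bread"},
    {"skim milk"},
    {"whole milk", "bread"},
]

# Assumed generalization hierarchy: both milk types roll up to 'milk'.
mapping = {"skim milk": "milk", "whole milk": "milk"}

before = is_km_anonymous(transactions, k=2, m=2)
after = is_km_anonymous(generalize(transactions, mapping), k=2, m=2)
```

In the raw data the pair ("bread", "skim milk") appears in only one transaction, so the check fails for k = 2, m = 2; after both milk items are generalized, every itemset of size at most 2 appears at least twice. The brute-force enumeration grows quickly with m, which is the kind of cost a level-wise heuristic is meant to avoid.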