27 research outputs found

    Hybrid Anonymization Technique For Improving The Privacy In Network Data

    There has been considerable research over the last decades on methods for limiting disclosure in data publishing, particularly over the last twenty years in computer science. Researchers have studied the problem of publishing microdata or network data without revealing sensitive information, with the aim of preserving information privacy. Organizations often want to publish their data for research, advertising or prediction purposes, but doing so exposes them to information loss and a lack of privacy. Several techniques have received attention, including K-anonymity, l-diversity, generalization, clustering and randomization, but most of them are not comprehensive: the chance of losing information remains high, and they may still leak private information from the original data. The contribution of this research is a hybrid anonymization technique that improves the protection and privacy of data. As a more comprehensive solution, it also decreases the loss of information. The methodology has four major phases. The first phase is an overview of the entire research process, and the second describes the anonymization process and techniques. The third phase describes the design and modules of the system, and the fourth presents the comparison methods designed in this study. The research makes two main contributions. The first is a new technique for anonymizing network data using the hybrid approach; the second is a profile of a hybrid anonymization technique based on K-anonymity, l-diversity, generalization, clustering and randomization.
It is quite difficult to identify the best anonymization technique. For this reason, the researcher analyzes, summarizes and profiles the anonymization techniques in detail. The researcher identifies several opportunities to advance this research in the near future, such as implementing real-time anonymization. However, this type of processing would require a redesign from the architecture through to the data processing stage, and it becomes more challenging, in either real-time or batch mode, if an optimization variable is to be used in the anonymization process. In addition, profiling the anonymization techniques will help the researcher propose a generalization technique that could be used to anonymize either microdata or network data
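As a rough illustration of two of the building blocks named in this abstract, the sketch below checks whether a table satisfies K-anonymity and l-diversity. It is not the hybrid technique the thesis proposes. The record layout, the field names (`subnet`, `port_class`, `alert`) and the function names are all assumptions made for the example.

```python
from collections import defaultdict


def group_by_quasi_identifiers(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record)
    return groups


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier group holds at least k records."""
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return all(len(group) >= k for group in groups.values())


def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every group contains at least l distinct sensitive values."""
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return all(
        len({record[sensitive] for record in group}) >= l
        for group in groups.values()
    )


# Hypothetical, already-generalized network records (illustrative only).
records = [
    {"subnet": "10.0.1.0/24", "port_class": "web", "alert": "scan"},
    {"subnet": "10.0.1.0/24", "port_class": "web", "alert": "none"},
    {"subnet": "10.0.2.0/24", "port_class": "mail", "alert": "none"},
    {"subnet": "10.0.2.0/24", "port_class": "mail", "alert": "none"},
]
quasi = ["subnet", "port_class"]

print(is_k_anonymous(records, quasi, k=2))
print(is_l_diverse(records, quasi, "alert", l=2))
```

In this toy table every group has two records, so it is 2-anonymous, but the second group carries a single `alert` value, so it is not 2-diverse. That gap is the kind of weakness that motivates combining techniques.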

    The Challenges of Effectively Anonymizing Network Data

    The availability of realistic network data plays a significant role in fostering collaboration and ensuring U.S. technical leadership in network security research. Unfortunately, a host of technical, legal, policy, and privacy issues limit the ability of operators to produce datasets for information security testing. In an effort to help overcome these limitations, several data collection efforts (e.g., CRAWDAD [14], PREDICT [34]) have been established in the past few years. The key principle used in all of these efforts to assure low-risk, high-value data is that of trace anonymization: the process of sanitizing data before release so that potentially sensitive information cannot be extracted

    Disclosure Risk from Homogeneity Attack in Differentially Private Frequency Distribution

    Differential privacy (DP) provides a robust model to achieve privacy guarantees for released information. We examine the protection potency of multi-dimensional frequency distributions sanitized via DP randomization mechanisms against homogeneity attack (HA). HA allows adversaries to obtain the exact values of sensitive attributes for their targets without having to identify them in the released data. We propose measures for disclosure risk from HA and derive closed-form relationships between the privacy loss parameters in DP and the disclosure risk from HA. These closed-form relationships help in understanding the abstract concepts of DP and privacy loss parameters by putting them in the context of a concrete privacy attack, and they offer a perspective for choosing privacy loss parameters when employing DP mechanisms for information sanitization and release in practice. We apply the closed-form relationships to real-life datasets to demonstrate the assessment of disclosure risk due to HA on differentially private sanitized frequency distributions at various privacy loss parameters
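To make the homogeneity attack concrete, the following minimal sketch scans a released frequency table for cells in which every individual shares one sensitive value. It does not reproduce the paper's risk measures or closed-form relationships. The table layout, the cell labels and the function name `homogeneous_cells` are assumptions made for the example, and counts are taken at face value as released.

```python
def homogeneous_cells(freq):
    """Return {cell: sensitive_value} for cells with exactly one
    sensitive value that has a positive count.

    `freq` maps (cell, sensitive_value) -> released count, where `cell`
    is the combination of quasi-identifier values.
    """
    values_by_cell = {}
    for (cell, value), count in freq.items():
        if count > 0:
            values_by_cell.setdefault(cell, {})[value] = count
    return {
        cell: next(iter(values))
        for cell, values in values_by_cell.items()
        if len(values) == 1
    }


# Hypothetical released table: (age band, ZIP prefix) x diagnosis.
freq = {
    (("30-39", "021*"), "flu"): 3,
    (("30-39", "021*"), "asthma"): 0,
    (("40-49", "021*"), "flu"): 2,
    (("40-49", "021*"), "asthma"): 1,
}

print(homogeneous_cells(freq))
```

An adversary who knows that a target falls in the first cell learns the diagnosis without re-identifying any single record. Noise added by a DP mechanism can create or hide such cells, which is the effect the disclosure-risk measures are meant to quantify.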

    A linear reconstruction approach for attribute inference attacks against synthetic data

    Recent advances in synthetic data generation (SDG) have been hailed as a solution to the difficult problem of sharing sensitive data while protecting privacy. SDG aims to learn statistical properties of real data in order to generate "artificial" data that are structurally and statistically similar to sensitive data. However, prior research suggests that inference attacks on synthetic data can undermine privacy, but only for specific outlier records. In this work, we introduce a new attribute inference attack against synthetic data. The attack is based on linear reconstruction methods for aggregate statistics, which target all records in the dataset, not only outliers. We evaluate our attack on state-of-the-art SDG algorithms, including Probabilistic Graphical Models, Generative Adversarial Networks, and recent differentially private SDG mechanisms. By defining a formal privacy game, we show that our attack can be highly accurate even on arbitrary records, and that this is the result of individual information leakage (as opposed to population-level inference). We then systematically evaluate the tradeoff between protecting privacy and preserving statistical utility. Our findings suggest that current SDG methods cannot consistently provide sufficient privacy protection against inference attacks while retaining reasonable utility. The best method evaluated, a differentially private SDG mechanism, can provide both protection against inference attacks and reasonable utility, but only in very specific settings. Lastly, we show that releasing a larger number of synthetic records can improve utility but at the cost of making attacks far more effective

    Differential Privacy - A Balancing Act

    Data privacy is an ever-important aspect of data analysis. Historically, a plethora of privacy techniques have been introduced to protect data, but few have stood the test of time. From investigating the overlap between big data research and security and privacy research, I have found that differential privacy presents itself as a promising defender of data privacy. Differential privacy is a rigorous, mathematical notion of privacy. Nevertheless, privacy comes at a cost. In order to achieve differential privacy, we need to introduce some form of inaccuracy (i.e. error) into our analyses. Hence, practitioners need to engage in a balancing act between accuracy and privacy when adopting differential privacy. As a consequence, understanding this accuracy/privacy trade-off is vital to being able to use differential privacy in real data analyses. In this thesis, I aim to bridge the gap between differential privacy in theory and differential privacy in practice. Most notably, I aim to convey a better understanding of the accuracy/privacy trade-off by 1) implementing tools to tweak accuracy/privacy in a real use case, 2) presenting a methodology for empirically predicting error, and 3) systematizing and analyzing known accuracy improvement techniques for differentially private algorithms. Additionally, I put differential privacy into context by investigating how it can be applied in the automotive domain. Using the automotive domain as an example, I introduce the main challenges that constitute the balancing act, and provide advice for moving forward
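The accuracy/privacy trade-off can be illustrated with the standard Laplace mechanism applied to a counting query. This is a minimal sketch, not the tooling or error-prediction methodology from the thesis. The function names, the true count of 100 and the trial count are assumptions, and naive floating-point sampling like this is for illustration only, not for production use.

```python
import random


def laplace_scale(sensitivity, epsilon):
    """Noise scale b = sensitivity / epsilon for the Laplace mechanism."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two independent
    exponential variables that each have mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise calibrated to epsilon."""
    return true_count + laplace_noise(laplace_scale(sensitivity, epsilon), rng)


def mean_abs_error(epsilon, trials=20000, seed=0):
    """Empirical mean absolute error of the noisy count."""
    rng = random.Random(seed)
    true_count = 100  # hypothetical query answer
    total = sum(
        abs(private_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    )
    return total / trials


for eps in (0.1, 1.0):
    print(eps, laplace_scale(1.0, eps), round(mean_abs_error(eps), 2))
```

The expected absolute error of Laplace noise equals its scale, so lowering epsilon from 1.0 to 0.1 (stronger privacy) raises the expected error on a count roughly tenfold.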

    Prediction, evolution and privacy in social and affiliation networks

    In the last few years, there has been a growing interest in studying online social and affiliation networks, leading to a new category of inference problems that consider the actor characteristics and their social environments. These problems have a variety of applications, from creating more effective marketing campaigns to designing better personalized services. Predictive statistical models allow learning hidden information automatically in these networks but also bring many privacy concerns. Three of the main challenges that I address in my thesis are understanding 1) how the complex observed and unobserved relationships among actors can help in building better behavior models, and in designing more accurate predictive algorithms, 2) what are the processes that drive the network growth and link formation, and 3) what are the implications of predictive algorithms to the privacy of users who share content online. The majority of previous work in prediction, evolution and privacy in online social networks has concentrated on the single-mode networks which form around user-user links, such as friendship and email communication. However, single-mode networks often co-exist with two-mode affiliation networks in which users are linked to other entities, such as social groups, online content and events. We study the interplay between these two types of networks and show that analyzing these higher-order interactions can reveal dependencies that are difficult to extract from the pair-wise interactions alone. In particular, we present our contributions to the challenging problems of collective classification, link prediction, network evolution, anonymization and preserving privacy in social and affiliation networks. We evaluate our models on real-world data sets from well-known online social networks, such as Flickr, Facebook, Dogster and LiveJournal

    Privacy Enhancing Technologies for solving the privacy-personalization paradox : taxonomy and survey

    Personal data are often collected and processed in a decentralized fashion, within different contexts. For instance, with the emergence of distributed applications, several providers are usually correlating their records and providing personalized services to their clients. Collected data include geographical and indoor positions of users, their movement patterns, and sensor-acquired data that may reveal users’ physical conditions, habits and interests. This may lead to undesired consequences such as unsolicited advertisement and even discrimination and stalking. To mitigate privacy threats, several techniques have emerged, referred to as Privacy Enhancing Technologies, PETs for short. On the one hand, the increasing pressure on service providers to protect users’ privacy has resulted in PETs being adopted. On the other hand, service providers have built their business models on personalized services, e.g. targeted ads and news. The objective of the paper is then to identify which PETs have the potential to satisfy both the economic and the ethical purposes, which usually diverge. This paper proposes a taxonomy classifying eight categories of PETs into three groups and, for better clarity, considers three categories of personalized services. After defining and presenting the main features of PETs with illustrative examples, the paper points out which PETs best fit each personalized service category. It then discusses some of the interdisciplinary privacy challenges that may slow down the adoption of these techniques, namely technical, social, legal and economic concerns. Finally, it provides recommendations and highlights several research directions

    Toward Privacy in High-Dimensional Data Publishing

    Nowadays data sharing among multiple parties has become inevitable in various application domains for diverse reasons, such as decision support, policy development and data mining. Yet, data in its raw format often contains person-specific sensitive information, and publishing such data without proper protection may jeopardize individual privacy. This fact has spawned extensive research on privacy-preserving data publishing (PPDP), which balances the fundamental trade-off between individual privacy and the utility of published data. Early research of PPDP focuses on protecting private and sensitive information in relational and statistical data. However, the recent prevalence of several emerging types of high-dimensional data has rendered unique challenges that prevent traditional PPDP techniques from being directly used. In this thesis, we address the privacy concerns in publishing four types of high-dimensional data, namely set-valued data, trajectory data, sequential data and network data. We develop effective and efficient non-interactive data publishing solutions for various utility requirements. Most of our solutions satisfy a rigorous privacy guarantee known as differential privacy, which has been the de facto standard for privacy protection. This thesis demonstrates that our solutions have exhibited great promise for releasing useful high-dimensional data without endangering individual privacy