
    SoK: Chasing Accuracy and Privacy, and Catching Both in Differentially Private Histogram Publication

    Histograms and synthetic data are of key importance in data analysis. However, researchers have shown that even aggregated data such as histograms, containing no obvious sensitive attributes, can result in privacy leakage. To enable data analysis, a strong notion of privacy is required to avoid risking unintended privacy violations. Such a strong notion of privacy is differential privacy, a statistical notion of privacy that makes privacy leakage quantifiable. The caveat regarding differential privacy is that while it has strong guarantees for privacy, privacy comes at a cost of accuracy. Despite this trade-off being a central and important issue in the adoption of differential privacy, the literature lacks a clear account of the trade-off and of how to address it appropriately. Through a systematic literature review (SLR), we investigate the state of the art in accuracy-improving differentially private algorithms for histogram and synthetic data publishing. Our contribution is two-fold: 1) we identify trends and connections in the contributions to the field of differential privacy for histograms and synthetic data, and 2) we provide an understanding of the privacy/accuracy trade-off challenge by crystallizing different dimensions of accuracy improvement. Accordingly, we position and visualize the ideas in relation to each other and to external work, and deconstruct each algorithm to examine its building blocks separately, with the aim of pinpointing which dimension of accuracy improvement each technique/approach targets. Hence, this systematization of knowledge (SoK) provides an understanding of the dimensions in which, and the ways in which, accuracy improvement can be pursued without sacrificing privacy.
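
    For context, a minimal sketch (mine, not from the paper) of the baseline mechanism that the surveyed accuracy-improving algorithms build on: publishing a histogram under epsilon-differential privacy by adding Laplace noise of scale 1/epsilon to each bin, since one individual changes at most one bin count by 1.

        import numpy as np

        def dp_histogram(data, bins, epsilon):
            # One individual contributes a single record to a single bin, so the
            # histogram's L1 sensitivity is 1 and Laplace noise of scale
            # 1/epsilon yields epsilon-differential privacy.
            counts, edges = np.histogram(data, bins=bins)
            noisy = counts + np.random.laplace(scale=1.0 / epsilon, size=counts.shape)
            # Clipping negative counts is post-processing, so DP is preserved.
            return np.clip(noisy, 0, None), edges

        ages = np.random.randint(18, 90, size=10_000)   # illustrative data
        noisy_counts, bin_edges = dp_histogram(ages, bins=10, epsilon=0.5)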

    Differential Privacy - A Balancing Act

    Data privacy is an ever more important aspect of data analysis. Historically, a plethora of privacy techniques have been introduced to protect data, but few have stood the test of time. From investigating the overlap between big data research and security and privacy research, I have found that differential privacy presents itself as a promising defender of data privacy. Differential privacy is a rigorous, mathematical notion of privacy. Nevertheless, privacy comes at a cost: in order to achieve differential privacy, we need to introduce some form of inaccuracy (i.e. error) into our analyses. Hence, practitioners need to engage in a balancing act between accuracy and privacy when adopting differential privacy. As a consequence, understanding this accuracy/privacy trade-off is vital to being able to use differential privacy in real data analyses. In this thesis, I aim to bridge the gap between differential privacy in theory and differential privacy in practice. Most notably, I aim to convey a better understanding of the accuracy/privacy trade-off by 1) implementing tools to tweak accuracy/privacy in a real use case, 2) presenting a methodology for empirically predicting error, and 3) systematizing and analyzing known accuracy improvement techniques for differentially private algorithms. Additionally, I put differential privacy into context by investigating how it can be applied in the automotive domain. Using the automotive domain as an example, I introduce the main challenges that constitute the balancing act, and provide advice for moving forward.
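
    A small illustration of the balancing act described above (my example, not the thesis's tooling): for a single Laplace-noised count the expected absolute error is 1/epsilon, so tightening privacy (smaller epsilon) directly and predictably inflates error.

        import numpy as np

        # E[|Laplace(0, 1/eps)|] = 1/eps, so halving the privacy budget
        # doubles the expected error of a noisy count.
        for eps in (2.0, 1.0, 0.5, 0.1):
            noise = np.random.laplace(scale=1.0 / eps, size=100_000)
            print(f"eps={eps:>4}: empirical E|error|={np.abs(noise).mean():7.2f}  theory={1.0 / eps:7.2f}")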

    AutoLog: A Log Sequence Synthesis Framework for Anomaly Detection

    The rapid progress of modern computing systems has led to a growing interest in informative run-time logs. Various log-based anomaly detection techniques have been proposed to ensure software reliability. However, their adoption in industry has been limited by the lack of high-quality public log resources to serve as training datasets. While some log datasets are available for anomaly detection, they suffer from limitations in (1) comprehensiveness of log events; (2) scalability over diverse systems; and (3) flexibility of log utility. To address these limitations, we propose AutoLog, the first automated log generation methodology for anomaly detection. AutoLog uses program analysis to generate run-time log sequences without actually running the system. AutoLog starts by probing the comprehensive set of logging statements associated with the call graphs of an application. Then, it constructs execution graphs for each method, after pruning the call graphs, to find log-related execution paths in a scalable manner. Finally, AutoLog propagates the anomaly label to each acquired execution path based on human knowledge. It generates flexible log sequences by walking along the log execution paths with controllable parameters. Experiments on 50 popular Java projects show that AutoLog acquires significantly more (9x-58x) log events than existing log datasets from the same systems, and generates log messages much faster (15x) on a single machine than existing passive data collection approaches. We hope AutoLog can facilitate the benchmarking and adoption of automated log analysis techniques. Comment: The paper has been accepted by ASE 2023 (Research Track).
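
    A hypothetical sketch of the final step described in the abstract: generating log sequences by walking an execution graph whose nodes carry logging statements. The graph layout, the 'log_template' attribute, and the walk policy are illustrative assumptions, not the paper's actual API.

        import random
        import networkx as nx

        def walk_log_sequences(exec_graph: nx.DiGraph, entry, max_len=50, n_sequences=10):
            # Randomly walk the execution graph from the entry method, emitting
            # the log template attached to each visited node (if any).
            # Attribute name "log_template" is an assumption for illustration.
            sequences = []
            for _ in range(n_sequences):
                node, seq = entry, []
                for _ in range(max_len):
                    template = exec_graph.nodes[node].get("log_template")
                    if template:                      # only some nodes log
                        seq.append(template)
                    successors = list(exec_graph.successors(node))
                    if not successors:                # reached an exit path
                        break
                    node = random.choice(successors)  # controllable walk policy
                sequences.append(seq)
            return sequences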

    Privacy-preserving social network analysis

    Data privacy in social networks is a growing concern that threatens to limit access to important information contained in these data structures. Analysis of the graph structure of social networks can provide valuable information for revenue generation and social science research, but unfortunately, ensuring this analysis does not violate individual privacy is difficult. Simply removing obvious identifiers from graphs, or even releasing only aggregate results of analysis, may not provide sufficient protection. Differential privacy is an alternative privacy model, popular in data mining over tabular data, that uses noise to obscure individuals' contributions to aggregate results and offers a strong mathematical guarantee that individuals' presence in the dataset is hidden. Analyses that were previously vulnerable to identification of individuals and extraction of private data may be safely released under differential privacy guarantees. However, existing adaptations of differential privacy to social network analysis are often complex and have considerable impact on the utility of the results, making it less likely that they will see widespread adoption in the social network analysis world. In fact, social scientists still often use the weakest form of privacy protection, simple anonymization, in their social network analysis publications. We review the existing work in graph privatization, including the two existing standards for adapting differential privacy to network data. We then propose contributor-privacy and partition-privacy, novel standards for differential privacy over network data, and introduce simple, powerful private algorithms using these standards for common network analysis techniques that were infeasible to privatize under previous differential privacy standards. We also ensure that privatized social network analysis does not violate the level of rigor required in social science research, by proposing a method of determining statistical significance for paired samples under differential privacy using the Wilcoxon Signed-Rank Test, which is appropriate for non-normally distributed data. Finally, we return to formally consider the case where differential privacy is not applied to data. Naive, deterministic approaches to privacy protection, including anonymization and aggregation of data, are often used in real-world practice. De-anonymization research demonstrates that some naive approaches to privacy are highly vulnerable to re-identification attacks, and none of these approaches offer the robust guarantee of differential privacy. However, we propose that these methods fall across a range of protection: some are better than others. In cases where adding noise to data is especially problematic, or acceptance and adoption of differential privacy is especially slow, it is critical to have a formal understanding of the alternatives. We define De Facto Privacy, a metric for comparing the relative privacy protection provided by deterministic approaches.
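
    For reference, a minimal sketch of the standard edge-differential-privacy baseline that such graph work builds on (not the proposed contributor-privacy or partition-privacy algorithms): releasing a graph's edge count, whose sensitivity under single-edge changes is 1.

        import numpy as np
        import networkx as nx

        def dp_edge_count(graph: nx.Graph, epsilon: float) -> float:
            # Adding or removing one edge changes the count by exactly 1, so
            # Laplace noise of scale 1/epsilon gives epsilon-edge-DP.
            return graph.number_of_edges() + np.random.laplace(scale=1.0 / epsilon)

        g = nx.erdos_renyi_graph(n=500, p=0.02, seed=42)  # illustrative graph
        print(dp_edge_count(g, epsilon=1.0))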

    LDP-IDS: Local Differential Privacy for Infinite Data Streams

    Streaming data collection is essential to real-time data analytics in various IoT and mobile-device-based systems, which, however, may expose end users' privacy. Local differential privacy (LDP) is a promising solution for privacy-preserving data collection and analysis. However, the few existing LDP studies on streams are either applicable to finite streams only or suffer from insufficient protection. This paper investigates this problem by proposing LDP-IDS, a novel w-event LDP paradigm that provides a practical privacy guarantee for infinite streams at the user end, adapting the popular budget-division framework from centralized differential privacy (CDP). By constructing a unified error analysis for LDP, we first develop two adaptive budget-division-based LDP methods for LDP-IDS that can enhance data utility by leveraging the non-deterministic sparsity in streams. Beyond that, we further propose a novel population-division framework that not only avoids the high sensitivity of LDP noise to budget division but also requires significantly less communication. Based on this framework, we also present two adaptive population-division methods for LDP-IDS with theoretical analysis. We conduct extensive experiments on synthetic and real-world datasets to evaluate the effectiveness and efficiency of the proposed frameworks and methods. Experimental results demonstrate that, despite the effectiveness of the adaptive budget-division methods, the proposed population-division framework and methods can achieve much higher effectiveness and efficiency. Comment: accepted to SIGMOD'2
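
    A toy sketch of the budget-division idea that LDP-IDS adapts (the uniform baseline, not the paper's adaptive methods): under w-event privacy, each of the w timestamps in any sliding window may spend epsilon/w of the budget, so every window of w consecutive releases stays within the total budget epsilon.

        import numpy as np

        def uniform_w_event_stream(stream, epsilon, w):
            # Uniform budget division: each timestamp spends epsilon/w, so any
            # w consecutive releases together consume at most epsilon.
            # For simplicity this perturbs aggregate counts (CDP-style);
            # LDP-IDS instead perturbs each user's report locally.
            eps_per_ts = epsilon / w
            for true_count in stream:
                yield true_count + np.random.laplace(scale=1.0 / eps_per_ts)

        counts = [120, 118, 97, 140, 133, 125]  # illustrative per-timestamp counts
        noisy = list(uniform_w_event_stream(counts, epsilon=1.0, w=3))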

    Emerging model species driven by transcriptomics

    This work is focused on 'emerging model species', i.e. question-driven model species which have sufficient molecular resources to investigate a specific phenomenon in molecular biology, developmental biology, molecular ecology and evolution, or related molecular fields. This thesis shows how transcriptomic data can be generated, analyzed, and used to investigate such phenomena of interest even in species lacking a reference genome. The initial ButterflyBase resource has proven useful to researchers of species without a reference genome, but it is limited to the Lepidoptera and supports only the older Sanger sequencing technologies. Thanks to Next Generation Sequencing, transcriptome sequencing is more cost-effective, but the bottleneck of transcriptomic projects is now the bioinformatic analysis and data mining/dissemination. Therefore, this work continues by presenting novel and innovative approaches which effectively overcome this bottleneck. The est2assembly software produces deeply annotated reference transcriptomes stored in the Chado database. The Drupal Bioinformatic Server Framework and genes4all provide a species-neutral and innovative approach to building standardized online databases and associated web services. All public insect mRNA data were analyzed with est2assembly and genes4all to produce InsectaCentral. With InsectaCentral, a powerful resource is now available to assist molecular biology in any question-driven model insect species. The software presented here was developed according to the specifications of the General Model Organism Database (GMOD) community. All software specifications are species-neutral and can be seamlessly deployed to assist any research community. Further, a case-studies chapter makes it apparent that the transcriptomic approach is more cost-effective than a genomic approach, and that sequence-driven evolutionary biology will therefore benefit more quickly from this field.