113 research outputs found

    Avoiding disclosure of individually identifiable health information: a literature review

    Achieving data and information dissemination without harming anyone is a central task of any entity in charge of collecting data. In this article, the authors examine the literature on data and statistical confidentiality. Rather than comparing the theoretical properties of specific methods, they emphasize the main themes that emerge from the ongoing discussion among scientists regarding how best to achieve the appropriate balance between data protection, data utility, and data dissemination. They cover the literature on de-identification and reidentification methods with emphasis on health care data. The authors also discuss the benefits and limitations of the most common access methods. Although there is abundant theoretical and empirical research, their review reveals a lack of consensus on fundamental questions for empirical practice: how to assess disclosure risk, how to choose among disclosure methods, how to assess reidentification risk, and how to measure utility loss.
    Keywords: public use files, disclosure avoidance, reidentification, de-identification, data utility

    Artificial intelligence: the end of legal protection of personal data and intellectual property? : research on the countering effects of data protection and IPR on the regulation of artificial intelligence systems

    Artificial Intelligence systems have gained notoriety for changing (and having great potential to further change) the way we live. The use of AI impacts the rights and freedoms of natural persons, necessitating the revision of various laws relevant to AI. This research considers the intersection of data protection and intellectual property law as it impacts the rights and freedoms of natural persons. It argues that data protection and intellectual property law interrelate in such a manner that the (non-)regulation of one legal field might (negatively) impact the other. It examines some of these issues (including data reidentification) and further proposes redefining the concept of personal data as a means of ensuring that the application of data protection and intellectual property law to AI does not limit the development, adoption, and use of AI.

    Archives and Records

    This open access book addresses the protection of privacy and personality rights in public records, records management, historical sources, and archives, as well as historical and current access to them, in a broad international comparative perspective. Considering the question “can archiving pose a security risk to the protection of sensitive data and human rights?”, it analyses data security and presents several significant cases of the misuse of sensitive personal data, such as census data or medical records. It examines archival inflation and the minimisation and reduction of data in public records and archives, including data anonymisation and pseudonymisation, and the risks of deanonymisation and reidentification of persons. The book looks at post-mortem privacy protection and the relationship between the right to know and the right to be forgotten, and introduces a specific model of four categories of the right to be forgotten. In its conclusion, the book presents a set of recommendations for archives and records management.

    Data Summarizations for Scalable, Robust and Privacy-Aware Learning in High Dimensions

    The advent of large-scale datasets has offered unprecedented amounts of information for building statistically powerful machines but, at the same time, has also introduced a remarkable computational challenge: how can we efficiently process massive data? This thesis presents a suite of data reduction methods that make learning algorithms scale on large datasets by extracting a succinct model-specific representation that summarizes the full data collection: a coreset. By design, our frameworks support datasets of arbitrary dimensionality and can be used for general-purpose Bayesian inference under real-world constraints, including privacy preservation and robustness to outliers, encompassing diverse uncertainty-aware data analysis tasks such as density estimation, classification and regression. We first motivate the need for novel data reduction techniques by developing a reidentification attack on coarsened representations of private behavioural data. Analysing longitudinal records of human mobility, we detect privacy-revealing structural patterns that remain preserved in reduced graph representations of individuals’ information with manageable size. These unique patterns enable mounting linkage attacks via structural similarity computations on longitudinal mobility traces, revealing an overlooked, yet existing, privacy threat. We then propose a scalable variational inference scheme for approximating posteriors on large datasets via learnable weighted pseudodata, termed pseudocoresets. We show that the use of pseudodata overcomes the constraints on minimum summary size for a given approximation quality, which are imposed on all existing Bayesian coreset constructions due to data dimensionality. Moreover, it allows us to develop a scheme for pseudocoreset-based summarization that satisfies the standard framework of differential privacy by construction; in this way, we can release reduced-size privacy-preserving representations for sensitive datasets that are amenable to arbitrary post-processing. Subsequently, we consider summarizations for large-scale Bayesian inference in scenarios where observed datapoints depart from the statistical assumptions of our model. Using robust divergences, we develop a method for constructing coresets resilient to model misspecification. Crucially, this method is able to automatically discard outliers from the generated data summaries. Thus, we deliver robustified scalable representations for inference that are suitable for applications involving contaminated and unreliable data sources. We demonstrate the performance of the proposed summarization techniques on multiple parametric statistical models and on diverse simulated and real-world datasets, from music genre features to hospital readmission records, considering a wide range of data dimensionalities.
    Nokia Bell Labs, Lundgren Fund, Darwin College, University of Cambridge Department of Computer Science & Technology, University of Cambridg