7 research outputs found

    Spectral anonymization of data

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 87-96).

    Data anonymization is the process of conditioning a dataset such that no sensitive information can be learned about any specific individual, yet valid scientific analysis can still be performed on it. It is not sufficient to simply remove identifying information, because the remaining data may be enough to infer the individual source of a record (a reidentification disclosure) or to otherwise learn sensitive information about a person (a predictive disclosure). The only known way to prevent these disclosures is to remove additional information from the dataset. Dozens of anonymization methods have been proposed over the past few decades; most work by perturbing or suppressing variable values. None has been successful at simultaneously providing perfect privacy protection and allowing perfectly accurate scientific analysis. This dissertation makes the new observation that the anonymizing operations do not need to be made in the original basis of the dataset. Operating in a different, judiciously chosen basis can improve privacy protection, analytic utility, and computational efficiency. I use the term 'spectral anonymization' to refer to anonymizing in a spectral basis, such as the basis provided by the data's eigenvectors. Additionally, I propose new measures of reidentification and prediction risk that are more generally applicable and more informative than existing measures. I also propose a measure of analytic utility that assesses the preservation of the multivariate probability distribution. Finally, I propose the demanding reference standard of nonparticipation in the study to define adequate privacy protection. I give three examples of spectral anonymization in practice. The first example improves basic cell swapping from a weak algorithm to one competitive with state-of-the-art methods merely by a change of basis. The second example demonstrates avoiding the curse of dimensionality in microaggregation. The third describes a powerful algorithm that reduces computational disclosure risk to the same level as that of nonparticipants and preserves at least 4th-order interactions in the multivariate distribution. No previously reported algorithm has achieved this combination of results.

    by Thomas Anton Lasko. Ph.D.
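
    A minimal sketch of the core idea, assuming a purely numeric dataset: obtain a spectral basis from the eigenvectors of the covariance matrix (here via SVD of the centered data), perform independent cell swapping within each spectral coordinate, and map the result back to the original variables. This illustrates the change-of-basis trick only, not the dissertation's exact algorithms, and all names below are my own.

```python
import numpy as np

def spectral_cell_swap(X, seed=None):
    """Toy cell swapping in a spectral (eigenvector) basis.

    Center the data, project onto the principal-component basis,
    independently permute each spectral coordinate across records,
    then map the swapped coordinates back to the original variables.
    """
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=0)
    Xc = X - mu
    # Right singular vectors = eigenvectors of the covariance matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T                    # coordinates in the spectral basis
    for j in range(scores.shape[1]):      # swap values within each coordinate
        scores[:, j] = rng.permutation(scores[:, j])
    return scores @ Vt + mu               # back to the original basis

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=500)
    X_anon = spectral_cell_swap(X, seed=1)
    print(np.cov(X, rowvar=False).round(2))       # original covariance
    print(np.cov(X_anon, rowvar=False).round(2))  # approximately preserved
```

    Because principal-component coordinates are mutually uncorrelated, permuting each one independently approximately preserves the covariance structure while breaking the link between any coordinate value and the record it came from.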

    Routinely collected data for randomized trials: promises, barriers, and implications

    This work was supported by Stiftung Institut für klinische Epidemiologie. The Meta-Research Innovation Center at Stanford University is funded by a grant from the Laura and John Arnold Foundation. The funders had no role in the design and conduct of the study; the collection, management, analysis, or interpretation of the data; or the preparation, review, or approval of the manuscript or its submission for publication. Peer reviewed. Publisher PDF.

    The Use of Routinely Collected Data in Clinical Trial Research

    Randomized controlled trials (RCTs) are the gold standard for assessing the effects of medical interventions, but they also pose many challenges, including the often high costs of conducting them and a potential lack of generalizability of their findings. The recent increase in the availability of so-called routinely collected data (RCD) sources has led to great interest in applying them to support RCTs in an effort to increase the efficiency of conducting clinical trials. We define all RCTs augmented by RCD in any form as RCD-RCTs. A major subset of RCD-RCTs are performed at the point of care using electronic health records (EHRs) and are referred to as point-of-care research (POC-R). RCD-RCTs offer several advantages over traditional trials regarding patient recruitment, data collection, and beyond. Using highly standardized EHR and registry data makes it possible to assess patient characteristics for trial eligibility and to examine treatment effects through routinely collected endpoints or by linkage to other data sources such as mortality registries. Thus, RCD can be used to augment traditional RCTs by providing a sampling framework for patient recruitment and by directly measuring patient-relevant outcomes. The result of these efforts is the generation of real-world evidence (RWE). Nevertheless, the use of RCD in clinical research brings novel methodological challenges, and frequently discussed issues related to data quality need to be considered for RCD-RCTs. Some of the limitations surrounding RCD use in RCTs relate to data quality, data availability, ethical and informed-consent challenges, and a lack of endpoint adjudication, all of which may lead to uncertainty about the validity of their results. The purpose of this thesis is to help fill the aforementioned research gaps in RCD-RCTs, encompassing tasks such as assessing their current application in clinical research and evaluating the methodological and technical challenges in performing them. Furthermore, it aims to assess the reporting quality of published reports on RCD-RCTs.
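
    As a deliberately simplified illustration of the two uses described above, eligibility screening and outcome ascertainment by registry linkage, the sketch below filters a hypothetical EHR extract against invented eligibility criteria and derives a mortality endpoint with a deterministic merge. All column names, thresholds, and the linkage key are assumptions for illustration and do not come from the thesis.

```python
import pandas as pd

# Hypothetical EHR extract and mortality registry; every name and value is illustrative.
ehr = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [67, 54, 72, 48],
    "hba1c": [8.1, 6.4, 9.0, 7.5],
    "on_insulin": [False, False, True, False],
})
registry = pd.DataFrame({
    "patient_id": [1, 3],
    "death_date": pd.to_datetime(["2022-03-01", "2021-11-15"]),
})

# 1) Screen routinely collected characteristics against (invented) eligibility criteria.
eligible = ehr[ehr["age"].between(50, 75) & (ehr["hba1c"] >= 7.5) & ~ehr["on_insulin"]]

# 2) Ascertain a mortality endpoint by deterministic linkage on a shared identifier.
outcomes = eligible.merge(registry, on="patient_id", how="left")
outcomes["died"] = outcomes["death_date"].notna()
print(outcomes[["patient_id", "age", "died"]])
```

    In practice, linkage is often probabilistic or mediated by a trusted third party, and registry coverage and coding quality would need to be assessed before such endpoints are relied upon.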

    Spherical microaggregation: anonymizing sparse vector spaces

    Unstructured text is a very popular data type that is still widely unexplored in the privacy-preserving data mining field. We consider the problem of providing public information about a set of confidential documents. To that end, we have developed a method to protect a Vector Space Model (VSM) so that it can be made public even if the documents it represents are private. The method is inspired by microaggregation, a popular protection method from statistical disclosure control, and adapted to work with sparse and high-dimensional data sets.
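
    A minimal sketch of the general approach, not the paper's exact algorithm: documents are represented as TF-IDF vectors, greedily grouped into clusters of at least k records under cosine (spherical) distance, and each vector is replaced by its group's normalized centroid, so no released vector corresponds to a single confidential document. The grouping heuristic and parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

def spherical_microaggregate(X, k):
    """Replace each group of at least k unit vectors with its normalized centroid."""
    X = normalize(X.toarray() if hasattr(X, "toarray") else X)  # unit-length rows
    X_anon = np.zeros_like(X)
    remaining = list(range(X.shape[0]))          # assumes at least k records in total
    while len(remaining) >= 2 * k:
        R = X[remaining]
        center = normalize(R.mean(axis=0, keepdims=True))
        seed = int(np.argmin(R @ center.T))      # record least similar to the center
        sims = R @ R[seed]                       # cosine similarity to that seed
        group = [remaining[i] for i in np.argsort(-sims)[:k]]  # seed + its k-1 neighbors
        X_anon[group] = normalize(X[group].mean(axis=0, keepdims=True))
        remaining = [r for r in remaining if r not in group]
    # The last k to 2k-1 records form the final group.
    X_anon[remaining] = normalize(X[remaining].mean(axis=0, keepdims=True))
    return X_anon

docs = ["private oncology report", "oncology follow-up note",
        "financial statement for audit", "quarterly financial report",
        "trial eligibility letter", "patient discharge summary"]
X_anon = spherical_microaggregate(TfidfVectorizer().fit_transform(docs), k=2)
print(np.unique(X_anon.round(6), axis=0).shape[0])  # distinct released vectors (3 for 6 docs)
```

    Each released vector stands in for at least k confidential documents, which is the microaggregation guarantee carried over to the sparse, high-dimensional vector space.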

    Collaborative Privacy-Preserving Analysis of Oncological Data using Multiparty Homomorphic Encryption

    Real-world healthcare data sharing is instrumental in constructing broader-based and larger clinical data sets that may improve clinical decision-making research and outcomes. Stakeholders are frequently reluctant to share their data without guaranteed patient privacy, proper protection of their data sets, and control over the usage of their data. Fully homomorphic encryption (FHE) is a cryptographic capability that can address these issues by enabling computation on encrypted data without intermediate decryptions, so that analytic results are obtained without revealing the raw data. This work presents a toolset for collaborative privacy-preserving analysis of oncological data using multiparty FHE. Our toolset supports survival analysis, logistic regression training, and several common descriptive statistics. We demonstrate on oncological data sets that the toolset achieves high accuracy and practical performance that scales well to larger data sets. As part of this work, we propose a novel cryptographic protocol for interactive bootstrapping in multiparty FHE, which is of independent interest. The toolset we develop is general-purpose and can be applied to other collaborative medical and healthcare application domains.
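
    The multiparty FHE used by the toolset is too involved to reproduce here, but the underlying idea, computing on ciphertexts so that raw patient-level values are never revealed during analysis, can be shown with a toy additively homomorphic (Paillier-style) scheme. This is a self-contained sketch of homomorphic pooling of per-site event counts, not the paper's protocol, and the key size is far too small for real use.

```python
import math
import random

def paillier_keygen(bits=256):
    """Toy Paillier key generation (NOT secure: tiny keys, Fermat primality test only)."""
    def rand_prime(b):
        while True:
            p = random.getrandbits(b) | (1 << (b - 1)) | 1
            if all(pow(a, p - 1, p) == 1 for a in (2, 3, 5, 7, 11)):
                return p
    p, q = rand_prime(bits // 2), rand_prime(bits // 2)
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)    # lcm(p-1, q-1)
    g = n + 1
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)       # inverse of L(g^lam mod n^2)
    return (n, g), (lam, mu, n)

def encrypt(pub, m):
    n, g = pub
    r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(priv, c):
    lam, mu, n = priv
    return ((pow(c, lam, n * n) - 1) // n) * mu % n

pub, priv = paillier_keygen()
site_counts = [12, 7, 23]                    # per-site event counts, never shared in plaintext
ciphertexts = [encrypt(pub, m) for m in site_counts]
# Homomorphic addition: multiplying ciphertexts adds the underlying plaintexts.
encrypted_total = math.prod(ciphertexts) % (pub[0] ** 2)
assert decrypt(priv, encrypted_total) == sum(site_counts)
print(decrypt(priv, encrypted_total))        # 42, computed without seeing any single site's count
```

    In multiparty FHE, the decryption capability itself is typically shared across the collaborating parties, so no single institution can decrypt intermediate or final results on its own.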

    Spectral Anonymization of Data

    No full text