
    Avoiding disclosure of individually identifiable health information: a literature review

    Achieving data and information dissemination without harming anyone is a central task of any entity in charge of collecting data. In this article, the authors examine the literature on data and statistical confidentiality. Rather than comparing the theoretical properties of specific methods, they emphasize the main themes that emerge from the ongoing discussion among scientists about how best to achieve the appropriate balance between data protection, data utility, and data dissemination. They cover the literature on de-identification and reidentification methods with an emphasis on health care data. The authors also discuss the benefits and limitations of the most common access methods. Although there is abundant theoretical and empirical research, their review reveals a lack of consensus on fundamental questions for empirical practice: how to assess disclosure risk, how to choose among disclosure methods, how to assess reidentification risk, and how to measure utility loss.
    Keywords: public use files, disclosure avoidance, reidentification, de-identification, data utility
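The disclosure-risk question the review raises can be made concrete with a k-anonymity-style check: count how many records share each quasi-identifier combination, and treat the rarest combination as the worst case. This is a minimal sketch with hypothetical field names, not a method from the review:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest equivalence class over the quasi-identifier
    combination; a record in a class of size k hides among k candidates."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())

def max_reidentification_risk(records, quasi_identifiers):
    """Worst-case probability of correctly re-identifying a record:
    1/k for the rarest quasi-identifier combination."""
    return 1.0 / k_anonymity(records, quasi_identifiers)

# Toy de-identified health records (illustrative fields only).
records = [
    {"age": "30-39", "zip3": "021", "dx": "flu"},
    {"age": "30-39", "zip3": "021", "dx": "asthma"},
    {"age": "40-49", "zip3": "021", "dx": "flu"},
]
print(k_anonymity(records, ["age", "zip3"]))             # 1: the 40-49 record is unique
print(max_reidentification_risk(records, ["age", "zip3"]))  # 1.0
```

A unique quasi-identifier combination (k = 1) is exactly the situation that linkage-based reidentification attacks exploit.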

    Balancing Access to Data and Privacy: A review of the issues and approaches for the future

    Access to sensitive microdata should be provided using remote access data enclaves. These enclaves should be built to facilitate the productive, high-quality use of microdata. In other words, they should support a collaborative environment that facilitates the development and exchange of knowledge about data among data producers and consumers. The experience of the physical and life sciences has shown that it is possible to develop a research community and a knowledge infrastructure around both research questions and the different types of data necessary to answer policy questions. In sum, establishing a virtual organization approach would provide the research community with the ability to move away from individual, or artisan, science towards the more generally accepted community-based approach. Enclaves should include a number of features: metadata documentation capacity, so that knowledge about data can be shared; capacity to add data, so that the data infrastructure can be augmented; communication capacity, such as wikis, blogs, and discussion groups, so that knowledge about the data can be deepened; and incentives for information sharing, so that a community of practice can be built. The opportunity to transform microdata-based research through such an organizational infrastructure could potentially be as far-reaching as the changes that have taken place in the biological and astronomical sciences. It is, however, an open research question how such an organization should be established: whether the approach should be centralized or decentralized. Similarly, it is an open question what the appropriate metrics of success are, and which incentives are best put in place to achieve success.
    Keywords: Methodology for Collecting, Estimating, Organizing Microeconomic Data

    Distribution-Preserving Statistical Disclosure Limitation

    One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed, partially synthetic data sets. These are data on actual respondents, but with confidential data replaced by multiply-imputed synthetic values. A mis-specified imputation model can invalidate inferences because the distribution of synthetic data is completely determined by the model used to generate them. We present two practical methods of generating synthetic values when the imputer has only limited information about the true data generating process. One is applicable when the true likelihood is known up to a monotone transformation. The second requires only limited knowledge of the true likelihood, but nevertheless preserves the conditional distribution of the confidential data, up to sampling error, on arbitrary subdomains. Our method maximizes data utility and minimizes incremental disclosure risk up to posterior uncertainty in the imputation model and sampling error in the estimated transformation. We validate the approach with a simulation and an application to a large linked employer-employee database.
    Keywords: statistical disclosure limitation; confidentiality; privacy; multiple imputation; partially synthetic data
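The idea of preserving conditional distributions on subdomains can be sketched in its simplest form: replace each confidential value with a draw from the empirical distribution of that value inside the record's own subdomain. This is a hot-deck-style simplification of the paper's approach, with hypothetical field names:

```python
import random

def synthesize_within_groups(rows, group_key, confidential_key, seed=0):
    """Partially synthetic release: keep non-confidential fields, and
    replace the confidential value with a draw from the empirical
    distribution of that value within the row's subdomain, so the
    conditional distribution is preserved up to sampling error."""
    rng = random.Random(seed)
    pools = {}
    for r in rows:
        pools.setdefault(r[group_key], []).append(r[confidential_key])
    released = []
    for r in rows:
        s = dict(r)
        s[confidential_key] = rng.choice(pools[r[group_key]])
        released.append(s)
    return released

# Toy linked employer-employee rows (illustrative values only).
rows = [
    {"industry": "retail", "wage": 31000},
    {"industry": "retail", "wage": 29000},
    {"industry": "tech",   "wage": 95000},
    {"industry": "tech",   "wage": 88000},
]
synthetic = synthesize_within_groups(rows, "industry", "wage")
```

Every synthetic wage is drawn from the observed support of its own industry, so subdomain summaries of the released file match the confidential file up to resampling noise.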

    Advancing Microdata Privacy Protection: A Review of Synthetic Data

    Full text link
    Synthetic data generation is a powerful tool for privacy protection when considering public release of record-level data files. Initially proposed about three decades ago, it has generated significant research and application interest. To meet the pressing demand for data privacy protection in a variety of contexts, the field needs more researchers and practitioners. This review provides a comprehensive introduction to synthetic data, including technical details of their generation and evaluation. It also addresses the challenges and limitations of synthetic data, discusses practical applications, and offers thoughts on future work.
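Generation and evaluation can each be illustrated in their simplest forms (these are not the methods surveyed in the review): fit a parametric model to a confidential column, sample a fully synthetic column from the fitted model, and score utility by comparing summary statistics of the two:

```python
import random
import statistics

def synthesize_normal(values, n, seed=1):
    """Fully synthetic release: fit a normal model to the confidential
    values, then draw n synthetic values from the fitted model."""
    rng = random.Random(seed)
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]

def mean_gap(real, synthetic):
    """A crude utility-loss measure: absolute difference in means."""
    return abs(statistics.fmean(real) - statistics.fmean(synthetic))

confidential = [float(x) for x in range(100)]   # toy confidential column
synthetic = synthesize_normal(confidential, 1000)
```

Real evaluation frameworks compare far more than the mean (marginals, correlations, model coefficients) and weigh utility against disclosure risk; this sketch only shows where each piece sits.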

    Modeling Identity Disclosure Risk Estimation Using Kenyan Situation

    Identity disclosure risk is an essential consideration in data anonymization aimed at preserving both privacy and utility. The risk is regionally dependent; therefore, there is a need for a regional empirical approach, in addition to a theoretical approach, in modeling disclosure risk estimation. The reviewed literature pointed to three influencers of the risk, but we found no literature on the combined effects of the three influencers or their predictive power. To fill this gap, this study modeled risk estimation predicated on the combined effect of the three predictors using the Kenyan situation, and validated the model by conducting an actual re-identification quasi-experiment. The adversary's analytical competence, the distinguishing power of the anonymized datasets, and the linkage mapping of the identified datasets are presented as the predictors of the risk estimation, with manifest variables presented for each predictor. The presented model extends previous models and is capable of producing a realistic risk estimation.
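The combined-effect idea can be sketched as a logistic combination of the three predictors, each scaled to [0, 1]. The weights and bias below are placeholders for illustration only, not the coefficients fitted or validated in the study:

```python
import math

def disclosure_risk(competence, distinguishing_power, linkage_mapping,
                    weights=(2.0, 2.0, 2.0), bias=-4.0):
    """Hypothetical risk model: a weighted logistic combination of the
    adversary's analytical competence, the anonymized dataset's
    distinguishing power, and the linkage mapping of identified datasets.
    Each input is assumed scaled to [0, 1]; output is a probability."""
    z = bias + sum(w * x for w, x in
                   zip(weights, (competence, distinguishing_power, linkage_mapping)))
    return 1.0 / (1.0 + math.exp(-z))
```

The logistic form guarantees a probability-valued estimate that rises monotonically as any one predictor grows, which matches the intuition that a more competent adversary, a more distinguishable dataset, or a richer linkage mapping each increase risk.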

    "Whose data is it anyway?" The implications of putting small area-level health and social data online

    Planetary exospheres are poorly known in their outer parts, since the neutral densities are low compared with instruments' detection capabilities. Exospheric models are thus often the main source of information at such high altitudes. We present a new way to account analytically for the additional effect of radiation pressure on planetary exospheres. In a series of papers, we use a Hamiltonian approach to describe the effect of radiation pressure on dynamical trajectories, density profiles, and escaping thermal flux. Our work is a generalization of the study by Bishop and Chamberlain (1989). In this second part of our work, we present the density profiles of atomic hydrogen in planetary exospheres subject to radiation pressure. We first provide the altitude profiles of ballistic particles (the dominant exospheric population in most cases), which exhibit strong asymmetries that explain the known geotail phenomenon at Earth. Radiation pressure strongly enhances the densities compared with the pure-gravity case (i.e., the Chamberlain profiles), in particular at noon and midnight. We finally show the existence of an exopause that appears naturally as the external limit for bounded particles, above which all particles are escaping.

    A New Method for Protecting Interrelated Time Series with Bayesian Prior Distributions and Synthetic Data

    Organizations disseminate statistical summaries of administrative data via the Web for unrestricted public use, balancing the trade-off between confidentiality protection and inference quality. Recent developments in disclosure avoidance techniques include the incorporation of synthetic data, which capture the essential features of the underlying data by releasing altered data generated from a posterior predictive distribution. The United States Census Bureau collects millions of interrelated time-series microdata that are hierarchical and contain many zeros and suppressions. Rule-based disclosure avoidance techniques often require the suppression of count data for small magnitudes and the modification of data based on a small number of entities. Motivated by this problem, we use zero-inflated extensions of Bayesian Generalized Linear Mixed Models (BGLMM) with privacy-preserving prior distributions to develop methods for protecting and releasing synthetic data from time series about thousands of small groups of entities, without suppression based on the magnitudes or number of entities. We find that as the prior distributions of the variance components in the BGLMM become more precise toward zero, confidentiality protection increases and inference quality deteriorates. We evaluate our methodology using a strict privacy measure, empirical differential privacy, and a newly defined risk measure, Probability of Range Identification (PoRI), which directly measures attribute disclosure risk. We illustrate our results with the U.S. Census Bureau's Quarterly Workforce Indicators.
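The zero-inflated count structure at the heart of the model can be illustrated with a minimal zero-inflated Poisson sampler. This sketches only the likelihood's sampling step, not the paper's full BGLMM with privacy-preserving priors:

```python
import math
import random

def zip_sample(pi_zero, lam, rng):
    """Draw one count from a zero-inflated Poisson: with probability
    pi_zero emit a structural zero, otherwise a Poisson(lam) count."""
    if rng.random() < pi_zero:
        return 0
    # Knuth's product-of-uniforms Poisson sampler.
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(42)
# A synthetic count series for one small group of entities: roughly
# 40% structural zeros plus whatever zeros Poisson(3) produces.
synthetic_counts = [zip_sample(0.4, 3.0, rng) for _ in range(1000)]
```

Releasing model-generated counts like these, rather than suppressed cells, is what lets small-magnitude groups appear in the published series at all; the privacy guarantee then rests on the priors and the evaluation measures, which this sketch omits.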
