1,043 research outputs found

    Multiple imputation for sharing precise geographies in public use data

    When releasing data to the public, data stewards are ethically and often legally obligated to protect the confidentiality of data subjects' identities and sensitive attributes. They also strive to release data that are informative for a wide range of secondary analyses. Achieving both objectives is particularly challenging when data stewards seek to release highly resolved geographical information. We present an approach for protecting the confidentiality of data with geographic identifiers based on multiple imputation. The basic idea is to convert geography to latitude and longitude, estimate a bivariate response model conditional on attributes, and simulate new latitude and longitude values from these models. We illustrate the proposed methods using data describing causes of death in Durham, North Carolina. In the context of the application, we present a straightforward tool for generating simulated geographies and attributes based on regression trees, and we present methods for assessing disclosure risks with such simulated data. Comment: Published at http://dx.doi.org/10.1214/11-AOAS506 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org
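
The convert, estimate, simulate idea in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract describes regression trees, whereas this sketch swaps in a much simpler model, a separate bivariate normal for (latitude, longitude) within each attribute group. The function name `synthesize_locations` and its parameters are hypothetical.

```python
import numpy as np

def synthesize_locations(coords, groups, m=5, seed=0):
    """Return m synthetic copies of (lat, lon), drawn per attribute group.

    Assumption: locations within a group follow a bivariate normal.
    This stands in for the richer conditional model the abstract describes.
    """
    rng = np.random.default_rng(seed)
    coords = np.asarray(coords, dtype=float)   # shape (n, 2): lat, lon
    groups = np.asarray(groups)                # shape (n,): attribute label
    synthetic = np.empty((m,) + coords.shape)  # shape (m, n, 2)
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        mean = coords[idx].mean(axis=0)
        # Sample covariance of the group; needs at least two records.
        cov = np.cov(coords[idx], rowvar=False)
        # One draw per record per synthetic dataset.
        synthetic[:, idx, :] = rng.multivariate_normal(
            mean, cov, size=(m, idx.size)
        )
    return synthetic
```

Each of the m slices along the first axis is one synthetic dataset, so releasing several of them follows the multiple-imputation pattern. The original attributes are kept and only the coordinates are replaced.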

    Balancing Access to Data And Privacy. A review of the issues and approaches for the future

    Access to sensitive microdata should be provided using remote access data enclaves. These enclaves should be built to facilitate the productive, high-quality usage of microdata. In other words, they should support a collaborative environment that facilitates the development and exchange of knowledge about data among data producers and consumers. The experience of the physical and life sciences has shown that it is possible to develop a research community and a knowledge infrastructure around both research questions and the different types of data necessary to answer policy questions. In sum, establishing a virtual organization approach would provide the research community with the ability to move away from individual, or artisan, science towards the more generally accepted community-based approach. Enclaves should include a number of features: metadata documentation capacity, so that knowledge about data can be shared; capacity to add data, so that the data infrastructure can be augmented; communication capacity, such as wikis, blogs and discussion groups, so that knowledge about the data can be deepened; and incentives for information sharing, so that a community of practice can be built. The opportunity to transform microdata-based research through such an organizational infrastructure could potentially be as far-reaching as the changes that have taken place in the biological and astronomical sciences. It is, however, an open research question how such an organization should be established: whether the approach should be centralized or decentralized. Similarly, it is an open research question as to the appropriate metrics of success, and the best incentives to put in place to achieve success. Keywords: Methodology for Collecting, Estimating, Organizing Microeconomic Data

    Data DNA: The Next Generation of Statistical Metadata

    Describes the components of a complete statistical metadata system and suggests ways to create and structure metadata for better access and understanding of data sets by diverse users.

    Releasing survey microdata with exact cluster locations and additional privacy safeguards

    Household survey programs around the world publish fine-granular georeferenced microdata to support research on the interdependence of human livelihoods and their surrounding environment. To safeguard the respondents’ privacy, micro-level survey data is usually (pseudo)-anonymized through deletion or perturbation procedures such as obfuscating the true location of data collection. This, however, poses a challenge to emerging approaches that augment survey data with auxiliary information on a local level. Here, we propose an alternative microdata dissemination strategy that leverages the utility of the original microdata with additional privacy safeguards through synthetically generated data using generative models. We back our proposal with experiments using data from the 2011 Costa Rican census and satellite-derived auxiliary information. Our strategy reduces the respondents’ re-identification risk for any number of disclosed attributes by 60–80% even under re-identification attempts

    A smoothing approach to data masking

    Individual-level data are often not publicly available due to confidentiality. Instead, masked data are released for public use. However, analyses performed using masked data may produce invalid statistical results such as biased parameter estimates or incorrect standard errors. In this paper, we propose a data masking method using spatial smoothing, and we investigate the bias of parameter estimates resulting from analyses using the masked data for Generalized Linear Models (GLMs). The method allows for varying both the form and the degree of masking by utilizing a smoothing weight function and a smoothness parameter. We show that data masking by using a smoothing weight function that accounts for the prior knowledge on the spatial pattern of exposure may lead to less biased parameter estimates when using the masked data for analyses. Under our method, first-order bias of the association between regressors and outcome when estimated using the masked data has a closed-form expression. We apply the method to the study of racial disparities in mortality rates using data on more than 4 million Medicare enrollees residing in 2095 zip codes in the Northeast region of the United States. We find that the bias of the estimated association between race and mortality rates when using the masked data is highly sensitive to both the form and the degree of masking.
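
Masking by spatial smoothing, with a weight function controlling the form of masking and a smoothness parameter controlling its degree, can be sketched as follows. This is only an illustration under stated assumptions: the Gaussian kernel, the row normalization, and the name `smooth_mask` are choices made here, not details taken from the paper.

```python
import numpy as np

def smooth_mask(values, coords, bandwidth):
    """Replace each value with a kernel-weighted average of all values.

    Assumption: a Gaussian weight function of Euclidean distance.
    `bandwidth` plays the role of the smoothness parameter.
    """
    values = np.asarray(values, dtype=float)   # shape (n,)
    coords = np.asarray(coords, dtype=float)   # shape (n, d)
    # Pairwise squared distances between all locations.
    diff = coords[:, None, :] - coords[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    # Gaussian weights, normalized so each row sums to one.
    weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ values
```

A very small bandwidth leaves the data essentially unmasked, while a very large one pushes every record towards the overall mean, which is one way to see why analyses on masked data can be sensitive to the degree of masking.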

    Predicting the need for aged care services at the small area level: the CAREMOD spatial microsimulation model

    Most industrialised societies face rapid population ageing over the next two decades, including sharp increases in the number of people aged 85 years and over. As a result, the supply of and demand for aged care services has assumed increasing policy prominence. The likely spatial distribution of the need for aged care services is critical for planners and policy makers. This article describes the development of a regional microsimulation model of the need for aged care in New South Wales, a state of Australia. It details the methods involved in reweighting the 1998 Survey of Disability, Ageing and Carers, a national level dataset, against the 2001 Census to produce synthetic small area estimates at the statistical local area level. Validation shows that survey variables not constrained in the weighting process can provide unreliable local estimates. A proposed solution to this problem is outlined, involving record cloning, value imputation and alignment. Indicative disability estimates arising from this process are then discussed. Keywords: Disability, ageing, spatial analysis, aged care, cloning, imputation, alignment, NATSEM

    Releasing multiply-imputed synthetic data generated in two stages to protect confidentiality

    "To protect the cofidentiality of survey respondents' identities and sensitive attributes, statistical agencies can release data in which cofidential values are replaced with multiple imputations. These are called synthetic data. We propose a two-stage approach to generating synthetic data that enables agencies to release different numbers of imputations for different variables. Generation in two stages can reduce computational burdens, decrease disclosure risk, and increase inferential accuracy relative to generation in one stage. We present methods for obtaining inferences from such data. We describe the application of two stage synthesis to creating a public use file for a German business database." (Author's abstract, IAB-Doku) ((en))IAB-Betriebspanel, Datenaufbereitung, Datenanonymisierung, Datenschutz, angewandte Statistik, statistische Methode, Arbeitsmarktforschung, Imputationsverfahren