13 research outputs found

    Estimating identification disclosure risk using mixed membership models

    Statistical agencies and other organizations that disseminate data are obligated to protect data subjects' confidentiality. For example, ill-intentioned individuals might link data subjects to records in other databases by matching on common characteristics (keys). Successful links are particularly problematic for data subjects with combinations of keys that are unique in the population. Hence, as part of their assessments of disclosure risks, many data stewards estimate the probabilities that sample uniques on sets of discrete keys are also population uniques on those keys. This is typically done using log-linear modeling on the keys. However, log-linear models can yield biased estimates of cell probabilities for sparse contingency tables with many zero counts, which often occur in databases with many keys. This bias can result in unreliable estimates of probabilities of uniqueness and, hence, misrepresentations of disclosure risks. We propose an alternative to log-linear models for datasets with sparse keys based on a Bayesian version of grade of membership (GoM) models. We present a Bayesian GoM model for multinomial variables and offer an MCMC algorithm for fitting the model. We evaluate the approach by treating data from a recent US Census Bureau public use microdata sample as a population, taking simple random samples from that population, and benchmarking estimated probabilities of uniqueness against population values. Compared to log-linear models, GoM models provide more accurate estimates of the total number of uniques in the samples. Additionally, they offer record-level predictions of uniqueness that dominate those based on log-linear models. This research was supported by grants from the National Institutes of Health (R21 AG032458-02) and National Science Foundation (SES-11-31897).

    Reality and risk: A refutation of S. Rendón's analysis of the Peruvian Truth and Reconciliation Commission's conflict mortality study

    We refute S. Rendón's recent criticism of the 2003 Peruvian Truth and Reconciliation Commission (TRC) conflict mortality study. We first show that his most important result, an alternative estimate of the mortality due to the Maoist guerrillas of Shining Path (Sendero Luminoso), is lower than existing observed data and is therefore impossible. We then analyze his statistical approach and find that it is affected by a subtle form of selection bias. We contrast his approach to the TRC's using tools from statistical decision theory, and determine that his method is inadequate for this problem, and that the TRC's approach is, at minimum, better. Without advocating for the TRC's original results, we conclude that Rendón's approach and methods are inferior to the TRC's original work.

    Replication Data for: "Reality and risk: a refutation of S. Rendon's analysis of the Peruvian Truth and Reconciliation Commission's mortality study"

    Files for replication of results from "Reality and risk: a refutation of S. Rendon's analysis of the Peruvian Truth and Reconciliation Commission's mortality study" by Daniel Manrique-Vallier and Patrick Ball.

    A Nonparametric, Multiple Imputation-Based Method for the Retrospective Integration of Data Sets

    Complex research questions often cannot be addressed adequately with a single data set. One sensible alternative to the high cost and effort associated with the creation of large new data sets is to combine existing data sets containing variables related to the constructs of interest. The goal of the present research was to develop a flexible, broadly applicable approach to the integration of disparate data sets that is based on nonparametric multiple imputation and the collection of data from a convenient, de novo calibration sample. We demonstrate proof of concept for the approach by integrating three existing data sets containing items related to the extent of problematic alcohol use and associations with deviant peers. We discuss both necessary conditions for the approach to work well and potential strengths and weaknesses of the method compared to other data set integration approaches.