
    A Bayesian Approach to Graphical Record Linkage and Deduplication

    © 2016 American Statistical Association. We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture–recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previously proposed record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature. Supplementary materials for this article are available online.
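
    A small, purely hypothetical sketch of the bipartite representation may help: each record carries a label pointing to a latent individual and records are never linked directly to one another, so the number of distinct labels in use is the number of unique observed people, a quantity that feeds naturally into capture–recapture style post-processing. The record and label names below are invented for illustration; this is not the authors' implementation.

        # Hypothetical posterior draws of the record-to-latent assignment:
        # one dict per MCMC sample, mapping record id -> latent individual id.
        samples = [
            {"file1_rec1": 0, "file1_rec2": 1, "file2_rec1": 0, "file2_rec2": 2},
            {"file1_rec1": 0, "file1_rec2": 1, "file2_rec1": 0, "file2_rec2": 1},
            {"file1_rec1": 0, "file1_rec2": 1, "file2_rec1": 3, "file2_rec2": 2},
        ]

        # Records corefer exactly when they share a latent label, so the number of
        # unique observed individuals in a draw is the number of distinct labels.
        unique_counts = [len(set(s.values())) for s in samples]
        print(unique_counts, sum(unique_counts) / len(unique_counts))  # [3, 2, 4] 3.0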

    SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication

    We propose a novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation is to represent the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible new representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate k-way posterior probabilities of matches across records, and propagate the uncertainty of record linkage into later analyses. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previously proposed methods of record linkage, despite the high-dimensional parameter space. We assess our results on real and simulated data.
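
    A minimal sketch (hypothetical, not the SMERED code) of how the k-way posterior match probabilities can be read off posterior samples of the linkage structure: the probability that a set of records all refer to the same latent individual is simply the fraction of MCMC draws in which their latent labels coincide.

        # Hypothetical posterior draws of record -> latent-individual labels.
        samples = [
            {"A1": 0, "A2": 0, "B1": 0},
            {"A1": 0, "A2": 2, "B1": 0},
            {"A1": 1, "A2": 1, "B1": 1},
        ]

        def k_way_match_probability(samples, records):
            """Fraction of draws in which all listed records share one label."""
            hits = sum(len({s[r] for r in records}) == 1 for s in samples)
            return hits / len(samples)

        print(k_way_match_probability(samples, ["A1", "B1"]))        # pairwise: 1.0
        print(k_way_match_probability(samples, ["A1", "A2", "B1"]))  # 3-way: 0.67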

    Structure in the nucleus of NGC 1068 at 10 microns

    New 8 to 13 micron array camera images of the central kiloparsec of the Seyfert 2 galaxy NGC 1068 resolve structure that is similar to that observed at visible and radio wavelengths. The images reveal an infrared source which is extended and asymmetric, with its long axis oriented at P.A. 33 deg. Maps of the spatial distribution of 8 to 13 micron color temperature and warm dust opacity are derived from the multiwavelength infrared images. The results suggest that there exist two pointlike luminosity sources in the central regions of NGC 1068, with the brighter source at the nucleus and the fainter one some 100 pc to the northeast. This geometry strengthens the possibility that the 10 micron emission observed from grains in the nucleus is powered by a nonthermal source. In the context of earlier visible and radio studies, these results considerably strengthen the case for jet-induced star formation in NGC 1068.
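
    One generic way to derive such a color temperature map (a sketch under simplifying assumptions, not the authors' pipeline): if both bands see blackbody-like emission with the same emissivity and beam, the per-pixel flux ratio fixes the temperature through the ratio of Planck functions, which can be inverted numerically. The 8.0 and 13.0 micron wavelengths and the temperature bracket below are illustrative choices.

        import numpy as np
        from scipy.optimize import brentq

        H, C, KB = 6.626e-34, 2.998e8, 1.381e-23  # SI: Planck, light speed, Boltzmann

        def planck_lambda(wavelength_m, temp_k):
            """Planck spectral radiance B_lambda(T) in W m^-3 sr^-1."""
            return (2.0 * H * C**2 / wavelength_m**5) / np.expm1(H * C / (wavelength_m * KB * temp_k))

        def color_temperature(ratio, lam1=8.0e-6, lam2=13.0e-6):
            """Solve B(lam1, T) / B(lam2, T) = observed flux ratio for T (kelvin)."""
            return brentq(lambda t: planck_lambda(lam1, t) / planck_lambda(lam2, t) - ratio, 20.0, 2000.0)

        # e.g. a pixel twice as bright (per unit wavelength) at 8 um as at 13 um
        print(color_temperature(2.0))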

    The 8.3 and 12.4 micron imaging of the Galactic Center source complex with the Goddard infrared array camera

    A 30 x 30 arcsec field at the Galactic Center (1.5 x 1.5 parsec) was mapped at 8.3 microns and 12.4 microns with high spatial resolution and accurate relative astrometry, using the 16 x 16 Si:Bi accumulation mode charge injection device Goddard infrared array camera. The design and performance of the array camera detector electronics system and image data processing techniques are discussed. Color temperature and dust opacity distributions derived from the spatially accurate images indicate that the compact infrared sources and the large scale ridge structure are bounded by warmer, more diffuse material. None of the objects appear to be heated appreciably by internal luminosity sources. These results are consistent with the model proposing that the complex is heated externally by a strong luminosity source at the Galactic Center, which dominates the energetics of the inner few parsecs of the galaxy.
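
    For the opacity maps, a standard single-slab radiative-transfer relation (assumed here for illustration, not quoted from the abstract) connects the observed surface brightness in a band to the color temperature derived from the two bands:

        I_\lambda = B_\lambda(T_c)\,\bigl(1 - e^{-\tau_\lambda}\bigr)
        \quad\Longrightarrow\quad
        \tau_\lambda = -\ln\!\bigl(1 - I_\lambda / B_\lambda(T_c)\bigr),

    where B_\lambda is the Planck function and T_c the two-band color temperature; in the optically thin limit this reduces to \tau_\lambda \approx I_\lambda / B_\lambda(T_c).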

    Differentially Private Model Selection with Penalized and Constrained Likelihood

    In statistical disclosure control, the goal of data analysis is twofold: the released information must provide accurate and useful statistics about the underlying population of interest, while minimizing the potential for an individual record to be identified. In recent years, the notion of differential privacy has received much attention in theoretical computer science, machine learning, and statistics. It provides a rigorous and strong notion of protection for individuals' sensitive information. A fundamental question is how to incorporate differential privacy into traditional statistical inference procedures. In this paper we study model selection in multivariate linear regression under the constraint of differential privacy. We show that model selection procedures based on penalized least squares or likelihood can be made differentially private by a combination of regularization and randomization, and propose two algorithms to do so. We show that our private procedures are consistent under essentially the same conditions as the corresponding non-private procedures. We also find that under differential privacy, the procedure becomes more sensitive to the tuning parameters. We illustrate and evaluate our method using simulation studies and two real data examples.
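
    One generic illustration of how regularization and randomization can be combined (a sketch using the exponential mechanism, not necessarily either of the paper's two algorithms): score each candidate submodel with a penalized criterion and sample a model with probability decaying exponentially in that score, scaled by the privacy budget epsilon and by the criterion's sensitivity to changing one record. Bounding that sensitivity is the delicate step and is left as a placeholder assumption below.

        import numpy as np

        def dp_model_select(scores, epsilon, sensitivity, rng=None):
            """Exponential mechanism over candidate models.

            scores      : dict mapping model id -> penalized criterion (lower is better)
            sensitivity : assumed bound on how much one record can change the criterion
            """
            rng = np.random.default_rng() if rng is None else rng
            models = list(scores)
            s = np.array([scores[m] for m in models], dtype=float)
            weights = np.exp(-epsilon * (s - s.min()) / (2.0 * sensitivity))
            return models[rng.choice(len(models), p=weights / weights.sum())]

        # Hypothetical penalized-likelihood (BIC-style) scores for three submodels.
        print(dp_model_select({"x1": 210.3, "x1+x2": 195.7, "x1+x2+x3": 198.1},
                              epsilon=1.0, sensitivity=5.0))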

    Sharing Social Network Data: Differentially Private Estimation of Exponential-Family Random Graph Models

    Motivated by a real-life problem of sharing social network data that contain sensitive personal information, we propose a novel approach to release and analyze synthetic graphs in order to protect privacy of individual relationships captured by the social network while maintaining the validity of statistical results. A case study using a version of the Enron e-mail corpus dataset demonstrates the application and usefulness of the proposed techniques in solving the challenging problem of maintaining privacy and supporting open access to network data to ensure reproducibility of existing studies and discovering new scientific insights that can be obtained by analyzing such data. We use a simple yet effective randomized response mechanism to generate synthetic networks under ε-edge differential privacy, and then use likelihood-based inference for missing data and Markov chain Monte Carlo techniques to fit exponential-family random graph models to the generated synthetic networks.
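
    A standard instantiation of such a randomized response release (the paper's exact flip probabilities may differ; the function and variable names here are illustrative): each dyad's edge indicator is kept with probability e^eps / (1 + e^eps) and flipped with probability 1 / (1 + e^eps), independently across dyads, so the two possible outputs for any single dyad have probability ratio e^eps, which is what epsilon-edge differential privacy requires.

        import numpy as np

        def randomized_response_graph(adj, epsilon, rng=None):
            """Release a synthetic undirected graph by flipping each upper-triangular
            edge indicator independently with probability 1 / (1 + exp(epsilon))."""
            rng = np.random.default_rng() if rng is None else rng
            p_flip = 1.0 / (1.0 + np.exp(epsilon))
            out = adj.copy()
            iu = np.triu_indices(adj.shape[0], k=1)
            flips = rng.random(len(iu[0])) < p_flip
            out[iu] = np.where(flips, 1 - adj[iu], adj[iu])
            out[(iu[1], iu[0])] = out[iu]  # mirror to keep the graph undirected
            return out

        A = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]])
        print(randomized_response_graph(A, epsilon=1.0))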

    Bayesian Exponential Random Graph Models with Nodal Random Effects

    We extend the well-known and widely used Exponential Random Graph Model (ERGM) by including nodal random effects to compensate for heterogeneity in the nodes of a network. The Bayesian framework for ERGMs proposed by Caimo and Friel (2011) forms the basis of our modelling algorithm. A central question in network modelling is model selection, and following the Bayesian paradigm we focus on estimating Bayes factors. To do so we develop an approximate but feasible calculation of the Bayes factor which allows one to pursue model selection. Two data examples and a small simulation study illustrate our mixed model approach and the corresponding model selection.
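
    One common way to write an ERGM extended with nodal random effects (an illustration consistent with this modelling approach, not a quotation from the paper) is

        P(Y = y \mid \theta, \phi) = \frac{\exp\bigl(\theta^\top s(y) + \sum_{i<j} (\phi_i + \phi_j)\, y_{ij}\bigr)}{c(\theta, \phi)},
        \qquad \phi_i \sim N(\mu_\phi, \sigma_\phi^2),

    where s(y) collects the network statistics, the node-specific effects \phi_i absorb degree heterogeneity, and the intractable normalizing constant c(\theta, \phi) is what makes Bayes factor estimation, and hence model selection, nontrivial.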

    R. A. Fisher, design theory, and the Indian connection

    Design Theory, a branch of mathematics, was born out of the experimental statistics research of the population geneticist R. A. Fisher and of Indian mathematical statisticians in the 1930s. The field combines elements of combinatorics, finite projective geometries, Latin squares, and a variety of further mathematical structures, brought together in surprising ways. This essay will present these structures and ideas as well as how the field came together, in itself an interesting story.