224 research outputs found
A Bayesian Approach to Graphical Record Linkage and De-duplication
We propose an unsupervised approach for linking records across arbitrarily
many files, while simultaneously detecting duplicate records within files. Our
key innovation involves the representation of the pattern of links between
records as a bipartite graph, in which records are directly linked to latent
true individuals, and only indirectly linked to other records. This flexible
representation of the linkage structure naturally allows us to estimate the
attributes of the unique observable people in the population, calculate
transitive linkage probabilities across records (and represent this visually),
and propagate the uncertainty of record linkage into later analyses. Our method
makes it particularly easy to integrate record linkage with post-processing
procedures such as logistic regression, capture-recapture, etc. Our linkage
structure lends itself to an efficient, linear-time, hybrid Markov chain Monte
Carlo algorithm, which overcomes many obstacles encountered by previous
record linkage approaches, despite the high-dimensional parameter space. We
illustrate our method using longitudinal data from the National Long Term Care
Survey and with data from the Italian Survey on Household and Wealth, where we
assess the accuracy of our method and show it to be better in terms of error
rates and empirical scalability than other approaches in the literature.
Comment: 39 pages, 8 figures, 8 tables. Longer version of arXiv:1403.0211. In
press, Journal of the American Statistical Association: Theory and Methods
(2015).
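The bipartite representation described above can be illustrated with a minimal sketch (hypothetical code, not the authors' implementation): each record points to a latent entity label, two records co-refer exactly when they share a label, and posterior samples of the label vector yield transitive linkage probabilities across records.

```python
# Minimal sketch of the bipartite linkage structure: records are linked
# only to latent entities, so record-to-record links arise transitively.
from itertools import combinations

def coref_probabilities(label_samples):
    """label_samples: list of label vectors (one per posterior draw),
    where label_samples[s][i] is the latent entity label of record i."""
    n = len(label_samples[0])
    counts = {pair: 0 for pair in combinations(range(n), 2)}
    for labels in label_samples:
        for i, j in counts:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    num_draws = len(label_samples)
    return {pair: c / num_draws for pair, c in counts.items()}

# Three hypothetical posterior draws over four records: records 0 and 1
# always share an entity; record 2 joins them in one draw; record 3 never does.
draws = [[0, 0, 1, 2], [5, 5, 5, 7], [1, 1, 2, 3]]
probs = coref_probabilities(draws)
print(probs[(0, 1)])  # 1.0
print(probs[(0, 2)])  # ~0.33
print(probs[(0, 3)])  # 0.0
```

Because co-reference is defined through shared latent labels, the estimated link probabilities are automatically transitive within each posterior draw, which is the property the abstract highlights.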
Fast Bayesian Record Linkage for Streaming Data Contexts
Record linkage is the task of combining records from multiple files which
refer to overlapping sets of entities when there is no unique identifying
field. In streaming record linkage, files arrive sequentially in time and
estimates of links are updated after the arrival of each file. This problem
arises in settings such as longitudinal surveys, electronic health records, and
online events databases, among others. The challenge in streaming record
linkage is to efficiently update parameter estimates as new data arrives. We
approach the problem from a Bayesian perspective with estimates in the form of
posterior samples of parameters and present methods for updating link estimates
after the arrival of a new file that are faster than fitting a joint model with
each new data file. In this paper, we generalize a two-file Bayesian
Fellegi-Sunter model to the multi-file case and propose two methods to perform
streaming updates. We examine the effect of prior distribution on the resulting
linkage accuracy as well as the computational trade-offs between the methods
when compared to a Gibbs sampler through simulated and real-world survey panel
data. We achieve near-equivalent posterior inference at a small fraction of the
compute time.
Comment: 43 pages, 6 figures, 4 tables. (Main: 32 pages, 4 figures, 3 tables.
Supplement: 11 pages, 2 figures, 1 table.) Submitted to Journal of
Computational and Graphical Statistics.
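The Fellegi-Sunter model mentioned above scores candidate record pairs by comparing fields. A minimal sketch of that scoring idea (not the paper's streaming generalization; the m and u probabilities below are hypothetical values): each field contributes log(m/u) on agreement and log((1-m)/(1-u)) on disagreement, and the summed weight separates likely matches from non-matches.

```python
import math

def match_weight(agreements, m, u):
    """agreements: booleans, one per compared field.
    m[k]: P(field k agrees | records are a true match).
    u[k]: P(field k agrees | records are a non-match)."""
    w = 0.0
    for agree, m_k, u_k in zip(agreements, m, u):
        if agree:
            w += math.log(m_k / u_k)
        else:
            w += math.log((1 - m_k) / (1 - u_k))
    return w

# Assumed per-field probabilities for illustration only.
m = [0.95, 0.9, 0.8]
u = [0.05, 0.1, 0.2]
all_agree = match_weight([True, True, True], m, u)     # positive: likely match
all_differ = match_weight([False, False, False], m, u) # negative: likely non-match
print(all_agree > 0, all_differ < 0)
```

In the streaming setting described in the abstract, the task is to update posterior beliefs about such parameters and the links themselves as each new file arrives, rather than refitting a joint model from scratch.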
Epistemological Databases for Probabilistic Knowledge Base Construction
Knowledge bases (KB) facilitate real-world decision making by providing access to structured relational information that enables pattern discovery and semantic queries. Although there is a large amount of data available for populating a KB, the data must first be gathered and assembled. Traditionally, this integration is performed automatically by storing the output of an information extraction pipeline directly into a database as if this prediction were the "truth." However, the resulting KB is often not reliable because (a) errors accumulate in the integration pipeline, and (b) they persist in the KB even after new information arrives that could rectify them. We envision a paradigm shift in KB construction for addressing these concerns that we term an "epistemological" database. In epistemological databases, the existence and properties of entities are not directly input into the DB; they are instead determined by inference on raw evidence input into the DB. This shift in thinking is important because it allows inference to revisit previous conclusions and retroactively correct errors as new evidence arrives. Evidence is abundant and in steady supply from web spiders, semantic web ontologies, external databases, and even groups of enthusiastic human editors. As this evidence continues to accumulate and inference continues to run in the background, the quality of the knowledge base continues to improve. In this dissertation we develop the machine learning components necessary to achieve epistemological knowledge base construction at scale, with key contributions in modeling, inference, and learning.
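The core design shift described above can be made concrete with a toy sketch (hypothetical code, not the dissertation's system): raw evidence is stored as-is, and entity attributes are derived by inference over all accumulated evidence, so later evidence can retroactively overturn an earlier erroneous conclusion.

```python
from collections import Counter, defaultdict

class EpistemologicalDB:
    """Toy epistemological store: queries run inference over raw evidence
    instead of returning a value that was written once and frozen."""

    def __init__(self):
        self.evidence = defaultdict(list)  # (entity, attr) -> observed values

    def add_evidence(self, entity, attr, value):
        self.evidence[(entity, attr)].append(value)

    def query(self, entity, attr):
        # Inference here is a simple majority vote over raw evidence;
        # a real system would run probabilistic inference instead.
        values = self.evidence[(entity, attr)]
        return Counter(values).most_common(1)[0][0] if values else None

db = EpistemologicalDB()
db.add_evidence("e1", "city", "Boston")   # an early extraction error
print(db.query("e1", "city"))             # Boston (current best belief)
db.add_evidence("e1", "city", "Amherst")  # corroborating evidence arrives
db.add_evidence("e1", "city", "Amherst")
print(db.query("e1", "city"))             # Amherst (belief revised)
```

The contrast with a traditional pipeline is that nothing here overwrites the stored evidence; only the derived conclusion changes as inference re-runs over the growing evidence set.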