19 research outputs found
Generalized Bayesian Record Linkage and Regression with Exact Error Propagation
Record linkage (de-duplication or entity resolution) is the process of
merging noisy databases to remove duplicate entities. While record linkage
removes duplicate entities from such databases, the downstream task is any
inferential, predictive, or post-linkage task on the linked data. One goal of
the downstream task is obtaining a larger reference data set, allowing one to
perform more accurate statistical analyses. In addition, there is inherent
record linkage uncertainty passed to the downstream task. Motivated by the
above, we propose a generalized Bayesian record linkage method and consider
multiple regression analysis as the downstream task. Records are linked via a
random partition model, which allows for a wide class to be considered. In
addition, we jointly model the record linkage and downstream task, which allows
one to account for the record linkage uncertainty exactly. Moreover, one is
able to generate a feedback propagation mechanism of the information from the
proposed Bayesian record linkage model into the downstream task. This feedback
effect is essential to eliminate potential biases that can jeopardize resulting
downstream task. We apply our methodology to multiple linear regression, and
illustrate empirically that the "feedback effect" is able to improve the
performance of record linkage.Comment: 18 pages, 5 figure