Generalized Bayesian Record Linkage and Regression with Exact Error Propagation
Record linkage (de-duplication or entity resolution) is the process of
merging noisy databases to remove duplicate entities. Once duplicates have
been removed, the downstream task is any inferential, predictive, or other
post-linkage analysis of the linked data. One goal of the downstream task is
to obtain a larger reference data set that supports more accurate statistical
analyses. In addition, inherent record linkage uncertainty is passed on to
the downstream task. Motivated by these considerations, we propose a
generalized Bayesian record linkage method and consider
multiple regression analysis as the downstream task. Records are linked via a
random partition model, which allows for a wide class to be considered. In
addition, we jointly model the record linkage and downstream task, which allows
one to account for the record linkage uncertainty exactly. Moreover, the
joint model induces a feedback mechanism that propagates information from the
proposed Bayesian record linkage model into the downstream task. This
feedback effect is essential for eliminating potential biases that can
otherwise jeopardize the downstream analysis. We apply our methodology to
multiple linear regression, and
illustrate empirically that the "feedback effect" is able to improve the
performance of record linkage.
Comment: 18 pages, 5 figures
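As a rough sketch of how linkage uncertainty can propagate into a regression, the toy code below averages ordinary least squares fits over hypothetical posterior draws of the linkage partition. This is an illustrative two-stage approximation, not the authors' joint model: the data, the partitions, and the helper `ols_on_partition` are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 6 records, some of which duplicate the same latent entity.
X = rng.normal(size=(6, 2))
beta_true = np.array([1.5, -0.7])
y = X @ beta_true + rng.normal(scale=0.1, size=6)

# Hypothetical posterior samples of the linkage partition: each inner list
# groups record indices believed to refer to the same entity.
partitions = [
    [[0, 1], [2], [3, 4], [5]],
    [[0], [1], [2], [3, 4], [5]],
]

def ols_on_partition(part):
    # Collapse each entity's duplicate records by averaging, then fit OLS.
    Xe = np.array([X[idx].mean(axis=0) for idx in part])
    ye = np.array([y[idx].mean() for idx in part])
    beta, *_ = np.linalg.lstsq(Xe, ye, rcond=None)
    return beta

# Propagate linkage uncertainty: one regression fit per posterior partition,
# so the pooled coefficient draws reflect uncertainty about the linkage.
draws = np.array([ols_on_partition(p) for p in partitions])
post_mean = draws.mean(axis=0)
```

Pooling the per-partition fits is what keeps linkage error from being silently discarded before the regression stage.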
Probabilistic Clustering of Time-Evolving Distance Data
We present a novel probabilistic clustering model for objects that are
represented via pairwise distances and observed at different time points. The
proposed method utilizes the information given by adjacent time points to find
the underlying cluster structure and obtain a smooth cluster evolution. This
approach allows the number of objects and clusters to differ at every time
point, and no identification of the objects across time points is needed.
Further, the model does not require the number of clusters to be specified in
advance -- they are instead determined automatically using a Dirichlet process
prior. We validate our model on synthetic data showing that the proposed method
is more accurate than state-of-the-art clustering methods. Finally, we use our
dynamic clustering model to analyze and illustrate the evolution of brain
cancer patients over time.
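The Dirichlet process prior mentioned above determines the number of clusters automatically; one way to see this is through its Chinese restaurant process representation. The sketch below is illustrative only and omits the paper's coupling of adjacent time points:

```python
import numpy as np

def crp_sample(n, alpha, rng):
    """Draw one partition of n objects from a Chinese restaurant process
    prior with concentration alpha; the number of clusters is random."""
    assignments = [0]   # first object opens cluster 0
    counts = [1]        # counts[k] = current size of cluster k
    for _ in range(1, n):
        # Join cluster k with probability proportional to its size,
        # or open a new cluster with probability proportional to alpha.
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = int(rng.choice(len(probs), p=probs))
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

rng = np.random.default_rng(1)
labels = crp_sample(50, alpha=2.0, rng=rng)
num_clusters = len(set(labels))
```

Larger `alpha` favors more clusters; in a full model this prior would be combined with a likelihood on the pairwise distances.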
Analysis of paediatric visual acuity using Bayesian copula models with sinh-arcsinh marginal densities
We analyse paediatric ophthalmic data from a large sample of children aged between 3 and 8 years. We modify the Bayesian additive conditional bivariate copula regression model of Klein and Kneib [1] by using sinh-arcsinh marginal densities with location, scale and shape parameters that depend smoothly on a covariate. We perform Bayesian inference about the unknown quantities of our model using a specially tailored Markov chain Monte Carlo algorithm. We gain new insights into the processes that determine changes in visual acuity with age, including the nature of joint changes in both eyes as modelled with the age-related copula dependence parameter. We analyse posterior predictive distributions to identify children with unusual sight characteristics, distinguishing those who are bivariate, but not univariate, outliers. In this way we provide an innovative tool that enables clinicians to identify children with unusual sight who may otherwise be missed. We compare our simultaneous Bayesian method with the two-step frequentist generalized additive modelling approach of Vatter and Chavez-Demoulin [2].
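As a rough illustration of the sinh-arcsinh marginal family, the sampler below assumes the Jones and Pewsey form of the transform; the parameter names are mine, not the paper's, and the shape parameters are held fixed rather than depending smoothly on age as in the model above.

```python
import numpy as np

def rsinh_arcsinh(n, loc, scale, skew, tail, rng):
    """Sample from a sinh-arcsinh distribution by transforming standard
    normal draws: skew controls asymmetry, tail controls tailweight.
    skew=0 and tail=1 recover the normal(loc, scale) distribution."""
    z = rng.standard_normal(n)
    return loc + scale * np.sinh((np.arcsinh(z) + skew) / tail)

rng = np.random.default_rng(2)
# Hypothetical acuity-like draws with mild positive skew and heavier tails.
x = rsinh_arcsinh(10_000, loc=0.3, scale=0.1, skew=0.5, tail=0.8, rng=rng)
```

In the copula regression setting, a density of this form would serve as each eye's marginal, with the dependence between eyes captured separately by the copula.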
Forest modelling: the gamma shape mixture model and simulation of tree diameter distributions
Restricting exchangeable nonparametric distributions
Distributions over exchangeable matrices with infinitely many columns, such as the Indian buffet process, are useful in constructing nonparametric latent variable models. However, the distribution implied by such models over the number of features exhibited by each data point may be poorly suited for many modeling tasks. In this paper, we propose a class of exchangeable nonparametric priors obtained by restricting the domain of existing models. Such models allow us to specify the distribution over the number of features per data point, and can achieve better performance on data sets where the number of features is not well-modeled by the original distribution.
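As a crude illustration of restricting the domain of an exchangeable prior, the sketch below draws binary feature matrices from a standard Indian buffet process and keeps only draws in which every data point exhibits at most a fixed number of features. The rejection step is just a stand-in for the paper's restricted priors, and the cap `max_per_row` is an invented example parameter.

```python
import numpy as np

def ibp_sample(n, alpha, rng):
    """One binary feature matrix Z (n rows, random column count) from the
    Indian buffet process with concentration alpha."""
    rows, counts = [], []  # counts[k] = number of earlier rows with feature k
    for i in range(1, n + 1):
        # Take each existing dish k with probability counts[k]/i,
        # then sample Poisson(alpha/i) brand-new dishes.
        row = [1 if rng.random() < m / i else 0 for m in counts]
        new = int(rng.poisson(alpha / i))
        row.extend([1] * new)
        counts = [m + z for m, z in zip(counts, row[:len(counts)])]
        counts.extend([1] * new)
        rows.append(row)
    Z = np.zeros((n, len(counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

def restricted_ibp_sample(n, alpha, max_per_row, rng, tries=10_000):
    """Restrict the IBP's domain by rejection: keep only draws in which
    every data point exhibits at most max_per_row features."""
    for _ in range(tries):
        Z = ibp_sample(n, alpha, rng)
        if Z.shape[1] == 0 or Z.sum(axis=1).max() <= max_per_row:
            return Z
    raise RuntimeError("no accepted draw within the try budget")

rng = np.random.default_rng(3)
Z = restricted_ibp_sample(5, alpha=1.0, max_per_row=3, rng=rng)
```

Rejection is wasteful when the restriction is severe, which is one reason a directly restricted prior is attractive; the point here is only that the restricted model controls the per-row feature counts that the plain IBP leaves free.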