42 research outputs found

    Generalized Bayesian Record Linkage and Regression with Exact Error Propagation

    Record linkage (de-duplication or entity resolution) is the process of merging noisy databases to remove duplicate entities. While record linkage removes duplicate entities from such databases, the downstream task is any inferential, predictive, or post-linkage task on the linked data. One goal of the downstream task is obtaining a larger reference data set, allowing one to perform more accurate statistical analyses. In addition, there is inherent record linkage uncertainty passed to the downstream task. Motivated by the above, we propose a generalized Bayesian record linkage method and consider multiple regression analysis as the downstream task. Records are linked via a random partition model, which allows for a wide class to be considered. In addition, we jointly model the record linkage and downstream task, which allows one to account for the record linkage uncertainty exactly. Moreover, one is able to generate a feedback propagation mechanism of the information from the proposed Bayesian record linkage model into the downstream task. This feedback effect is essential to eliminate potential biases that can jeopardize the resulting downstream task. We apply our methodology to multiple linear regression, and illustrate empirically that the "feedback effect" is able to improve the performance of record linkage. Comment: 18 pages, 5 figures

    Probabilistic Clustering of Time-Evolving Distance Data

    We present a novel probabilistic clustering model for objects that are represented via pairwise distances and observed at different time points. The proposed method utilizes the information given by adjacent time points to find the underlying cluster structure and obtain a smooth cluster evolution. This approach allows the number of objects and clusters to differ at every time point, and the objects do not need to be identified across time points. Further, the model does not require the number of clusters to be specified in advance; it is instead determined automatically using a Dirichlet process prior. We validate our model on synthetic data, showing that the proposed method is more accurate than state-of-the-art clustering methods. Finally, we use our dynamic clustering model to analyze and illustrate the evolution of brain cancer patients over time.
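
To illustrate how a Dirichlet process prior leaves the number of clusters open, here is a minimal sketch of the Chinese restaurant process, the partition distribution induced by a Dirichlet process. It is not the model from the abstract above (which works with pairwise distances over time); the function name `sample_crp_partition` and the concentration value `alpha=1.0` are assumptions chosen for illustration.

```python
import random


def sample_crp_partition(n_items, alpha, rng):
    """Draw cluster labels for n_items from a Chinese restaurant process.

    alpha > 0 is the concentration parameter (illustrative value below).
    """
    labels = []
    cluster_sizes = []
    for _ in range(n_items):
        # An item joins existing cluster k with weight size_k,
        # or opens a new cluster with weight alpha.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)  # new cluster
        else:
            cluster_sizes[k] += 1
        labels.append(k)
    return labels


rng = random.Random(0)
labels = sample_crp_partition(50, alpha=1.0, rng=rng)
print("number of clusters:", len(set(labels)))
```

The number of distinct labels is random and tends to grow slowly with the number of items, which is why no cluster count has to be fixed beforehand.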

    Analysis of paediatric visual acuity using Bayesian copula models with sinh-arcsinh marginal densities

    We analyse paediatric ophthalmic data from a large sample of children aged between 3 and 8 years. We modify the Bayesian additive conditional bivariate copula regression model of Klein and Kneib [1] by using sinh-arcsinh marginal densities with location, scale and shape parameters that depend smoothly on a covariate. We perform Bayesian inference about the unknown quantities of our model using a specially tailored Markov chain Monte Carlo algorithm. We gain new insights about the processes which determine transformations in visual acuity with respect to age, including the nature of joint changes in both eyes as modelled with the age-related copula dependence parameter. We analyse posterior predictive distributions to identify children with unusual sight characteristics, distinguishing those who are bivariate, but not univariate, outliers. In this way we provide an innovative tool that enables clinicians to identify children with unusual sight who may otherwise be missed. We compare our simultaneous Bayesian method with the two-step frequentist generalized additive modelling approach of Vatter and Chavez-Demoulin [2].
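
As a rough illustration of a sinh-arcsinh marginal density with location, scale and shape parameters, the sketch below uses the Jones and Pewsey parameterization, in which `epsilon` controls skewness and `delta` controls tail weight. Whether the paper uses exactly this parameterization is an assumption, and the function name and argument names are hypothetical. In the abstract's setting the parameters would vary smoothly with a covariate such as age; here they are plain constants.

```python
import math


def sinh_arcsinh_pdf(y, mu=0.0, sigma=1.0, epsilon=0.0, delta=1.0):
    """Sinh-arcsinh density (Jones-Pewsey form, assumed here).

    mu: location, sigma > 0: scale,
    epsilon: skewness, delta > 0: tail weight.
    """
    z = (y - mu) / sigma
    t = delta * math.asinh(z) - epsilon
    s = math.sinh(t)
    c = math.cosh(t)
    norm = sigma * math.sqrt(2.0 * math.pi * (1.0 + z * z))
    return delta * c / norm * math.exp(-0.5 * s * s)


# With epsilon = 0 and delta = 1 the density reduces to the standard normal.
print(sinh_arcsinh_pdf(0.3))
print(math.exp(-0.5 * 0.3 ** 2) / math.sqrt(2.0 * math.pi))
```

Setting `epsilon` away from zero skews the density, and `delta` below or above one gives heavier or lighter tails than the normal.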

    Restricting exchangeable nonparametric distributions

    Distributions over exchangeable matrices with infinitely many columns, such as the Indian buffet process, are useful in constructing nonparametric latent variable models. However, the distribution implied by such models over the number of features exhibited by each data point may be poorly suited for many modeling tasks. In this paper, we propose a class of exchangeable nonparametric priors obtained by restricting the domain of existing models. Such models allow us to specify the distribution over the number of features per data point, and can achieve better performance on data sets where the number of features is not well modeled by the original distribution.