55,120 research outputs found

    A Bayesian Approach to Graphical Record Linkage and De-duplication

    Full text link
    We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture-recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previously record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature.Comment: 39 pages, 8 figures, 8 tables. Longer version of arXiv:1403.0211, In press, Journal of the American Statistical Association: Theory and Methods (2015

    Bayesian outlier detection in Capital Asset Pricing Model

    Full text link
    We propose a novel Bayesian optimisation procedure for outlier detection in the Capital Asset Pricing Model. We use a parametric product partition model to robustly estimate the systematic risk of an asset. We assume that the returns follow independent normal distributions and we impose a partition structure on the parameters of interest. The partition structure imposed on the parameters induces a corresponding clustering of the returns. We identify via an optimisation procedure the partition that best separates standard observations from the atypical ones. The methodology is illustrated with reference to a real data set, for which we also provide a microeconomic interpretation of the detected outliers

    Support vector machine for functional data classification

    Get PDF
    In many applications, input data are sampled functions taking their values in infinite dimensional spaces rather than standard vectors. This fact has complex consequences on data analysis algorithms that motivate modifications of them. In fact most of the traditional data analysis tools for regression, classification and clustering have been adapted to functional inputs under the general name of functional Data Analysis (FDA). In this paper, we investigate the use of Support Vector Machines (SVMs) for functional data analysis and we focus on the problem of curves discrimination. SVMs are large margin classifier tools based on implicit non linear mappings of the considered data into high dimensional spaces thanks to kernels. We show how to define simple kernels that take into account the unctional nature of the data and lead to consistent classification. Experiments conducted on real world data emphasize the benefit of taking into account some functional aspects of the problems.Comment: 13 page
    • …
    corecore