
    Clustering with shallow trees

    We propose a new method for hierarchical clustering based on the optimisation of a cost function over trees of limited depth, and we derive a message-passing method that allows us to solve it efficiently. The method and algorithm can be interpreted as a natural interpolation between two well-known approaches, namely single linkage and the recently presented Affinity Propagation. We analyze three biological/medical structured datasets with this general scheme (human populations based on genetic information, proteins based on sequences, and verbal autopsies) and show that the interpolation technique provides new insight.
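    The paper's method interpolates between single linkage and Affinity Propagation; the sketch below shows only those two endpoint clusterings on a toy dataset with scikit-learn, not the message-passing interpolation itself. The dataset and parameters are illustrative assumptions.

```python
# Minimal sketch: the two endpoints the proposed method interpolates between,
# single linkage and Affinity Propagation (illustrative data, not the paper's datasets).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering, AffinityPropagation

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# Endpoint 1: single-linkage hierarchical clustering (deep, chain-like trees).
single = AgglomerativeClustering(n_clusters=4, linkage="single").fit(X)

# Endpoint 2: Affinity Propagation (depth-one trees: every point attaches to an exemplar).
ap = AffinityPropagation(random_state=0).fit(X)

print("single-linkage cluster sizes:", np.bincount(single.labels_))
print("affinity-propagation exemplars:", len(ap.cluster_centers_indices_))
```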

    Feature-to-feature regression for a two-step conditional independence test

    The algorithms for causal discovery, and more broadly for learning the structure of graphical models, require well-calibrated and consistent conditional independence (CI) tests. We revisit CI tests based on two-step procedures that involve regression with a subsequent (unconditional) independence test on the regression residuals (RESIT), and we investigate the assumptions under which these tests operate. In particular, we demonstrate that, when going beyond simple functional relationships with additive noise, such tests can lead to an inflated number of false discoveries. We study the relationship of these tests to those based on dependence measures in reproducing kernel Hilbert spaces (RKHS) and propose an extension of RESIT that uses RKHS-valued regression. The resulting test inherits the simple two-step testing procedure of RESIT while giving correct Type I error control and competitive power. When used as a component of the PC algorithm, the proposed test is more robust to the case where hidden variables induce a switching behaviour in the associations present in the data.
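    A minimal sketch of the standard two-step RESIT-style CI test the abstract builds on (regress X and Y on Z, then test the residuals for unconditional independence), not the paper's RKHS-valued-regression extension. Kernel choices, bandwidths, and regularisation are illustrative assumptions.

```python
# Two-step RESIT-style CI test sketch: kernel-ridge regressions of X on Z and Y on Z,
# followed by an HSIC permutation test on the residuals.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def rbf_gram(a, gamma=1.0):
    a = a.reshape(len(a), -1)
    sq = np.sum(a**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * a @ a.T
    return np.exp(-gamma * d2)

def hsic(x, y, gamma=1.0):
    # Biased HSIC statistic with Gaussian kernels (enough for a permutation test).
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(H @ rbf_gram(x, gamma) @ H @ rbf_gram(y, gamma)) / n**2

def resit_ci_test(x, y, z, n_perm=200):
    rng = np.random.default_rng(0)
    # Step 1: nonparametric regressions of X on Z and Y on Z; keep the residuals.
    rx = x - KernelRidge(kernel="rbf", alpha=1e-2).fit(z, x).predict(z)
    ry = y - KernelRidge(kernel="rbf", alpha=1e-2).fit(z, y).predict(z)
    # Step 2: HSIC permutation test on the residuals.
    stat = hsic(rx, ry)
    null = [hsic(rx, ry[rng.permutation(len(ry))]) for _ in range(n_perm)]
    return float(np.mean(np.array(null) >= stat))   # permutation p-value
```

    A small p-value indicates residual dependence and hence rejection of the conditional-independence null; as the abstract notes, this additive-noise-style construction is exactly where false discoveries can become inflated.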

    Bayesian kernel two-sample testing

    In modern data analysis, nonparametric measures of discrepancy between random variables are particularly important. The subject is well studied in the frequentist literature, while development in the Bayesian setting is limited and applications are often restricted to univariate cases. Here, we propose a Bayesian kernel two-sample testing procedure based on modelling the difference between kernel mean embeddings in the reproducing kernel Hilbert space, utilising the framework established by Flaxman et al. (2016). The use of kernel methods enables its application to random variables in generic domains beyond multivariate Euclidean spaces. The proposed procedure results in a posterior inference scheme that allows an automatic selection of the kernel parameters relevant to the problem at hand. In a series of synthetic experiments and two real data experiments (testing network heterogeneity from high-dimensional data and comparing six-membered monocyclic ring conformations), we illustrate the advantages of our approach.
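    A sketch of the frequentist building block the abstract refers to: the squared maximum mean discrepancy (MMD) between the kernel mean embeddings of two samples. The full Bayesian procedure models this embedding difference rather than just computing the statistic; the bandwidth below is an illustrative assumption (in practice a median heuristic or posterior selection would be used).

```python
# Biased squared-MMD estimate between two samples with a Gaussian kernel.
import numpy as np

def rbf(a, b, gamma):
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * d2)

def mmd2(x, y, gamma=0.5):
    return rbf(x, x, gamma).mean() + rbf(y, y, gamma).mean() - 2 * rbf(x, y, gamma).mean()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(300, 2))
y = rng.normal(0.5, 1.0, size=(300, 2))   # mean-shifted sample
print("MMD^2 estimate:", mmd2(x, y))
```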

    Scalable high-resolution forecasting of sparse spatiotemporal events with kernel methods: a winning solution to the NIJ "Real-Time Crime Forecasting Challenge"

    We propose a generic spatiotemporal event forecasting method, which we developed for the National Institute of Justice's (NIJ) Real-Time Crime Forecasting Challenge (National Institute of Justice, 2017). Our method is a spatiotemporal forecasting model combining scalable randomized Reproducing Kernel Hilbert Space (RKHS) methods for approximating Gaussian processes with autoregressive smoothing kernels in a regularized supervised learning framework. While the smoothing kernels capture the two main approaches in current use in the field of crime forecasting, kernel density estimation (KDE) and self-exciting point process (SEPP) models, the RKHS component of the model can be understood as an approximation to the popular log-Gaussian Cox process model. For inference, we discretize the spatiotemporal point pattern and learn a log-intensity function using the Poisson likelihood and highly efficient gradient-based optimization methods. Model hyperparameters, including the quality of the RKHS approximation, spatial and temporal kernel lengthscales, the number of autoregressive lags, bandwidths for smoothing kernels, as well as cell shape, size, and rotation, were learned using cross-validation. The resulting predictions significantly exceeded baseline KDE estimates and SEPP models for sparse events.
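    A minimal sketch of the core idea: approximate an RBF-kernel Gaussian process with random Fourier features and learn a log-intensity over grid cells with a Poisson likelihood. The synthetic counts, grid size, feature count, and lengthscale are assumptions, and the autoregressive smoothing-kernel component of the full model is omitted.

```python
# Random-feature GP approximation + Poisson regression of gridded event counts.
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
gx, gy = np.meshgrid(np.linspace(0, 1, 30), np.linspace(0, 1, 30))
cells = np.column_stack([gx.ravel(), gy.ravel()])          # cell centroids
true_rate = 3.0 * np.exp(-20 * ((cells[:, 0] - 0.3) ** 2 +
                                (cells[:, 1] - 0.6) ** 2))  # one synthetic hotspot
counts = rng.poisson(true_rate)                             # observed events per cell

features = RBFSampler(gamma=15.0, n_components=200, random_state=0)
Phi = features.fit_transform(cells)                         # random-feature GP approximation
model = PoissonRegressor(alpha=1e-3, max_iter=1000).fit(Phi, counts)
pred_rate = model.predict(Phi)                              # forecast intensity surface
print("predicted rate at observed hotspot:", round(pred_rate[counts.argmax()], 2))
```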

    Evaluating distributional regression strategies for modelling self-reported sexual age-mixing

    The age dynamics of sexual partnership formation determine patterns of sexually transmitted disease transmission and have long been a focus of researchers studying human immunodeficiency virus. Data on self-reported sexual partner age distributions are available from a variety of sources. We sought statistical models that accurately predict the distribution of sexual partner ages over age and sex. We identified which probability distributions and outcome specifications best captured variation in partner age and quantified the benefits of modelling these data using distributional regression. We found that distributional regression with a sinh-arcsinh distribution replicated observed partner age distributions most accurately across three geographically diverse data sets. This framework can be extended with well-known hierarchical modelling tools and can help improve estimates of sexual age-mixing dynamics.
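    A sketch of the sinh-arcsinh distribution at the heart of the preferred model: sampling a location-scale sinh-arcsinh variable in the Jones and Pewsey parameterisation, with skewness eps and tail weight delta. The parameter values are illustrative assumptions, not estimates from the paper's data.

```python
# Sampling from a location-scale sinh-arcsinh distribution via its normal representation.
import numpy as np

def sample_sinh_arcsinh(n, loc, scale, eps, delta, seed=0):
    z = np.random.default_rng(seed).standard_normal(n)
    # If Z ~ N(0,1), then sinh((arcsinh(Z) + eps) / delta) is sinh-arcsinh distributed.
    return loc + scale * np.sinh((np.arcsinh(z) + eps) / delta)

ages = sample_sinh_arcsinh(10_000, loc=28.0, scale=6.0, eps=0.4, delta=0.9)
print("mean:", ages.mean().round(1), "quantiles:", np.quantile(ages, [0.1, 0.5, 0.9]).round(1))
```

    In a distributional regression, each of the four parameters (location, scale, skewness, tail weight) would itself be modelled as a function of covariates such as respondent age and sex, which is what lets the fitted model reproduce how partner age distributions shift and widen across groups.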

    PriorVAE: encoding spatial priors with variational autoencoders for small-area estimation.

    Gaussian processes (GPs), implemented through multivariate Gaussian distributions for a finite collection of data, are the most popular approach in small-area spatial statistical modelling. In this context, they are used to encode correlation structures over space and can generalize well in interpolation tasks. Despite their flexibility, off-the-shelf GPs present serious computational challenges which limit their scalability and practical usefulness in applied settings. Here, we propose a novel deep generative modelling approach to tackle this challenge, termed PriorVAE: for a particular spatial setting, we approximate a class of GP priors through prior sampling and subsequent fitting of a variational autoencoder (VAE). Given a trained VAE, the resultant decoder allows spatial inference to become highly efficient due to the low-dimensional, independently distributed latent Gaussian space representation of the VAE. Once trained, inference using the VAE decoder replaces the GP within a Bayesian sampling framework. This approach provides a tractable and easy-to-implement means of approximately encoding spatial priors and facilitates efficient statistical inference. We demonstrate the utility of our two-stage VAE approach on Bayesian small-area estimation tasks.
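    A compact sketch of the two-stage idea under simplifying assumptions: draw samples from a GP prior on a fixed 1-D grid, then train a small VAE on those draws so the decoder can later stand in for the GP prior inside a sampler. The architecture, lengthscale, and training budget are illustrative, not the paper's settings.

```python
# Stage 1: GP prior draws on a fixed grid; Stage 2: fit a small VAE to those draws.
import numpy as np
import torch
from torch import nn

grid = np.linspace(0, 1, 64)
K = np.exp(-0.5 * (grid[:, None] - grid[None, :]) ** 2 / 0.1**2) + 1e-6 * np.eye(64)
L = np.linalg.cholesky(K)
draws = (L @ np.random.default_rng(0).standard_normal((64, 5000))).T   # (5000, 64) prior draws
data = torch.tensor(draws, dtype=torch.float32)

enc = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2 * 8))  # outputs mean and logvar
dec = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 64))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

for step in range(2000):
    x = data[torch.randint(0, data.shape[0], (128,))]
    mu, logvar = enc(x).chunk(2, dim=-1)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)              # reparameterisation trick
    kl = -0.5 * torch.sum(1 + logvar - mu**2 - logvar.exp(), dim=-1).mean()
    loss = ((dec(z) - x) ** 2).sum(dim=-1).mean() + kl
    opt.zero_grad(); loss.backward(); opt.step()

# The trained decoder maps an 8-dim standard-normal latent to an approximate GP draw,
# so downstream Bayesian sampling can target the cheap latent instead of the full GP.
```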

    Modeling and forecasting art movements with CGANs

    Conditional generative adversarial networks (CGANs) are a recent and popular method for generating samples from a probability distribution conditioned on latent information. The latent information often comes in the form of a discrete label from a small set. We propose a novel method for training CGANs which allows us to condition on a sequence of continuous latent distributions f(1), …, f(K). This training allows CGANs to generate samples from a sequence of distributions. We apply our method to paintings from a sequence of artistic movements, where each movement is considered to be its own distribution. Exploiting the temporal aspect of the data, a vector autoregressive (VAR) model is fitted to the means of the latent distributions that we learn and used for one-step-ahead forecasting to predict the latent distribution of a future art movement, f(K+1). Realizations from this distribution can be used by the CGAN to generate ‘future’ paintings. In experiments, this novel methodology generates accurate predictions of the evolution of art. The training set consists of a large dataset of past paintings. While there is no agreement on exactly which current art period we find ourselves in, we test on plausible candidate sets of present art and show that the mean distance to our predictions is small.
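    A sketch of the forecasting stage only: fit a vector autoregression to the mean vectors of the per-movement latent distributions and predict the next one. The latent means here are synthetic stand-ins, and the CGAN that would turn the forecast latent into images is omitted.

```python
# One-step-ahead VAR forecast of latent-distribution means (synthetic example).
import numpy as np
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
K, d = 10, 4                                                   # movements, latent dim (assumed)
latent_means = np.cumsum(rng.normal(0, 0.3, size=(K, d)), axis=0)   # drifting mean vectors

results = VAR(latent_means).fit(maxlags=1)
next_mean = results.forecast(latent_means[-results.k_ar:], steps=1)[0]
print("forecast latent mean for movement K+1:", np.round(next_mean, 2))
```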

    Probabilistic Analysis of Facility Location on Random Shortest Path Metrics

    The facility location problem is an NP-hard optimization problem, so approximation algorithms are often used to solve large instances. Such algorithms often perform much better than worst-case analysis suggests, and probabilistic analysis is therefore a widely used tool for analyzing them. Most research on probabilistic analysis of NP-hard optimization problems involving metric spaces, such as the facility location problem, has focused on Euclidean instances; instances with independent (random) edge lengths, which are non-metric, have also been studied. We would like to extend this knowledge to other, more general metrics. We investigate the facility location problem using random shortest path metrics. We analyze some probabilistic properties of a simple greedy heuristic that gives a solution to the facility location problem: opening the κ cheapest facilities (with κ depending only on the facility opening costs). If the facility opening costs are such that κ is not too large, we show that this heuristic is asymptotically optimal. On the other hand, for large values of κ, the analysis becomes more difficult, and we provide a closed-form expression as an upper bound on the expected approximation ratio. In the special case where all facility opening costs are equal, this closed-form expression reduces to O((ln n)^(1/4)), O(1), or even 1 + o(1) if the opening costs are sufficiently small.
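    A small sketch of the greedy heuristic on a random shortest-path metric: draw exponential edge weights on a complete graph, take the shortest-path closure as the metric, open the k cheapest facilities, and connect every client to its nearest open facility. The instance size, opening costs, and fixed k are illustrative assumptions (in the paper κ is determined by the opening costs).

```python
# Greedy "open the k cheapest facilities" heuristic on a random shortest-path metric.
import numpy as np
from scipy.sparse.csgraph import shortest_path

rng = np.random.default_rng(0)
n = 30
upper = np.triu(rng.exponential(1.0, size=(n, n)), 1)
weights = upper + upper.T                                   # symmetric complete graph, exp(1) edges
dist = shortest_path(weights, method="FW")                  # shortest-path closure = the metric

opening_cost = rng.uniform(0.5, 2.0, size=n)
k = 5
open_facilities = np.argsort(opening_cost)[:k]              # greedy: k cheapest facilities
connection = dist[:, open_facilities].min(axis=1).sum()     # each client to nearest open facility
total = opening_cost[open_facilities].sum() + connection
print("greedy solution cost:", round(total, 3))
```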