
    Probabilistic Size-constrained Microclustering

    Microclustering refers to clustering models that produce small clusters or, equivalently, to models in which cluster sizes grow sublinearly with the number of samples. We formulate probabilistic microclustering models by assigning a prior distribution on the size of the clusters, and in particular consider microclustering models with explicit bounds on cluster size. The combinatorial constraints make full Bayesian inference complicated, but we develop a Gibbs sampling algorithm that can efficiently sample from the joint cluster allocation of all data points. We empirically demonstrate the computational efficiency of the algorithm on problem instances of varying difficulty.
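    The paper's contribution is a sampler over the joint cluster allocation; as a rough, hypothetical illustration of the size-bound idea only, the sketch below runs a single-site Gibbs sweep for a toy Gaussian mixture in which no cluster may exceed a hard size cap. The model, function name, and parameters are assumptions for illustration, not taken from the paper.

    ```python
    import numpy as np

    def gibbs_sweep_with_size_cap(z, X, mu, max_size, sigma=1.0, alpha=1.0, rng=None):
        """One single-site Gibbs sweep for a toy Gaussian mixture whose
        clusters may hold at most `max_size` points (hypothetical sketch).

        z        : current integer cluster labels, shape (n,)
        X        : data, shape (n, d)
        mu       : cluster means, shape (K, d)
        max_size : hard upper bound on cluster size
        Assumes K * max_size >= n so at least one cluster is always available.
        """
        rng = np.random.default_rng() if rng is None else rng
        n, K = len(X), len(mu)
        for i in range(n):
            counts = np.bincount(np.delete(z, i), minlength=K)
            # Gaussian log-likelihood of point i under each cluster mean
            loglik = -0.5 * np.sum((X[i] - mu) ** 2, axis=1) / sigma**2
            logprior = np.log(counts + alpha)
            # clusters already at the size bound receive zero probability
            logp = np.where(counts < max_size, loglik + logprior, -np.inf)
            p = np.exp(logp - logp.max())
            z[i] = rng.choice(K, p=p / p.sum())
        return z
    ```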

    Few-to-few Cross-domain Object Matching

    Cross-domain object matching refers to the task of inferring an unknown alignment between objects in two data collections that do not share a data representation. In recent years several methods have been proposed for the special case in which each object is paired with exactly one object, resulting in a constrained optimization problem over permutations. A related problem formulation, cluster matching, seeks to match a cluster of objects in one data set to a cluster of objects in the other set; it can be considered a many-to-many extension of cross-domain object matching and can be solved without explicit constraints. In this work we study the intermediate region between these two special cases, presenting a range of Bayesian inference algorithms that work also for few-to-few cross-domain object matching problems, where constrained optimization is necessary but the optimization domain is broader than just permutations.
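    As a loose illustration of Bayesian inference under matching constraints, the sketch below samples the one-to-one (permutation) special case mentioned in the abstract with simple Metropolis swap proposals; the few-to-few setting of the paper works over a broader constraint set. The scoring matrix and function name are hypothetical.

    ```python
    import numpy as np

    def mh_permutation_matching(score, n_iter=10_000, rng=None):
        """Metropolis sampler over permutations for the one-to-one special
        case of cross-domain matching (illustrative sketch only).

        score : (n, n) matrix of log match scores; score[i, j] is the
                log-probability gain of pairing object i with object j.
        """
        rng = np.random.default_rng() if rng is None else rng
        n = score.shape[0]
        perm = rng.permutation(n)                 # current matching
        logp = score[np.arange(n), perm].sum()
        for _ in range(n_iter):
            i, j = rng.choice(n, size=2, replace=False)
            prop = perm.copy()
            prop[i], prop[j] = perm[j], perm[i]   # swap two assignments
            logp_prop = score[np.arange(n), prop].sum()
            if np.log(rng.random()) < logp_prop - logp:
                perm, logp = prop, logp_prop
        return perm
    ```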

    On Controlling the Size of Clusters in Probabilistic Clustering

    Classical model-based partitional clustering algorithms, such as k-means or mixtures of Gaussians, provide only loose and indirect control over the size of the resulting clusters. In this work, we present a family of probabilistic clustering models that can be steered towards clusters of desired size by providing a prior distribution over the possible sizes, allowing the analyst to fine-tune exploratory analysis or to produce clusters of suitable size for downstream processing. Our formulation supports arbitrary multimodal prior distributions, generalizing previous work on clustering algorithms that search for clusters of equal size and on algorithms designed for the microclustering task of finding small clusters. We provide practical methods for solving the problem, using integer programming to make the cluster assignments, and demonstrate that we can also automatically infer the number of clusters.
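    One classical way to enforce exact cluster sizes, sketched below under the assumption that the target sizes are fixed in advance, is to replicate each cluster into as many slots as its size and solve the resulting assignment problem; the paper's formulation is more general, placing a prior over sizes and solving an integer program. Function and variable names here are illustrative.

    ```python
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def assign_with_fixed_sizes(X, centers, sizes):
        """Assign points to clusters with prescribed cluster sizes by
        expanding each cluster into `sizes[k]` slots and solving the
        resulting assignment problem (a simplified special case).

        X       : (n, d) data
        centers : (K, d) cluster centers
        sizes   : length-K list of target sizes with sum(sizes) == n
        """
        cost = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (n, K)
        slot_owner = np.repeat(np.arange(len(centers)), sizes)       # cluster of each slot
        slot_cost = cost[:, slot_owner]                              # (n, sum(sizes))
        rows, cols = linear_sum_assignment(slot_cost)
        labels = np.empty(len(X), dtype=int)
        labels[rows] = slot_owner[cols]
        return labels
    ```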

    Sampling and Inference for Beta Neutral-to-the-Left Models of Sparse Networks

    Empirical evidence suggests that the heavy-tailed degree distributions occurring in many real networks are well-approximated by power laws with exponents η that may take values either less than or greater than two. Models based on various forms of exchangeability are able to capture power laws with η < 2 and admit tractable inference algorithms; we draw on previous results to show that η > 2 cannot be generated by the forms of exchangeability used in existing random graph models. Preferential attachment models generate power-law exponents greater than two, but have been of limited use as statistical models due to the inherent difficulty of performing inference in non-exchangeable models. Motivated by this gap, we design and implement inference algorithms for a recently proposed class of models that generates η of all possible values. We show that although they are not exchangeable, these models have probabilistic structure amenable to inference. Our methods make a large class of previously intractable models useful for statistical inference.
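    For intuition only, the sketch below generates degrees from a plain preferential-attachment tree, a classic non-exchangeable sequential model whose degree distribution follows a power law with exponent greater than two; it is not the Beta Neutral-to-the-Left model studied in the paper.

    ```python
    import numpy as np

    def preferential_attachment_degrees(n_vertices, rng=None):
        """Minimal preferential-attachment sketch: each arriving vertex
        attaches one edge to an existing vertex chosen with probability
        proportional to its degree, producing a heavy-tailed degree
        distribution with power-law exponent greater than two."""
        rng = np.random.default_rng() if rng is None else rng
        degrees = np.zeros(n_vertices, dtype=int)
        degrees[0] = degrees[1] = 1           # seed graph: a single edge
        for t in range(2, n_vertices):
            p = degrees[:t] / degrees[:t].sum()
            target = rng.choice(t, p=p)       # attach proportionally to degree
            degrees[target] += 1
            degrees[t] = 1
        return degrees
    ```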

    Fast Bayesian Record Linkage for Streaming Data Contexts

    Record linkage is the task of combining records from multiple files that refer to overlapping sets of entities when there is no unique identifying field. In streaming record linkage, files arrive sequentially in time and estimates of links are updated after the arrival of each file. This problem arises in settings such as longitudinal surveys, electronic health records, and online events databases, among others. The challenge in streaming record linkage is to efficiently update parameter estimates as new data arrive. We approach the problem from a Bayesian perspective, with estimates in the form of posterior samples of parameters, and present methods for updating link estimates after the arrival of a new file that are faster than fitting a joint model with each new data file. In this paper, we generalize a two-file Bayesian Fellegi-Sunter model to the multi-file case and propose two methods to perform streaming updates. We examine the effect of the prior distribution on the resulting linkage accuracy, as well as the computational trade-offs between the methods when compared to a Gibbs sampler, through simulated and real-world survey panel data. We achieve near-equivalent posterior inference at a small fraction of the compute time.
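    The classical two-file Fellegi-Sunter model that the paper generalizes scores a candidate record pair by a log-likelihood-ratio weight built from field-level m- and u-probabilities. A minimal sketch of that weight computation, with made-up field names and probabilities, follows.

    ```python
    import numpy as np

    def fellegi_sunter_weight(agree, m, u):
        """Log-likelihood-ratio match weight for one record pair under the
        classical two-file Fellegi-Sunter model.

        agree : boolean array, field-wise agreement pattern for the pair
        m     : P(field agrees | records are a true match), per field
        u     : P(field agrees | records are not a match), per field
        """
        agree = np.asarray(agree, dtype=bool)
        m, u = np.asarray(m), np.asarray(u)
        w_agree = np.log(m / u)
        w_disagree = np.log((1 - m) / (1 - u))
        return np.where(agree, w_agree, w_disagree).sum()

    # Example with three hypothetical comparison fields (name, birth year, zip code)
    w = fellegi_sunter_weight([True, True, False],
                              m=[0.95, 0.90, 0.85],
                              u=[0.10, 0.05, 0.20])
    ```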

    Bayesian Learning of Graph Substructures

    Graphical models provide a powerful methodology for learning the conditional independence structure in multivariate data. Inference is often focused on estimating individual edges in the latent graph. Nonetheless, there is increasing interest in inferring more complex structures, such as communities, for multiple reasons, including more effective information retrieval and better interpretability. Stochastic blockmodels offer a powerful tool to detect such structure in a network. We thus propose to exploit advances in random graph theory and embed them within the graphical models framework. A consequence of this approach is the propagation of the uncertainty in graph estimation to large-scale structure learning. We consider Bayesian nonparametric stochastic blockmodels as priors on the graph. We extend such models to consider clique-based blocks and multiple-graph settings, introducing a novel prior process based on a dependent Dirichlet process. Moreover, we devise a tailored strategy for computing Bayes factors for block structure, based on the Savage-Dickey ratio, to test for the presence of larger structure in a graph. We demonstrate our approach in simulations as well as on real data applications in finance and transcriptomics.
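    The Savage-Dickey ratio mentioned in the abstract evaluates a Bayes factor for a point null as the ratio of posterior to prior density at the null value. As a toy illustration in a beta-binomial model (not the paper's block-structure setting), a sketch:

    ```python
    from scipy.stats import beta

    def savage_dickey_bf01(k, n, theta0=0.5, a=1.0, b=1.0):
        """Savage-Dickey density ratio for the point null H0: theta = theta0
        in a beta-binomial model.

        BF01 = p(theta0 | data) / p(theta0): posterior over prior density
        evaluated at the null value, with posterior Beta(a + k, b + n - k).
        """
        posterior = beta(a + k, b + n - k)
        prior = beta(a, b)
        return posterior.pdf(theta0) / prior.pdf(theta0)

    # Example: 7 successes out of 20 trials, testing theta = 0.5
    bf01 = savage_dickey_bf01(k=7, n=20)
    ```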

    Bayesian methods and data science with health informatics data

    Cancer is a complex disease, driven by a range of genetic and environmental factors. Every year millions of people are diagnosed with a type of cancer, and the survival prognosis for many of them is poor due to the lack of understanding of the causes of some cancers. Modern large-scale studies offer a great opportunity to study the mechanisms underlying different types of cancer, but they also bring the challenges of selecting informative features, estimating the number of cancer subtypes, and providing interpretable results. In this thesis, we address these challenges by developing efficient clustering algorithms based on Dirichlet process mixture models, which can be applied to different data types (continuous, discrete, mixed) and to multiple data sources (in our case, molecular and clinical data) simultaneously. We show how our methodology addresses the drawbacks of widely used clustering methods such as k-means and iClusterPlus. We also introduce a more efficient version of the clustering methods by using simulated annealing in the inference stage. We apply the data integration methods to data from The Cancer Genome Atlas (TCGA), which include clinical and molecular data on glioblastoma, breast cancer, colorectal cancer, and pancreatic cancer. We find subtypes that are prognostic of overall survival in two aggressive cancers, pancreatic cancer and glioblastoma, and that were not identified by the comparison models. We analyse a Hospital Episode Statistics (HES) dataset comprising clinical information about all pancreatic cancer patients operated on in the United Kingdom during 2001 to 2016. We investigate the effect of centralisation on the short- and long-term survival of the patients, and the factors affecting patient survival. Our analyses show that higher-volume surgery centres are associated with lower 90-day mortality rates and that age, index of multiple deprivation, and diagnosis type are significant risk factors for short-term survival. Our findings suggest that the analysis of large, complex molecular datasets, coupled with methodological advances, can allow us to gain valuable insights into the cancer genome and the associated molecular mechanisms.
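    The clustering machinery described above builds on Dirichlet process mixture models, whose prior over partitions is the Chinese restaurant process. A minimal sketch of drawing a partition from that prior is below; the thesis couples it with likelihoods for continuous, discrete, and mixed data and with simulated annealing, none of which is shown here.

    ```python
    import numpy as np

    def crp_partition(n, alpha=1.0, rng=None):
        """Draw a partition of n items from the Chinese restaurant process,
        the prior over clusterings underlying Dirichlet process mixtures.
        Item i joins an existing cluster with probability proportional to
        its size, or opens a new cluster with probability proportional to alpha.
        """
        rng = np.random.default_rng() if rng is None else rng
        labels = [0]
        for _ in range(1, n):
            counts = np.bincount(labels)
            probs = np.append(counts, alpha).astype(float)
            probs /= probs.sum()
            labels.append(int(rng.choice(len(probs), p=probs)))
        return np.asarray(labels)

    # Example: a prior draw of cluster structure for 100 patients
    z = crp_partition(100, alpha=1.0)
    ```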