Probabilistic Size-constrained Microclustering
Microclustering refers to clustering models that produce small clusters or, equivalently, to models in which cluster sizes grow sublinearly with the number of samples. We formulate probabilistic microclustering models by assigning a prior distribution to the size of the clusters, and in particular consider microclustering models with explicit bounds on the cluster sizes. The combinatorial constraints make full Bayesian inference complicated, but we develop a Gibbs sampling algorithm that can efficiently sample from the joint cluster allocation of all data points. We empirically demonstrate the computational efficiency of the algorithm on problem instances of varying difficulty.
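As an illustration of the idea (a minimal sketch, not the paper's algorithm): a Gibbs-style sweep can resample each point's cluster from a conditional that multiplies the point's likelihood under each cluster by a prior on the size the cluster would have after the move; a hard size bound simply zeroes out moves that would exceed the cap. The function names and the uniform-up-to-cap prior below are illustrative assumptions.

```python
import math
import random

def size_prior_logpdf(size, cap):
    """Toy prior over cluster sizes: uniform on 1..cap, impossible beyond."""
    return 0.0 if 1 <= size <= cap else float("-inf")

def gibbs_reassign(assign, point_loglik, n_clusters, cap, rng):
    """One Gibbs sweep: resample each point's cluster from a conditional
    combining the likelihood with the size-constrained prior.  The prior
    must leave at least one feasible move for every point."""
    sizes = [0] * n_clusters
    for c in assign:
        sizes[c] += 1
    for i in range(len(assign)):
        sizes[assign[i]] -= 1          # remove point i from its cluster
        logw = [point_loglik(i, c) + size_prior_logpdf(sizes[c] + 1, cap)
                for c in range(n_clusters)]
        m = max(logw)
        w = [math.exp(v - m) for v in logw]
        r = rng.random() * sum(w)      # sample proportionally to w
        c, acc = 0, w[0]
        while acc < r:
            c += 1
            acc += w[c]
        assign[i] = c
        sizes[c] += 1
    return assign
```

With a flat likelihood this just samples allocations that respect the cap; a real model would plug in per-cluster log-likelihoods.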
Few-to-few Cross-domain Object Matching
Cross-domain object matching refers to the task of inferring an unknown alignment between objects in two data collections that do not share a data representation. In recent years several methods have been proposed for the special case in which each object is to be paired with exactly one object, resulting in a constrained optimization problem over permutations. A related problem formulation, cluster matching, seeks to match a cluster of objects in one data set to a cluster of objects in the other set; it can be considered a many-to-many extension of cross-domain object matching and can be solved without explicit constraints. In this work we study the intermediate region between these two special cases, presenting a range of Bayesian inference algorithms that work also for few-to-few cross-domain object matching problems, where constrained optimization is necessary but the optimization domain is broader than just permutations.
Peer reviewed
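For intuition, the one-to-one special case reduces to a linear assignment problem once pairwise match scores are available. The sketch below (the similarity matrix `sim` is made up for illustration; obtaining such scores is the model-specific part) uses SciPy's assignment solver to recover the best permutation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# sim[i, j] scores how well object i in collection A matches object j in
# collection B; these values are illustrative placeholders.
sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.8, 0.1],
                [0.0, 0.3, 0.7]])

# linear_sum_assignment minimises cost, so negate to maximise similarity.
rows, cols = linear_sum_assignment(-sim)
matching = dict(zip(rows, cols))   # one-to-one alignment A -> B
```

The few-to-few setting relaxes exactly this permutation constraint while still forbidding arbitrary many-to-many matches.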
On Controlling the Size of Clusters in Probabilistic Clustering
Classical model-based partitional clustering algorithms, such as k-means or mixtures of Gaussians, provide only loose and indirect control over the size of the resulting clusters. In this work, we present a family of probabilistic clustering models that can be steered towards clusters of desired size by providing a prior distribution over the possible sizes, allowing the analyst to fine-tune exploratory analysis or to produce clusters of suitable size for downstream processing. Our formulation supports arbitrary multimodal prior distributions, generalizing previous work on clustering algorithms that search for clusters of equal size and on algorithms designed for the microclustering task of finding small clusters. We provide practical methods for solving the problem, using integer programming for making the cluster assignments, and demonstrate that we can also automatically infer the number of clusters.
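One concrete way to enforce exact size targets in the assignment step (an illustration of the combinatorial structure, not the paper's integer-programming formulation): replicate each centroid once per unit of its target size and solve the resulting balanced assignment problem. The function name and setup below are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sized_cluster_assign(X, centroids, sizes):
    """Assign each row of X to a cluster so that cluster c receives exactly
    sizes[c] points, minimising total squared distance.  Each centroid is
    replicated sizes[c] times ("slots") and the problem becomes a balanced
    linear assignment."""
    assert sum(sizes) == len(X), "target sizes must sum to the number of points"
    slot_cluster = np.repeat(np.arange(len(centroids)), sizes)
    slots = np.asarray(centroids)[slot_cluster]            # (n_slots, d)
    cost = ((X[:, None, :] - slots[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)
    labels = np.empty(len(X), dtype=int)
    labels[rows] = slot_cluster[cols]
    return labels
```

Arbitrary size priors, as in the paper, require a genuine integer program rather than this fixed-sizes reduction, but the reduction shows why assignment-type solvers are the natural tool.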
Sampling and Inference for Beta Neutral-to-the-Left Models of Sparse Networks
Empirical evidence suggests that the heavy-tailed degree distributions occurring in many real networks are well approximated by power laws with exponents that may take values either less than or greater than two. Models based on various forms of exchangeability are able to capture power laws with exponents less than two, and admit tractable inference algorithms; we draw on previous results to show that exponents greater than two cannot be generated by the forms of exchangeability used in existing random graph models. Preferential attachment models generate power-law exponents greater than two, but have been of limited use as statistical models due to the inherent difficulty of performing inference in non-exchangeable models. Motivated by this gap, we design and implement inference algorithms for a recently proposed class of models that generates power-law exponents of all possible values. We show that although they are not exchangeable, these models have probabilistic structure amenable to inference. Our methods make a large class of previously intractable models useful for statistical inference.
Comment: Accepted for publication in the proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI) 2018
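For intuition on the preferential attachment side, a minimal simulation (standard rich-get-richer growth with one edge per new node, not the Beta NTL model itself) shows the heavy-tailed degrees such dynamics produce:

```python
import random

def preferential_attachment(n, rng):
    """Grow a graph one node at a time; each new node attaches by a single
    edge to an existing node chosen with probability proportional to its
    current degree (tracked via a degree-weighted endpoint multiset)."""
    degrees = [1, 1]        # start from one edge between nodes 0 and 1
    targets = [0, 1]        # each node appears once per unit of degree
    for _ in range(2, n):
        t = targets[rng.randrange(len(targets))]   # degree-proportional pick
        new = len(degrees)
        degrees.append(1)
        degrees[t] += 1
        targets.extend([t, new])
    return degrees
```

In a grown graph of a couple thousand nodes the largest hub's degree far exceeds the typical degree, the qualitative signature of a power-law tail.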
Fast Bayesian Record Linkage for Streaming Data Contexts
Record linkage is the task of combining records from multiple files which refer to overlapping sets of entities when there is no unique identifying field. In streaming record linkage, files arrive sequentially in time and estimates of links are updated after the arrival of each file. This problem arises in settings such as longitudinal surveys, electronic health records, and online events databases, among others. The challenge in streaming record linkage is to efficiently update parameter estimates as new data arrives. We approach the problem from a Bayesian perspective with estimates in the form of posterior samples of parameters, and present methods for updating link estimates after the arrival of a new file that are faster than fitting a joint model with each new data file. In this paper, we generalize a two-file Bayesian Fellegi-Sunter model to the multi-file case and propose two methods to perform streaming updates. We examine the effect of the prior distribution on the resulting linkage accuracy, as well as the computational trade-offs between the methods when compared to a Gibbs sampler, through simulated and real-world survey panel data. We achieve near-equivalent posterior inference at a small fraction of the compute time.
Comment: 43 pages, 6 figures, 4 tables. (Main: 32 pages, 4 figures, 3 tables. Supplement: 11 pages, 2 figures, 1 table.) Submitted to Journal of Computational and Graphical Statistics
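For background, the classic two-file Fellegi-Sunter score (the starting point the paper generalizes; this sketch is not the streaming method) sums per-field log-likelihood ratios: log(m/u) on agreement and log((1-m)/(1-u)) on disagreement, where m and u are the field's agreement probabilities under match and non-match.

```python
import math

def fs_log_weight(agreements, m, u):
    """Fellegi-Sunter log-likelihood ratio for one record pair.
    agreements[k] is True if field k agrees; m[k] = P(agree | match),
    u[k] = P(agree | non-match).  Positive totals favour a link."""
    w = 0.0
    for a, mk, uk in zip(agreements, m, u):
        if a:
            w += math.log(mk / uk)
        else:
            w += math.log((1 - mk) / (1 - uk))
    return w
```

A pair agreeing on informative fields (high m, low u) gets a large positive weight; disagreements push the score negative. The Bayesian treatment in the paper places priors on the m and u parameters rather than fixing them.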
Compact Representations of Uncertainty in Clustering
Flat clustering and hierarchical clustering are two fundamental tasks, often used to discover meaningful structures in data, such as subtypes of cancer, phylogenetic relationships, taxonomies of concepts, and cascades of particle decays in particle physics. When multiple clusterings of the data are possible, it is useful to represent uncertainty in clustering through various probabilistic quantities, such as the distribution over partitions or tree structures, and the marginal probabilities of subpartitions or subtrees.
Many compact representations exist for structured prediction problems, enabling the efficient computation of probability distributions, e.g., the trellis structure and corresponding Forward-Backward algorithm for Markov models of sequences. However, no such representation has been proposed for either flat or hierarchical clustering models. In this thesis, we present our work developing data structures and algorithms for computing probability distributions over flat and hierarchical clusterings, as well as for finding maximum a posteriori (MAP) flat and hierarchical clusterings and various marginal probabilities, as given by a wide range of energy-based clustering models.
First, we describe a trellis structure that compactly represents distributions over flat or hierarchical clusterings. We also describe related data structures that represent approximate distributions. We then present algorithms that, using these structures, allow us to compute the partition function, the MAP clustering, and the marginal probabilities of a cluster (and of a sub-hierarchy, in the case of hierarchical clustering) exactly. We also show how these and related algorithms can be used to approximate these values, and analyze the time and space complexity of our proposed methods. We demonstrate the utility of our approaches on various synthetic data of interest as well as in two real-world applications, namely particle physics at the Large Hadron Collider at CERN and cancer genomics. We conclude with a brief discussion of future work.
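A minimal sketch of the kind of exact computation such a structure enables for flat clustering (a plain bitmask dynamic program, not the thesis's trellis): the partition function satisfies Z(S) = Σ over clusters C containing a fixed element of S of E(C) · Z(S \ C), which reuses subproblems instead of enumerating every partition.

```python
from functools import lru_cache

def partition_function(n, energy):
    """Sum over all partitions of {0..n-1} of the product of cluster
    energies, where energy(mask) scores one cluster given as a bitmask.
    Exponential in n, but far cheaper than listing all partitions."""
    full = (1 << n) - 1

    @lru_cache(maxsize=None)
    def Z(mask):
        if mask == 0:
            return 1.0
        low = mask & -mask            # lowest set element anchors its cluster
        rest = mask ^ low
        total = 0.0
        sub = rest
        while True:                   # enumerate all subsets of `rest`
            cluster = low | sub
            total += energy(cluster) * Z(mask ^ cluster)
            if sub == 0:
                break
            sub = (sub - 1) & rest
        return total

    return Z(full)
```

With energy identically 1 this recovers the Bell numbers (the count of partitions), a handy sanity check.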
Bayesian Learning of Graph Substructures
Graphical models provide a powerful methodology for learning the conditional independence structure in multivariate data. Inference is often focused on estimating individual edges in the latent graph. Nonetheless, there is increasing interest in inferring more complex structures, such as communities, for multiple reasons, including more effective information retrieval and better interpretability. Stochastic blockmodels offer a powerful tool to detect such structure in a network. We thus propose to exploit advances in random graph theory and embed them within the graphical models framework. A consequence of this approach is the propagation of the uncertainty in graph estimation to large-scale structure learning. We consider Bayesian nonparametric stochastic blockmodels as priors on the graph. We extend such models to consider clique-based blocks and to multiple graph settings, introducing a novel prior process based on a dependent Dirichlet process. Moreover, we devise a tailored computational strategy for Bayes factors on block structure, based on the Savage-Dickey ratio, to test for the presence of larger structure in a graph. We demonstrate our approach in simulations as well as on real data applications in finance and transcriptomics.
Comment: 35 pages, 7 figures
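For background on the last step, the Savage-Dickey ratio evaluates a Bayes factor for a point null as the ratio of posterior to prior density at the null value. A minimal conjugate-normal illustration (not the paper's block-structure test; the model and values are made up):

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def savage_dickey_bf01(data, sigma2, tau2, theta0=0.0):
    """Bayes factor BF01 for H0: theta = theta0 vs H1: theta ~ N(0, tau2),
    with likelihood data[i] ~ N(theta, sigma2).  The conjugate posterior is
    normal, so the Savage-Dickey ratio p(theta0 | data) / p(theta0) is
    available in closed form."""
    n = len(data)
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)
    post_mean = (sum(data) / sigma2) * post_var
    return normal_pdf(theta0, post_mean, post_var) / normal_pdf(theta0, 0.0, tau2)
```

Data concentrated away from the null drives BF01 below 1 (evidence against H0); data at the null drives it above 1. The paper applies the same ratio to block-structure hypotheses rather than a scalar mean.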
Bayesian methods and data science with health informatics data
Cancer is a complex disease, driven by a range of genetic and environmental factors. Every year millions of people are diagnosed with a type of cancer, and the survival prognosis for many of them is poor due to the lack of understanding of the causes of some cancers. Modern large-scale studies offer a great opportunity to study the mechanisms underlying different types of cancer, but also bring the challenges of selecting informative features, estimating the number of cancer subtypes, and providing interpretable results.
In this thesis, we address these challenges by developing efficient clustering algorithms based on Dirichlet process mixture models which can be applied to different data types (continuous, discrete, mixed) and to multiple data sources (in our case, molecular and clinical data) simultaneously. We show how our methodology addresses the drawbacks of widely used clustering methods such as k-means and iClusterPlus. We also introduce a more efficient version of the clustering methods by using simulated annealing in the inference stage.
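For background, the Dirichlet process prior over partitions that underlies such mixture models can be simulated with the Chinese restaurant process; a minimal sketch (illustrative only, not the thesis's inference code):

```python
import random

def crp_sample(n, alpha, rng):
    """Draw a partition of n items from the Chinese restaurant process:
    item i joins an existing cluster with probability proportional to the
    cluster's size, or opens a new cluster with probability proportional
    to the concentration parameter alpha."""
    assign, sizes = [], []
    for i in range(n):
        r = rng.random() * (i + alpha)   # total weight of all options
        acc = 0.0
        for c, s in enumerate(sizes):
            acc += s
            if r < acc:                  # join existing cluster c
                assign.append(c)
                sizes[c] += 1
                break
        else:                            # open a new cluster
            assign.append(len(sizes))
            sizes.append(1)
    return assign
```

The rich-get-richer dynamics mean the number of clusters grows only logarithmically in n, which is why the number of subtypes need not be fixed in advance.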
We apply the data integration methods to data from The Cancer Genome Atlas (TCGA), which include clinical and molecular data on glioblastoma, breast cancer, colorectal cancer, and pancreatic cancer. We find subtypes that are prognostic of overall survival in two aggressive types of cancer, pancreatic cancer and glioblastoma, which were not identified by the comparison models. We also analyse a Hospital Episode Statistics (HES) dataset comprising clinical information about all pancreatic cancer patients in the United Kingdom who underwent surgery during the period 2001-2016. We investigate the effect of centralisation on the short- and long-term survival of the patients, and the factors affecting patient survival. Our analyses show that higher-volume surgery centres are associated with lower 90-day mortality rates, and that age, index of multiple deprivation, and diagnosis type are significant risk factors for short-term survival.
Our findings suggest that the analysis of large, complex molecular datasets, coupled with methodological advances, can allow us to gain valuable insights into the cancer genome and the associated molecular mechanisms.