Mixtures of Hierarchical Topics with Pachinko Allocation
The four-level pachinko allocation model (PAM) (Li & McCallum, 2006) represents correlations among topics using a DAG structure. It does not, however, represent a nested hierarchy of topics, with some topical word distributions representing the vocabulary that is shared among several more specific topics. This paper presents hierarchical PAM, an enhancement that explicitly represents a topic hierarchy. This model can be seen as combining the advantages of hLDA's topical hierarchy representation with PAM's ability to mix multiple leaves of the topic hierarchy. Experimental results show improvements in likelihood of held-out documents, as well as mutual information between automatically-discovered topics and human-generated categories such as journals.
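As a rough illustration of the DAG structure this abstract refers to, below is a minimal sketch of the plain four-level PAM generative process (root, super-topics, sub-topics, words) in Python/NumPy. The dimensions, hyperparameter values, and function name are invented for illustration; hierarchical PAM would additionally attach word distributions to the root and super-topic nodes rather than only to the leaves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and hyperparameters (hypothetical, for illustration only)
n_super, n_sub, vocab_size, doc_len = 4, 10, 500, 120
alpha_root, alpha_super, beta = 1.0, 1.0, 0.01

# Word distributions for the sub-topics (the leaves of the DAG)
phi = rng.dirichlet(np.full(vocab_size, beta), size=n_sub)

def generate_document():
    # Per-document multinomial over super-topics (children of the root)
    theta_root = rng.dirichlet(np.full(n_super, alpha_root))
    # Per-document multinomials over sub-topics, one for each super-topic
    theta_super = rng.dirichlet(np.full(n_sub, alpha_super), size=n_super)
    words = []
    for _ in range(doc_len):
        s = rng.choice(n_super, p=theta_root)            # pick a super-topic
        t = rng.choice(n_sub, p=theta_super[s])          # pick a sub-topic under it
        words.append(rng.choice(vocab_size, p=phi[t]))   # emit a word from the sub-topic
    return words

doc = generate_document()
```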
Efficient Correlated Topic Modeling with Topic Embedding
Correlated topic modeling has been limited to small model and problem sizes due to its high computational cost and poor scaling. In this paper, we propose a new model which learns compact topic embeddings and captures topic correlations through the closeness between the topic vectors. Our method enables efficient inference in the low-dimensional embedding space, reducing previous cubic or quadratic time complexity to linear w.r.t. the topic size. We further speed up variational inference with a fast sampler that exploits the sparsity of topic occurrence. Extensive experiments show that our approach is capable of handling model and data scales several orders of magnitude larger than existing correlation results, without sacrificing modeling quality, providing competitive or superior performance in document classification and retrieval.
Comment: KDD 2017 oral. The first two authors contributed equally.
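A back-of-the-envelope sketch (not the paper's actual inference code) of why the embedding idea scales: correlations are read off pairwise from a K x d embedding table instead of being stored as a K x K covariance matrix. The sizes and the cosine-style closeness below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: many topics, small embedding dimension
n_topics, embed_dim = 10_000, 50

# Compact topic embeddings (what the model would learn); random here for illustration
topic_vecs = rng.normal(size=(n_topics, embed_dim))

def correlation(i, j, vecs=topic_vecs):
    """Closeness of two topic vectors stands in for their correlation."""
    vi, vj = vecs[i], vecs[j]
    return float(vi @ vj / (np.linalg.norm(vi) * np.linalg.norm(vj)))

# A full K x K covariance (as in the classical correlated topic model) would need
# n_topics**2 = 1e8 entries; the embedding table needs only n_topics * embed_dim = 5e5,
# and any single correlation is an O(embed_dim) dot product.
print(correlation(0, 1))
```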
Conditional Hierarchical Bayesian Tucker Decomposition
Our research focuses on studying and developing methods for reducing the
dimensionality of large datasets, common in biomedical applications. A major
problem when learning information about patients based on genetic sequencing
data is that there are often more feature variables (genetic data) than
observations (patients). This makes direct supervised learning difficult. One
way of reducing the feature space is to use latent Dirichlet allocation in
order to group genetic variants in an unsupervised manner. Latent Dirichlet
allocation is a common model in natural language processing, which describes a
document as a mixture of topics, each with a probability of generating certain
words. This can be generalized as a Bayesian tensor decomposition to account
for multiple feature variables. While we made some progress improving and
modifying these methods, our significant contributions are in hierarchical
topic modeling. We developed distinct methods of incorporating hierarchical
topic modeling, based on nested Chinese restaurant processes and Pachinko
Allocation Machine, into Bayesian tensor decompositions. We apply these models
to predict whether or not patients have autism spectrum disorder based on
genetic sequencing data. We examine a dataset from the National Database for Autism
Research consisting of paired siblings -- one with autism, and the other
without -- and counts of their genetic variants. Additionally, we linked the
genes with their Reactome biological pathways. We combine this information into
a tensor of patients, counts of their genetic variants, and the membership of
these genes in pathways. Once we decompose this tensor, we use logistic
regression on the reduced features in order to predict if patients have autism.
We also perform a similar analysis of a dataset of patients with one of four
common types of cancer (breast, lung, prostate, and colorectal).
Comment: 20 pages, added model evaluation and log-likelihood section.
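The baseline pipeline described in this abstract (an unsupervised topic model to reduce the feature space, then logistic regression on the reduced features) can be sketched as follows. This is not the thesis's conditional hierarchical Tucker decomposition; the data here are synthetic stand-ins for a patients-by-variant-counts matrix, and all sizes, labels, and parameters are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Hypothetical stand-in data: patients x genetic-variant counts, plus toy labels.
# The thesis works with a patients x variants x pathways tensor and hierarchical
# topic models; plain LDA on a 2-D count matrix only illustrates the baseline idea.
X = rng.poisson(0.3, size=(200, 5_000))   # variant counts
y = rng.integers(0, 2, size=200)          # 1 = autism spectrum disorder (toy labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unsupervised reduction: group variants into a small number of "topics"
lda = LatentDirichletAllocation(n_components=20, random_state=0)
Z_tr = lda.fit_transform(X_tr)
Z_te = lda.transform(X_te)

# Supervised step on the reduced features
clf = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
print("held-out accuracy:", clf.score(Z_te, y_te))
```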
Infinite factorization of multiple non-parametric views
Combined analysis of multiple data sources has increasing application interest, in particular for distinguishing shared and source-specific aspects. We extend this rationale of classical canonical correlation analysis into a flexible, generative and non-parametric clustering setting, by introducing a novel non-parametric hierarchical mixture model. The lower level of the model describes each source with a flexible non-parametric mixture, and the top level combines these to describe commonalities of the sources. The lower-level clusters arise from hierarchical Dirichlet Processes, inducing an infinite-dimensional contingency table between the views. The commonalities between the sources are modeled by an infinite block model of the contingency table, interpretable as non-negative factorization of infinite matrices, or as a prior for infinite contingency tables. With Gaussian mixture components plugged in for continuous measurements, the model is applied to two views of genes, mRNA expression and abundance of the produced proteins, to expose groups of genes that are co-regulated in either or both of the views. Cluster analysis of co-expression is a standard simple way of screening for co-regulation, and the two-view analysis extends the approach to distinguishing between pre- and post-translational regulation.
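To make the contingency-table construction concrete, here is a small sketch that uses finite Gaussian mixtures in place of the paper's hierarchical Dirichlet-process mixtures and simple post-hoc counting in place of the infinite block model. The data, cluster counts, and variable names are all invented for illustration; the actual model couples the two views jointly rather than clustering them separately.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)

# Hypothetical stand-ins for the two views of the same genes:
# mRNA expression and abundance of the produced proteins.
n_genes = 300
mrna = rng.normal(size=(n_genes, 8))
protein = rng.normal(size=(n_genes, 6))

# Finite Gaussian mixtures as a crude stand-in for the nonparametric
# lower-level mixtures of each view.
labels_mrna = GaussianMixture(n_components=5, random_state=0).fit_predict(mrna)
labels_prot = GaussianMixture(n_components=5, random_state=0).fit_predict(protein)

# Contingency table between the two views' cluster assignments; the abstract's
# model places an infinite block model over (an infinite-dimensional version of)
# this table to expose genes co-regulated in one view, the other, or both.
table = np.zeros((5, 5), dtype=int)
for a, b in zip(labels_mrna, labels_prot):
    table[a, b] += 1
print(table)
```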