
    A New Approach to Speeding Up Topic Modeling

    Latent Dirichlet allocation (LDA) is a widely used probabilistic topic modeling paradigm that has recently found many applications in computer vision and computational biology. In this paper, we propose a fast and accurate batch algorithm, active belief propagation (ABP), for training LDA. Batch LDA algorithms usually require repeatedly scanning the entire corpus and searching the complete topic space, so for massive corpora with a large number of topics each training iteration is often inefficient and time-consuming. To accelerate training, ABP actively scans only a subset of the corpus and searches only a subset of the topic space, which saves enormous training time in each iteration. To ensure accuracy, ABP selects only those documents and topics that contribute the largest residuals within the residual belief propagation (RBP) framework. On four real-world corpora, ABP performs around 10 to 100 times faster than state-of-the-art batch LDA algorithms with comparable topic modeling accuracy. Comment: 14 pages, 12 figures
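
    The selection step described in this abstract (keep only the documents or topics with the largest residuals) can be illustrated with a minimal sketch. This is not the authors' ABP implementation: the function name `select_active`, the `fraction` parameter, and the idea of passing in precomputed residuals as a plain list are all assumptions made for illustration.

```python
import heapq
import math


def select_active(residuals, fraction):
    """Return indices of the largest residuals, best first.

    `residuals` is a hypothetical list with one non-negative score per
    document (or topic); `fraction` is the share of items to keep.
    Both are illustrative assumptions, not names from the paper.
    """
    n = len(residuals)
    if n == 0:
        return []
    # Keep at least one item so an iteration always has work to do.
    k = max(1, math.ceil(fraction * n))
    # nlargest runs in O(n log k), cheaper than a full sort when k << n.
    return heapq.nlargest(k, range(n), key=residuals.__getitem__)


if __name__ == "__main__":
    doc_residuals = [0.1, 0.9, 0.5, 0.7]
    print(select_active(doc_residuals, 0.5))
```

    Only the selected indices would then be rescanned in the next iteration; how the residuals are computed and refreshed is left to the RBP framework the abstract refers to.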

    A Scalable Asynchronous Distributed Algorithm for Topic Modeling

    Learning meaningful topic models from massive document collections that contain millions of documents and billions of tokens is challenging for two reasons. First, one needs to deal with a large number of topics (typically on the order of thousands). Second, one needs a scalable and efficient way of distributing the computation across multiple machines. In this paper we present a novel algorithm, F+Nomad LDA, which tackles both problems simultaneously. To handle a large number of topics we use an appropriately modified Fenwick tree. This data structure allows us to sample from a multinomial distribution over $T$ items in $O(\log T)$ time. Moreover, when topic counts change, the data structure can be updated in $O(\log T)$ time. To distribute the computation across multiple processors we present a novel asynchronous framework inspired by the Nomad algorithm of \cite{YunYuHsietal13}. We show that F+Nomad LDA significantly outperforms the state of the art on massive problems that involve millions of documents, billions of words, and thousands of topics.
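
    To make the $O(\log T)$ claims concrete, here is a minimal sketch of sampling from an unnormalized discrete distribution with a standard Fenwick (binary indexed) tree. It is a textbook structure, not necessarily the modified variant used in F+Nomad LDA; the class name `FenwickSampler` and its method names are assumptions chosen for this example.

```python
import random


class FenwickSampler:
    """Fenwick tree over non-negative weights (illustrative sketch)."""

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)   # 1-based internal array
        self.weights = [0.0] * self.n
        for i, w in enumerate(weights):
            self.update(i, w)

    def update(self, i, new_weight):
        """Set weight of item i; touches O(log T) tree nodes."""
        delta = new_weight - self.weights[i]
        self.weights[i] = new_weight
        j = i + 1
        while j <= self.n:
            self.tree[j] += delta
            j += j & (-j)

    def total(self):
        """Sum of all weights, computed in O(log T)."""
        s, j = 0.0, self.n
        while j > 0:
            s += self.tree[j]
            j -= j & (-j)
        return s

    def find(self, u):
        """Smallest index whose cumulative weight exceeds u (0 <= u < total)."""
        pos = 0
        step = 1 << self.n.bit_length()
        while step > 0:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= u:
                pos = nxt
                u -= self.tree[nxt]
            step >>= 1
        # Clamp guards against floating-point round-off when u is near total.
        return min(pos, self.n - 1)

    def sample(self, rng=random):
        """Draw index i with probability weights[i] / total()."""
        return self.find(rng.random() * self.total())


if __name__ == "__main__":
    topic_counts = FenwickSampler([1, 2, 3, 4])
    topic_counts.update(1, 0)          # a topic count changes
    print(topic_counts.sample())       # index drawn from weights [1, 0, 3, 4]
```

    Both `update` and `find` walk at most about $\log_2 T$ tree nodes, which is what makes the structure attractive when the number of topics is in the thousands and counts change after every sampled token.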

    Sparse Partially Collapsed MCMC for Parallel Inference in Topic Models

    Topic models, and more specifically the class of Latent Dirichlet Allocation (LDA) models, are widely used for probabilistic modeling of text. MCMC sampling from the posterior distribution is typically performed using a collapsed Gibbs sampler. We propose a parallel sparse partially collapsed Gibbs sampler and compare its speed and efficiency to state-of-the-art samplers for topic models on five well-known text corpora of differing sizes and properties. In particular, we propose and compare two different strategies for sampling the parameter block with latent topic indicators. The experiments show that the increase in statistical inefficiency from only partial collapsing is smaller than commonly assumed, and can be more than compensated for by the speedup from parallelization and sparsity on larger corpora. We also prove that the partially collapsed samplers scale well with the size of the corpus. The proposed algorithm is fast, efficient, exact, and can be used in more modeling situations than the ordinary collapsed sampler. Comment: Accepted for publication in Journal of Computational and Graphical Statistics

    Accelerated Parallel Non-conjugate Sampling for Bayesian Non-parametric Models

    Inference of latent feature models in the Bayesian nonparametric setting is generally difficult, especially in high-dimensional settings, because it usually requires proposing features from some prior distribution. In special cases where the integration is tractable, we could sample new feature assignments according to a predictive likelihood. However, this still may not be efficient in high dimensions. We present a novel method to accelerate the mixing of latent variable model inference by proposing feature locations from the data, as opposed to the prior. First, we introduce our accelerated feature proposal mechanism, which we show is a valid Bayesian inference algorithm. Next, we propose an approximate inference strategy to perform accelerated inference in parallel. This sampling method is efficient for proper mixing of the Markov chain Monte Carlo sampler, computationally attractive, and theoretically guaranteed to converge to the posterior distribution as its limiting distribution. Comment: Previously known as "Accelerated Inference for Latent Variable Models"