15,924 research outputs found

    Latent Dirichlet Allocation (LDA)


    Sparse Stochastic Inference for Latent Dirichlet Allocation

    Full text link
    We present a hybrid algorithm for Bayesian topic models that combines the efficiency of sparse Gibbs sampling with the scalability of online stochastic inference. We used our algorithm to analyze a corpus of 1.2 million books (33 billion words) with thousands of topics. Our approach reduces the bias of variational inference and generalizes to many Bayesian hidden-variable models. Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
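    The scalability mechanism this abstract refers to is a stochastic (online) update of the global topic-word parameters from minibatches of documents. The sketch below shows only that update rule, assuming an online variational setup; the sizes, step-size schedule, and placeholder minibatch statistics are illustrative assumptions, not the authors' implementation (which additionally uses sparse Gibbs sampling for the per-document step).

```python
# Minimal sketch of a stochastic natural-gradient update for LDA's global
# topic-word parameters (lambda), assuming an online variational setup.
# All sizes and the fake minibatch statistics below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size = 100, 10_000           # assumed sizes
eta = 0.01                                   # symmetric Dirichlet prior (assumed)
n_total, batch_size = 1_200_000, 256         # corpus size from the abstract; batch size assumed
lam = rng.gamma(100.0, 0.01, (n_topics, vocab_size))  # global variational parameters

def stochastic_update(lam, minibatch_stats, t, tau0=1.0, kappa=0.7):
    """Blend a rescaled minibatch estimate into lambda with a decaying step size."""
    rho = (tau0 + t) ** (-kappa)                          # Robbins-Monro schedule
    lam_hat = eta + (n_total / batch_size) * minibatch_stats
    return (1.0 - rho) * lam + rho * lam_hat

for t in range(3):
    # In the real algorithm these sufficient statistics come from per-document
    # inference (sparse Gibbs sampling in the paper); here they are placeholders.
    stats = rng.random((n_topics, vocab_size)) * 1e-3
    lam = stochastic_update(lam, stats, t)

print(lam.shape)  # (100, 10000)
```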

    A Spectral Algorithm for Latent Dirichlet Allocation

    Full text link
    The problem of topic modeling can be seen as a generalization of the clustering problem, in that it posits that observations are generated due to multiple latent factors (e.g., the words in each document are generated as a mixture of several active topics, as opposed to just one). This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic probability vectors (the distributions over words for each topic), when only the words are observed and the corresponding topics are hidden. We provide a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of mixture models, including the popular latent Dirichlet allocation (LDA) model. For LDA, the procedure correctly recovers both the topic probability vectors and the prior over the topics, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, termed Excess Correlation Analysis (ECA), is based on a spectral decomposition of low order moments (third and fourth order) via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVD operations are carried out on k × k matrices, where k is the number of latent factors (e.g., the number of topics), rather than in the d-dimensional observed space (typically d ≫ k). Comment: Changed title to match conference version, which appears in Advances in Neural Information Processing Systems 25, 2012.
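    The claim that the SVDs act on k × k rather than d × d objects comes from a whitening step: a rank-k decomposition of the second-order word co-occurrence moment yields a d × k map that projects all later computations into k dimensions. The sketch below illustrates only that step on synthetic data; the matrix construction and sizes are assumptions for demonstration, not the paper's ECA implementation.

```python
# Illustrative sketch of the whitening step behind the k x k scalability claim.
# The synthetic low-rank "co-occurrence" matrix stands in for the second-order
# word moment; everything here is an assumption for demonstration.
import numpy as np

rng = np.random.default_rng(0)
d, k = 2000, 20                      # vocabulary size and number of topics (assumed)

A = rng.random((d, k))               # synthetic rank-k structure
M2 = A @ A.T / k                     # PSD surrogate for the pairwise word moment

# Rank-k SVD of the symmetric moment gives a d x k whitening map W ...
U, S, _ = np.linalg.svd(M2, hermitian=True)
W = U[:, :k] / np.sqrt(S[:k])        # columns scaled by 1 / sqrt(singular value)

# ... after which the whitened second moment is a k x k (near-identity) matrix,
# so higher-order decompositions can operate in k dimensions instead of d.
M2_white = W.T @ M2 @ W
print(M2_white.shape)                               # (20, 20)
print(np.allclose(M2_white, np.eye(k), atol=1e-6))  # True, up to numerical error
```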

    Comparison of Latent Dirichlet Modeling and Factor Analysis for Topic Extraction: A Lesson of History

    Get PDF
    Topic modeling is often perceived as a relatively new development in the information retrieval sciences, and newer methods such as Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation have generated a great deal of research. However, attempts to extract topics from unstructured text using Factor Analysis techniques can be found as early as the 1960s. This paper compares the perceived coherence of topics extracted from three different datasets using Factor Analysis and Latent Dirichlet Allocation. To perform such a comparison, a new extrinsic evaluation method is proposed. Results suggest that Factor Analysis can produce topics perceived by human coders as more coherent than those from Latent Dirichlet Allocation, and warrant revisiting a topic extraction method developed more than fifty-five years ago yet since forgotten.
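    To reproduce the flavor of this comparison, the sketch below fits both Factor Analysis and LDA to the same term-document counts with scikit-learn and lists top words per component. The toy corpus, component count, and top-word cutoff are assumptions; the paper's extrinsic, human-coder coherence evaluation is not implemented here.

```python
# Hedged sketch: extract "topics" from the same term-document matrix with
# Factor Analysis and with LDA, then list the top-loading words per component.
from sklearn.decomposition import FactorAnalysis, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "topic models describe documents as mixtures of topics",
    "factor analysis decomposes a term document matrix",
    "latent dirichlet allocation assigns words to latent topics",
    "early information retrieval work used factor analytic methods",
]  # toy corpus (assumed), not the paper's datasets

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
terms = vec.get_feature_names_out()

def top_words(components, n=5):
    """Return the n highest-loading terms for each component/topic."""
    return [[terms[i] for i in row.argsort()[::-1][:n]] for row in components]

fa = FactorAnalysis(n_components=2, random_state=0).fit(X.toarray())
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

print("FA topics: ", top_words(fa.components_))
print("LDA topics:", top_words(lda.components_))
```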