15,924 research outputs found

    Latent Dirichlet Allocation (LDA)


    Sparse Stochastic Inference for Latent Dirichlet Allocation

    Full text link
    We present a hybrid algorithm for Bayesian topic models that combines the efficiency of sparse Gibbs sampling with the scalability of online stochastic inference. We used our algorithm to analyze a corpus of 1.2 million books (33 billion words) with thousands of topics. Our approach reduces the bias of variational inference and generalizes to many Bayesian hidden-variable models. Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
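    The scalability mechanism this abstract refers to is a stochastic (online) update of the global topic-word parameters from minibatches of documents. The sketch below shows only that update rule, assuming an online variational setup; the sizes, step-size schedule, and placeholder minibatch statistics are illustrative assumptions, not the authors' implementation (which additionally uses sparse Gibbs sampling for the per-document step).

```python
# Minimal sketch of a stochastic natural-gradient update for LDA's global
# topic-word parameters (lambda), assuming an online variational setup.
# All sizes and the fake minibatch statistics below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size = 100, 10_000           # assumed sizes
eta = 0.01                                   # symmetric Dirichlet prior (assumed)
n_total, batch_size = 1_200_000, 256         # corpus size from the abstract; batch size assumed
lam = rng.gamma(100.0, 0.01, (n_topics, vocab_size))  # global variational parameters

def stochastic_update(lam, minibatch_stats, t, tau0=1.0, kappa=0.7):
    """Blend a rescaled minibatch estimate into lambda with a decaying step size."""
    rho = (tau0 + t) ** (-kappa)                          # Robbins-Monro schedule
    lam_hat = eta + (n_total / batch_size) * minibatch_stats
    return (1.0 - rho) * lam + rho * lam_hat

for t in range(3):
    # In the real algorithm these sufficient statistics come from per-document
    # inference (sparse Gibbs sampling in the paper); here they are placeholders.
    stats = rng.random((n_topics, vocab_size)) * 1e-3
    lam = stochastic_update(lam, stats, t)

print(lam.shape)  # (100, 10000)
```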

    A Spectral Algorithm for Latent Dirichlet Allocation

    Full text link
    The problem of topic modeling can be seen as a generalization of the clustering problem, in that it posits that observations are generated due to multiple latent factors (e.g., the words in each document are generated as a mixture of several active topics, as opposed to just one). This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic probability vectors (the distributions over words for each topic), when only the words are observed and the corresponding topics are hidden. We provide a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of mixture models, including the popular latent Dirichlet allocation (LDA) model. For LDA, the procedure correctly recovers both the topic probability vectors and the prior over the topics, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, termed Excess Correlation Analysis (ECA), is based on a spectral decomposition of low order moments (third and fourth order) via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVD operations are carried out on k × k matrices, where k is the number of latent factors (e.g., the number of topics), rather than in the d-dimensional observed space (typically d ≫ k). Comment: Changed title to match conference version, which appears in Advances in Neural Information Processing Systems 25, 2012.
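    The claim that the SVDs act on k × k rather than d × d objects comes from a whitening step: a rank-k decomposition of the second-order word co-occurrence moment yields a d × k map that projects all later computations into k dimensions. The sketch below illustrates only that step on synthetic data; the matrix construction and sizes are assumptions for demonstration, not the paper's ECA implementation.

```python
# Illustrative sketch of the whitening step behind the k x k scalability claim.
# The synthetic low-rank "co-occurrence" matrix stands in for the second-order
# word moment; everything here is an assumption for demonstration.
import numpy as np

rng = np.random.default_rng(0)
d, k = 2000, 20                      # vocabulary size and number of topics (assumed)

A = rng.random((d, k))               # synthetic rank-k structure
M2 = A @ A.T / k                     # PSD surrogate for the pairwise word moment

# Rank-k SVD of the symmetric moment gives a d x k whitening map W ...
U, S, _ = np.linalg.svd(M2, hermitian=True)
W = U[:, :k] / np.sqrt(S[:k])        # columns scaled by 1 / sqrt(singular value)

# ... after which the whitened second moment is a k x k (near-identity) matrix,
# so higher-order decompositions can operate in k dimensions instead of d.
M2_white = W.T @ M2 @ W
print(M2_white.shape)                               # (20, 20)
print(np.allclose(M2_white, np.eye(k), atol=1e-6))  # True, up to numerical error
```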

    Comparison of Latent Dirichlet Modeling and Factor Analysis for Topic Extraction: A Lesson of History

    Get PDF
    Topic modeling is often perceived as a relatively new development in the information retrieval sciences, and newer methods such as Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation have generated a great deal of research. However, attempts to extract topics from unstructured text using Factor Analysis techniques can be found as early as the 1960s. This paper compares the perceived coherence of topics extracted from three different datasets using Factor Analysis and Latent Dirichlet Allocation. To perform such a comparison, a new extrinsic evaluation method is proposed. Results suggest that Factor Analysis can produce topics perceived by human coders as more coherent than those from Latent Dirichlet Allocation, and warrant revisiting a topic extraction method developed more than fifty-five years ago yet since forgotten.
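    To reproduce the flavor of this comparison, the sketch below fits both Factor Analysis and LDA to the same term-document counts with scikit-learn and lists top words per component. The toy corpus, component count, and top-word cutoff are assumptions; the paper's extrinsic, human-coder coherence evaluation is not implemented here.

```python
# Hedged sketch: extract "topics" from the same term-document matrix with
# Factor Analysis and with LDA, then list the top-loading words per component.
from sklearn.decomposition import FactorAnalysis, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "topic models describe documents as mixtures of topics",
    "factor analysis decomposes a term document matrix",
    "latent dirichlet allocation assigns words to latent topics",
    "early information retrieval work used factor analytic methods",
]  # toy corpus (assumed), not the paper's datasets

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
terms = vec.get_feature_names_out()

def top_words(components, n=5):
    """Return the n highest-loading terms for each component/topic."""
    return [[terms[i] for i in row.argsort()[::-1][:n]] for row in components]

fa = FactorAnalysis(n_components=2, random_state=0).fit(X.toarray())
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

print("FA topics: ", top_words(fa.components_))
print("LDA topics:", top_words(lda.components_))
```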