78,827 research outputs found
How Many Topics? Stability Analysis for Topic Models
Topic modeling refers to the task of discovering the underlying thematic
structure in a text corpus, where the output is commonly presented as a report
of the top terms appearing in each topic. Despite the diversity of topic
modeling algorithms that have been proposed, a common challenge in successfully
applying these techniques is the selection of an appropriate number of topics
for a given corpus. Choosing too few topics will produce results that are
overly broad, while choosing too many will result in the "over-clustering" of a
corpus into many small, highly-similar topics. In this paper, we propose a
term-centric stability analysis strategy to address this issue, the idea being
that a model with an appropriate number of topics will be more robust to
perturbations in the data. Using a topic modeling approach based on matrix
factorization, evaluations performed on a range of corpora show that this
strategy can successfully guide the model selection process.Comment: Improve readability of plots. Add minor clarification
- …