Short Text Topic Modeling Techniques, Applications, and Performance: A Survey
Analyzing short texts to infer discriminative and coherent latent topics
is a critical and fundamental task, since many real-world applications require
semantic understanding of short texts. Traditional long text topic modeling
algorithms (e.g., PLSA and LDA) based on word co-occurrences cannot solve this
problem very well since only very limited word co-occurrence information is
available in short texts. Therefore, short text topic modeling, which aims to
overcome the problem of sparseness in short texts, has attracted much attention
from the machine learning research community in recent years. In
this survey, we conduct a comprehensive review of various short text topic
modeling techniques proposed in the literature. We present three categories of
methods based on Dirichlet multinomial mixture, global word co-occurrences, and
self-aggregation, with examples of representative approaches in each category
and an analysis of their performance on various tasks. We develop the first
comprehensive open-source Java library, called STTM, that integrates all
surveyed algorithms within a unified interface, together with benchmark
datasets, to facilitate the development of new methods in this research field.
Finally, we
evaluate these state-of-the-art methods on many real-world datasets and compare
their performance against one another and against long text topic modeling
algorithms.
Comment: arXiv admin note: text overlap with arXiv:1808.02215 by other author