Ranking coherence in Topic Models using Statistically Validated Networks
Probabilistic topic models have become one of the most widespread
machine learning techniques in textual analysis. Topic discovery is
an unsupervised process that does not guarantee the interpretability
of its output. Hence, the automatic evaluation of topic coherence
has attracted the interest of many researchers over the last decade,
and it is an open research area. The present article offers a new
quality evaluation method based on Statistically Validated Networks
(SVNs). The proposed probabilistic approach consists of representing
each topic as a weighted network of its most probable words. The
presence of a link between each pair of words is assessed by
statistically validating their co-occurrence in sentences against the null
hypothesis of random co-occurrence. The proposed method allows one
to distinguish between high-quality and low-quality topics, by making
use of a battery of statistical tests. The statistically significant pairwise
associations of words represented by the links in the SVN might
reasonably be expected to be strictly related to the semantic coherence
and interpretability of a topic. Therefore, the more connected the
network, the more coherent the topic in question. We demonstrate the
effectiveness of the method through an analysis of a real text corpus,
which shows that the proposed measure is more correlated with human
judgement than state-of-the-art coherence measures.
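
The link-validation step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract mentions a battery of statistical tests and a weighted network, whereas this sketch assumes a single one-sided hypergeometric test per word pair, a Bonferroni correction, and unweighted links. The function names (`hypergeom_pvalue`, `validated_links`, `link_density`) and the `alpha` parameter are hypothetical.

```python
from itertools import combinations
from math import comb


def hypergeom_pvalue(n_total, n_a, n_b, n_ab):
    """P(X >= n_ab) under random co-occurrence (hypergeometric null).

    n_total: number of sentences; n_a, n_b: sentences containing each word;
    n_ab: sentences containing both words.
    """
    upper = min(n_a, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_ab, upper + 1)
    )
    return tail / comb(n_total, n_b)


def validated_links(sentences, topic_words, alpha=0.01):
    """Return the word pairs whose co-occurrence rejects the null hypothesis.

    Assumption: Bonferroni correction over all pairs of topic words.
    """
    sents = [set(s) for s in sentences]
    pairs = list(combinations(topic_words, 2))
    threshold = alpha / len(pairs)
    links = []
    for a, b in pairs:
        n_a = sum(a in s for s in sents)
        n_b = sum(b in s for s in sents)
        n_ab = sum(a in s and b in s for s in sents)
        # Pairs that never co-occur cannot be validated.
        if n_ab > 0 and hypergeom_pvalue(len(sents), n_a, n_b, n_ab) < threshold:
            links.append((a, b))
    return links


def link_density(sentences, topic_words, alpha=0.01):
    """Fraction of possible word pairs that are validated links."""
    n_pairs = len(topic_words) * (len(topic_words) - 1) // 2
    return len(validated_links(sentences, topic_words, alpha)) / n_pairs


# Toy corpus: "cat" and "dog" always appear together, "car" never joins them.
corpus = [["cat", "dog"]] * 10 + [["car"]] * 10
print(validated_links(corpus, ["cat", "dog", "car"]))
print(link_density(corpus, ["cat", "dog", "car"]))
```

Under these assumptions, a topic whose top words form a denser validated network would be ranked as more coherent, in line with the abstract's claim that connectivity tracks coherence.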
Combining Thesaurus Knowledge and Probabilistic Topic Models
In this paper we present an approach for introducing thesaurus knowledge into
probabilistic topic models. The main idea of the approach is based on the
assumption that the frequencies of semantically related words and phrases
that occur in the same texts should be enhanced: this increases their
contribution to the topics found in these texts. We have conducted
experiments with several thesauri and found that for improving topic models, it
is useful to utilize domain-specific knowledge. If a general thesaurus, such as
WordNet, is used, the thesaurus-based improvement of topic models can be
achieved by excluding hyponymy relations in combined topic models.
Comment: Accepted to AIST-2017 conference (http://aistconf.ru/). The final
publication will be available at link.springer.co
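
The frequency-enhancement idea in this abstract can be illustrated with a small sketch. The abstract does not give the exact formula, so everything here is an assumption: the function name `boost_counts`, the `related` mapping (a stand-in for thesaurus relations), and the additive rule with a `weight` parameter are all hypothetical.

```python
from collections import Counter


def boost_counts(doc_tokens, related, weight=1.0):
    """Raise each word's count by the counts of related words in the same text.

    related: hypothetical mapping from a word to its thesaurus-related words.
    weight: hypothetical scaling factor for the added frequency.
    """
    counts = Counter(doc_tokens)
    boosted = {}
    for word, count in counts.items():
        # Only related words that actually occur in this document contribute.
        extra = sum(counts[r] for r in related.get(word, ()) if r in counts)
        boosted[word] = count + weight * extra
    return boosted


doc = ["car", "car", "automobile", "road"]
related = {"car": {"automobile"}, "automobile": {"car"}}
print(boost_counts(doc, related))
# {'car': 3.0, 'automobile': 3.0, 'road': 1.0}
```

The boosted counts would then replace the raw counts as input to the topic model, so that related words reinforce each other within a topic.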