Ranking coherence in Topic Models using Statistically Validated Networks
Probabilistic topic models have become one of the most widespread
machine learning techniques in textual analysis. Topic discovery is
an unsupervised process that does not guarantee the interpretability
of its output. Hence, the automatic evaluation of topic coherence
has attracted the interest of many researchers over the last decade,
and it is an open research area. The present article offers a new
quality evaluation method based on Statistically Validated Networks
(SVNs). The proposed probabilistic approach consists of representing
each topic as a weighted network of its most probable words. The
presence of a link between each pair of words is assessed by
statistically validating their co-occurrence in sentences against the null
hypothesis of random co-occurrence. The proposed method distinguishes
high-quality from low-quality topics by means of a battery of
statistical tests. The statistically significant pairwise
associations of words represented by the links in the SVN might
reasonably be expected to be strictly related to the semantic coherence
and interpretability of a topic. Therefore, the more connected the
network, the more coherent the topic in question. We demonstrate the
effectiveness of the method through an analysis of a real text corpus,
which shows that the proposed measure is more correlated with human
judgement than the state-of-the-art coherence measures.
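
The link-validation idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a hypergeometric null model for sentence-level co-occurrence, a Bonferroni correction across word pairs, and link density as a stand-in connectivity score. The abstract mentions only "a battery of statistical tests", so all three choices, as well as the function names `hypergeom_tail`, `validated_links` and `link_density`, are assumptions made for illustration.

```python
from itertools import combinations
from math import comb


def hypergeom_tail(n_total, n_a, n_b, n_ab):
    """P(X >= n_ab) under a hypergeometric null: n_b sentences are drawn
    at random from n_total, of which n_a contain the first word."""
    denom = comb(n_total, n_b)
    num = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_ab, min(n_a, n_b) + 1)
    )
    return num / denom


def validated_links(sentences, topic_words, alpha=0.05):
    """Return (word_a, word_b, p_value) for each pair whose sentence-level
    co-occurrence is significant after a Bonferroni correction (assumed)."""
    sents = [set(s) for s in sentences]
    n_total = len(sents)
    counts = {w: sum(w in s for s in sents) for w in topic_words}
    pairs = list(combinations(topic_words, 2))
    if not pairs:
        return []
    threshold = alpha / len(pairs)  # Bonferroni: one test per word pair
    links = []
    for a, b in pairs:
        n_ab = sum(a in s and b in s for s in sents)
        if n_ab == 0:
            continue  # no co-occurrence, so the link cannot be validated
        p = hypergeom_tail(n_total, counts[a], counts[b], n_ab)
        if p < threshold:
            links.append((a, b, p))
    return links


def link_density(links, topic_words):
    """Fraction of possible word pairs that are validated links
    (an assumed proxy for how connected the topic network is)."""
    n = len(topic_words)
    return len(links) / (n * (n - 1) / 2) if n > 1 else 0.0


if __name__ == "__main__":
    # Toy corpus: "cat" and "dog" always appear together, "tax" never with them.
    corpus = [["cat", "dog"]] * 10 + [["tax"]] * 30
    topic = ["cat", "dog", "tax"]
    links = validated_links(corpus, topic)
    print(links, link_density(links, topic))
```

Under these assumptions, a topic whose top words frequently share sentences yields a denser validated network than a topic whose words rarely meet, which is the intuition behind ranking topics by connectivity.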