Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity

A Solow; C Rao; CD Manning; DD Lewis; DM Blei; DQ Nguyen; H Azarbonyad; H Soleimani; M Dehghani

Publication date: 1 January 2017

Abstract

A high degree of topical diversity is often considered to be an important characteristic of interesting text documents. A recent proposal for measuring topical diversity identifies three elements for assessing diversity: words, topics, and documents as collections of words. Topic models play a central role in this approach. However, using standard topic models to measure the diversity of documents is suboptimal because of two problems: generality and impurity. General topics only include common information from a background corpus and are assigned to most of the documents in the collection. Impure topics contain words that are not related to the topic; impurity lowers the interpretability of topic models, and impure topics are likely to be assigned to documents erroneously. We propose a hierarchical re-estimation approach for topic models to combat generality and impurity; the proposed approach operates at three levels: words, topics, and documents. Our re-estimation approach for measuring documents' topical diversity outperforms the state of the art on the PubMed dataset, which is commonly used for diversity experiments.

Comment: Proceedings of the 39th European Conference on Information Retrieval (ECIR2017)
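To make the three elements in the abstract concrete (words, topics, and documents), here is a minimal sketch of one way a document's topical diversity could be scored from a topic model's output. The abstract does not state the formula, so this is an assumption: the sketch uses Rao's quadratic entropy, with cosine distance between topic-word vectors as the topic dissimilarity. The function names `cosine_distance` and `rao_diversity` and the toy inputs are hypothetical, and the sketch does not implement the re-estimation steps the paper proposes.

```python
from math import sqrt


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length word-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: treat an all-zero topic as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def rao_diversity(doc_topics, topic_words):
    """Rao's quadratic entropy: sum_i sum_j p_i * p_j * distance(topic_i, topic_j).

    doc_topics  -- P(topic | document), one probability per topic
    topic_words -- P(word | topic), one word-weight vector per topic
    """
    k = len(doc_topics)
    total = 0.0
    for i in range(k):
        for j in range(k):
            distance = cosine_distance(topic_words[i], topic_words[j])
            total += doc_topics[i] * doc_topics[j] * distance
    return total


# Toy example: two topics over a two-word vocabulary with no shared words.
topics = [[1.0, 0.0], [0.0, 1.0]]
focused = rao_diversity([1.0, 0.0], topics)  # all mass on one topic
mixed = rao_diversity([0.5, 0.5], topics)    # mass split across both topics
```

Under this assumed measure, a document concentrated on a single topic scores 0, and the score grows as probability mass spreads over dissimilar topics. That is why a general topic assigned to most documents, or an impure topic assigned erroneously, would distort the score.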