A high degree of topical diversity is often considered to be an important
characteristic of interesting text documents. A recent proposal for measuring
topical diversity identifies three elements for assessing diversity: words,
topics, and documents as collections of words. Topic models play a central role
in this approach. Using standard topic models for measuring diversity of
documents is suboptimal due to generality and impurity. General topics only
include common information from a background corpus and are assigned to most of
the documents in the collection. Impure topics contain words that are not
related to the topic; impurity lowers the interpretability of topic models and
impure topics are likely to get assigned to documents erroneously. We propose a
hierarchical re-estimation approach for topic models to combat generality and
impurity; the proposed approach operates at three levels: words, topics, and
documents. Our re-estimation approach for measuring documents' topical
diversity outperforms the state of the art on PubMed dataset which is commonly
used for diversity experiments.Comment: Proceedings of the 39th European Conference on Information Retrieval
(ECIR2017