thesis

Automatic aspect extraction in information retrieval diversity

Abstract

In this master's thesis we describe a new automatic aspect extraction algorithm that incorporates relevance information into the dynamics of Probabilistic Latent Semantic Analysis (pLSA). A utility-biased likelihood statistical framework is described to formalize the intrinsic incorporation of prior relevance information into the dynamics of the algorithm. Moreover, a general abstract algorithm is presented to incorporate arbitrary new feature variables into the analysis. A tempering procedure is derived for this general algorithm as an entropic regularization of the utility-biased likelihood functional, and a geometric interpretation of the algorithm is described, showing the intrinsic changes in the information space of the problem that are produced when different sources of prior utility estimations are provided over the same data. The general algorithm is applied to several information retrieval, recommendation and personalization tasks. Moreover, a set of post-processing aspect filters is presented. Some characteristics of the aspect distributions, such as sparsity or low entropy, are identified as enhancing the overall diversity attained by the diversification algorithm. The proposed filters ensure that the final aspect space has those properties, thus leading to better diversity levels. An experimental setup over TREC Web Track 09-12 data shows that the algorithm surpasses classic pLSA as an aspect extraction tool for search diversification. Additional theoretical applications of the general procedure to information retrieval, recommendation and personalization tasks are given, leading to new relevance-aware models that incorporate several variables into the latent semantic analysis. Finally, the problem of optimizing the aspect space size for diversification is addressed. Analytical formulas for the dependency of diversity metrics on the choice of an automatically extracted aspect space are given under a simplified generative model for the relation between system aspects and true evaluation aspects. An experimental analysis of this dependence is performed over TREC Web Track data using pLSA as the aspect extraction algorithm.
