In this master's thesis we describe a new automatic aspect extraction algorithm that
incorporates relevance information into the dynamics of Probabilistic Latent
Semantic Analysis (pLSA). A utility-biased likelihood statistical framework is described
to formalize the intrinsic incorporation of prior relevance information into the
dynamics of the algorithm. Moreover, a general abstract algorithm is presented to
incorporate arbitrary new feature variables into the analysis.
A tempering procedure is derived for this general algorithm as an entropic regularization
of the utility-biased likelihood functional, and a geometric interpretation of
the algorithm is described, showing the intrinsic changes in the information space of
the problem produced when different sources of prior utility estimates are provided
over the same data.
The general algorithm is applied to several information retrieval, recommendation
and personalization tasks. Moreover, a set of post-processing aspect filters is
presented. Some characteristics of the aspect distributions, such as sparsity or low
entropy, are identified as enhancing the overall diversity attained by the diversification
algorithm. The proposed filters ensure that the final aspect space has those properties,
thus leading to better diversity levels.
An experimental setup over TREC Web Track 2009-2012 data shows that the algorithm
surpasses classic pLSA as an aspect extraction tool for search diversification.
Additional theoretical applications of the general procedure to information retrieval,
recommendation and personalization tasks are given, leading to new relevance-aware
models incorporating several variables into the latent semantic analysis.
Finally, the problem of optimizing the aspect space size for diversification is
addressed. Analytical formulas for the dependency of diversity metrics on the choice
of an automatically extracted aspect space are given under a simplified generative
model of the relation between system aspects and the true aspects used in evaluation.
An experimental analysis of this dependence is performed over TREC Web Track
data using pLSA as the aspect extraction algorithm.