We consider the problem of creating document representations in which
inter-document similarity measurements correspond to semantic similarity. We
first present a novel subspace-based framework for formalizing this task. Using
this framework, we derive a new analysis of Latent Semantic Indexing (LSI),
showing a precise relationship between its performance and the uniformity of
the underlying distribution of documents over topics. This analysis helps
explain the improvements gained by Ando's (2000) Iterative Residual Rescaling
(IRR) algorithm: IRR can compensate for distributional non-uniformity. A
further benefit of our framework is that it provides a well-motivated,
effective method for automatically determining the rescaling factor IRR depends
on, leading to further improvements. A series of experiments over various
settings and with several evaluation metrics validates our claims.

Comment: To appear in the proceedings of SIGIR 2001. 11 pages.