32 research outputs found
A Framework for Understanding Latent Semantic Indexing (LSI) Performance
In this paper we present a theoretical model for understanding the performance of Latent Semantic Indexing (LSI) search and retrieval applications. Many models for understanding LSI have been proposed. Ours is the first to study the values produced by LSI in the term dimension vectors. The framework presented here is based on term co-occurrence data. We show a strong correlation between second-order term co-occurrence and the values produced by the Singular Value Decomposition (SVD) algorithm that forms the foundation for LSI. We also present a mathematical proof that the SVD algorithm encapsulates term co-occurrence information.
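The relationship between second-order co-occurrence and the SVD values can be illustrated with a toy example. The sketch below is not taken from the paper: the five-term vocabulary, the three-document matrix `A`, and the helper names `first_order`, `second_order`, and `lsi_term_similarity` are all hypothetical, and "second-order co-occurrence" is assumed here to mean that two terms never share a document but each shares a document with some common third term. Under those assumptions, "car" and "motor" never co-occur directly, yet a rank-2 truncated SVD assigns them a positive term-term similarity, while "car" and "apple" (no co-occurrence path at all) stay at zero.

```python
import numpy as np

# Hypothetical toy corpus: rows are terms, columns are documents.
terms = ["car", "engine", "motor", "apple", "fruit"]
A = np.array([
    [1, 0, 0],  # car    appears in doc 0
    [1, 1, 0],  # engine appears in docs 0 and 1
    [0, 1, 0],  # motor  appears in doc 1
    [0, 0, 1],  # apple  appears in doc 2
    [0, 0, 1],  # fruit  appears in doc 2
], dtype=float)


def first_order(A):
    """Boolean matrix: True where two distinct terms share a document."""
    c = (A @ A.T) > 0
    np.fill_diagonal(c, False)
    return c


def second_order(A):
    """Boolean matrix: True where two distinct terms do not share a
    document but both co-occur with at least one common term."""
    first = first_order(A)
    paths = (first.astype(int) @ first.astype(int)) > 0
    result = paths & ~first
    np.fill_diagonal(result, False)
    return result


def lsi_term_similarity(A, k):
    """Term-term similarity from a rank-k truncated SVD of A."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    T = U[:, :k] * s[:k]  # scaled term vectors in the reduced space
    return T @ T.T


car, motor, apple = (terms.index(t) for t in ("car", "motor", "apple"))
raw = A @ A.T
sim = lsi_term_similarity(A, k=2)

print("second-order pair (car, motor):", second_order(A)[car, motor])
print("raw co-occurrence (car, motor):", raw[car, motor])
print("rank-2 LSI value  (car, motor):", round(float(sim[car, motor]), 3))
print("rank-2 LSI value  (car, apple):", round(float(sim[car, apple]), 3))
```

The choice of `k=2` is arbitrary for this toy matrix. It discards the smallest singular value, which is the component that keeps "car" and "motor" apart in the full-rank reconstruction.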
Improving Search and Retrieval Performance through Shortening Documents and Pruning Out Jargon
This paper describes our participation in the TREC Legal experiments in 2007. We have applied novel normalization techniques that are designed to slightly favor longer documents instead of assuming that all documents should have equal weight. We have also developed a new method for reformulating query text when background information is provided with an information request. In addition, we have experimented with using enhanced OCR error detection to reduce the size of the term list and remove noise in the textual data. In this article, we discuss the impact of these techniques on the TREC 2007 data sets.