
The Use of Latent Semantic Indexing to Cluster Documents into Their Subject Areas

Abstract

Keyword-matching information retrieval systems are plagued by noise in the document collection arising from synonymy and polysemy. This noise tends to hide the latent structure of the documents, reducing the accuracy of information retrieval systems and making it difficult for clustering algorithms to pick up on shared concepts and effectively cluster similar documents. Latent Semantic Analysis (LSA), through its use of Singular Value Decomposition, reduces the dimension of the document space, mapping it onto a smaller concept space devoid of this noise and making it easier to group similar documents together. This work is an exploratory report on the use of LSA to cluster a small dataset of documents according to their topic areas, to see how LSA fares in comparison to clustering with a clustering package, without LS

Similar works