
    Dynamic Document Clustering Using Singular Value Decomposition

    Document clustering is a widely researched area of data mining: the grouping of similar documents according to a similarity measure. It plays an important role in information retrieval, improving precision and recall in search applications and supporting the navigation and presentation of search results. Because of its very large number of features, however, textual data suffers from the curse of dimensionality, and adding new features increases the noise in the data. To address these issues, this thesis investigates the use of Singular Value Decomposition (SVD) and proposes a document clustering algorithm that combines the folding-in method with the k-means algorithm to store textual data efficiently and to incorporate new data dynamically into existing cluster formations. We test our approach by introducing new documents in increments of 1%, 5%, 10%, 15%, and 20%. The new documents are added in two variations: one set consists of completely new documents, and the other is formed by modifying existing documents. Our method promises significant improvements in computation cost, storage cost, and cluster quality compared with the recomputing-SVD method. We also present a novel approach for retrieving documents of interest to users, who can select documents using different window sizes, either time windows or subsets of documents. Our experimental evaluations show that the proposed retrieval method significantly outperforms the recomputing-SVD method in computation time, with the promise of flexibility and good cluster quality.
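
The abstract does not give the details of the algorithm, so the following is only a minimal sketch of the general idea it names: compute a truncated SVD once, cluster the documents in the reduced space with k-means, and then fold new documents into that space instead of recomputing the SVD. It assumes NumPy is available. The toy term-by-document matrix, the function names (`truncated_svd`, `fold_in`, `kmeans`, `nearest`), and the fixed initial centroids are all illustrative assumptions, not the thesis's implementation.

```python
import numpy as np

def truncated_svd(A, k):
    """Rank-k SVD of a term-by-document matrix A (terms x docs)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # U_k: term basis, s_k: singular values, V_k: one row of coordinates per document
    return U[:, :k], s[:k], Vt[:k, :].T

def fold_in(U_k, s_k, D_new):
    """Project new document columns into the existing space: d_hat = d^T U_k S_k^{-1}."""
    return (D_new.T @ U_k) / s_k

def nearest(X, centroids):
    """Index of the closest centroid (squared Euclidean distance) for each row of X."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, init_idx, iters=10):
    """Plain Lloyd iterations, seeded from the rows of X listed in init_idx."""
    centroids = X[list(init_idx)].copy()
    for _ in range(iters):
        labels = nearest(X, centroids)
        for c in range(len(centroids)):
            members = X[labels == c]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids, nearest(X, centroids)

# Hypothetical toy corpus: 4 terms x 4 documents, two obvious topics.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

U_k, s_k, V_k = truncated_svd(A, k=2)
centroids, labels = kmeans(V_k, init_idx=(0, 2))

# Two new documents (columns), folded in without recomputing the SVD.
D_new = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
], dtype=float)
new_coords = fold_in(U_k, s_k, D_new)
new_labels = nearest(new_coords, centroids)
```

Folding in costs only a matrix product per batch of new documents, which is where the savings over recomputing the SVD would come from. The trade-off is that the term basis `U_k` is never updated, so the representation can drift as more documents are folded in; how the thesis handles this is not stated in the abstract.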