EXPLOITING N-GRAM IMPORTANCE AND ADDITIONAL KNOWLEDGE BASED ON WIKIPEDIA FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING

By Mr. Niraj Kumar, Mr. Venkata Vinay Babu Vemula, Dr. Kannan Srinathan and Dr. Vasudeva Varma

Abstract

This paper addresses the question: “How can we use Wikipedia-based concepts in document clustering with less human involvement while still achieving effective improvements in results?” We propose a method that exploits the importance of N-grams in a document and uses Wikipedia-based additional knowledge for GAAC-based document clustering. The importance of an N-gram in a document depends on several features including, but not limited to, its frequency, the position of its occurrence in a sentence, and the position of that sentence in the document. First, we introduce a new similarity measure that takes weighted N-gram importance into account when computing similarity during document clustering. As a result, the chances of topical similarity within clusters are improved. Second, we use Wikipedia as an additional knowledge base, both to remove noisy entries from the extracted N-grams and to reduce the information gap between conceptually related N-grams that do not match owing to differences in writing scheme or strategy. Our experimental results on a publicly available text dataset show that the devised system achieves a significant improvement in performance over state-of-the-art bag-of-words-based systems in this area.
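The two ideas named in the abstract, position-aware N-gram weighting and GAAC (group-average agglomerative clustering), can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything specific here is an assumption: the 1/(1 + position) decay, the weighted-Jaccard similarity, and the function names `ngram_weights`, `similarity`, and `gaac` are all hypothetical stand-ins, and the Wikipedia-based filtering step is omitted entirely.

```python
from collections import defaultdict
from itertools import combinations
import re


def ngram_weights(text, max_n=2):
    """Assumed importance score (not the paper's formula): every occurrence of
    an N-gram adds weight, and the contribution decays with its position in
    the sentence and with the position of that sentence in the document."""
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    weights = defaultdict(float)
    for s_idx, sentence in enumerate(sentences):
        tokens = re.findall(r"[a-z0-9]+", sentence)
        sentence_factor = 1.0 / (1 + s_idx)
        for n in range(1, max_n + 1):
            for pos in range(len(tokens) - n + 1):
                gram = " ".join(tokens[pos:pos + n])
                weights[gram] += sentence_factor / (1 + pos)
    return dict(weights)


def similarity(w1, w2):
    """Weighted Jaccard overlap of two N-gram weight maps (an assumed measure)."""
    keys = set(w1) | set(w2)
    shared = sum(min(w1.get(k, 0.0), w2.get(k, 0.0)) for k in keys)
    total = sum(max(w1.get(k, 0.0), w2.get(k, 0.0)) for k in keys)
    return shared / total if total else 0.0


def gaac(docs, k):
    """Group-average agglomerative clustering down to k clusters.
    Returns clusters as sorted lists of document indices."""
    weights = [ngram_weights(d) for d in docs]
    n = len(docs)
    sim = [[similarity(weights[i], weights[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]

    def group_average(a, b):
        # Average similarity over all document pairs in the would-be merged cluster.
        pairs = list(combinations(a + b, 2))
        return sum(sim[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > k:
        # Merge the pair of clusters with the highest group-average similarity.
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: group_average(clusters[p[0]], clusters[p[1]]))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return [sorted(c) for c in clusters]
```

Calling `gaac(docs, 2)` on a handful of short texts returns two lists of document indices. The naive pairwise search is only meant to keep the sketch readable; it would not scale to a real corpus.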

Topics: Statistical Models
Year: 2010
OAI identifier: oai:cogprints.org:7148
Full text:
  • http://cogprints.org/7148/1/KD...
  • http://cogprints.org/7148/