

By Mr. Niraj Kumar, Mr. Venkata Vinay Babu Vemula, Dr. Kannan Srinathan and Dr. Vasudeva Varma


This paper addresses the question: “How can we use Wikipedia-based concepts in document clustering with less human involvement while still improving the results?” In the devised system, we propose a method that exploits the importance of N-grams in a document and uses Wikipedia-based additional knowledge for document clustering based on group-average agglomerative clustering (GAAC). The importance of an N-gram in a document depends on several features, including but not limited to its frequency, the position of its occurrence within a sentence, and the position within the document of the sentence in which it occurs. First, we introduce a new similarity measure that takes the weighted N-gram importance into account when computing the similarity between documents during clustering. As a result, the chances of grouping topically similar documents are improved. Second, we use Wikipedia as an additional knowledge base, both to remove noisy entries from the extracted N-grams and to reduce the information gap between conceptually related N-grams that fail to match owing to differences in writing scheme or strategy. Our experimental results on a publicly available text dataset show that the devised system significantly outperforms bag-of-words-based state-of-the-art systems in this area.
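
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The weighting formula (reciprocal sentence position times reciprocal word position, summed over occurrences), the function names `ngram_weights`, `cosine`, and `gaac`, and the use of cosine similarity are all assumptions made for illustration; the abstract does not give the actual formulas. The Wikipedia-based filtering and concept-matching step is omitted entirely.

```python
import math
from collections import defaultdict
from itertools import combinations


def ngram_weights(text, max_n=2):
    """Map each N-gram in `text` to an importance weight.

    Hypothetical scheme (not from the paper): every occurrence contributes
    1 / (1 + sentence index) * 1 / (1 + word index), so N-grams that occur
    often, early in a sentence, and in early sentences score higher.
    """
    weights = defaultdict(float)
    sentences = [s.split() for s in text.lower().split(".") if s.strip()]
    for s_idx, tokens in enumerate(sentences):
        sentence_factor = 1.0 / (1 + s_idx)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = " ".join(tokens[i:i + n])
                weights[gram] += sentence_factor / (1 + i)
    return dict(weights)


def cosine(a, b):
    """Cosine similarity between two sparse weight dictionaries."""
    dot = sum(w * b[g] for g, w in a.items() if g in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gaac(docs, k):
    """Group-average agglomerative clustering down to `k` clusters.

    At each step, merge the two clusters whose union has the highest
    average pairwise similarity over all document pairs in the union.
    """
    vectors = [ngram_weights(d) for d in docs]
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]

    def group_average(c1, c2):
        pairs = list(combinations(c1 + c2, 2))
        return sum(sim[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > k:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: group_average(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return [sorted(c) for c in clusters]
```

Calling `gaac(docs, k)` with a list of document strings returns `k` lists of document indices. This naive version recomputes group averages on every merge, which is acceptable only for small collections.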

Topics: Statistical Models
Year: 2010
