Text classification method review

By Aigars Mahinovs, Ashutosh Tiwari, Rajkumar Roy (series editor) and David Baxter (series editor)

Abstract

With the explosion of information fuelled by the growth of the World Wide Web, it is no longer feasible for a human observer to understand all the incoming data, or even to classify it into categories. As the volume of information and the available computing power grow together, automatic classification of data, particularly textual data, becomes increasingly important. This paper reviews the generic text classification process, the phases of that process, and the methods used at each phase. Examples from web page classification and spam classification are provided throughout the text. The principles of operation of four main text classification engines are described: Naïve Bayesian, k Nearest Neighbours, Support Vector Machines and Perceptron Neural Networks. The paper surveys the state of the art in all these phases, noting the methods and algorithms used, the ways in which researchers are trying to reduce computational complexity and improve the precision of the text classification process, and how text classification is used in practice. The paper avoids extensive use of mathematical formulae so as to suit readers with little or no background in theoretical mathematics.

Topics: Text classification, Bayes, kNN, SVM, Neural network, Feature extraction, Feature reduction, Web page classification
Year: 2007
OAI identifier: oai:dspace.lib.cranfield.ac.uk:1826/1860
Provided by: Cranfield CERES
