With the explosion of information fuelled by the growth of the World Wide Web, it is no longer feasible for a human observer to understand all the incoming data or even classify it into categories. Given this growth of information and the simultaneous growth of available computing power, automatic classification of data, particularly textual data, is gaining increasing importance. This paper provides a review of the generic text classification process, the phases of that process, and the methods used in each phase. Examples from web page classification and spam classification are provided throughout the text. The principles of operation of four main text classification engines are described: Naïve Bayesian, k-Nearest Neighbours, Support Vector Machines, and Perceptron neural networks. The paper surveys the state of the art in all these phases, noting the methods and algorithms used, the different ways researchers are trying to reduce the computational complexity and improve the precision of the text classification process, and how text classification is applied in practice. The paper avoids extensive use of mathematical formulae in order to be more accessible to readers with little or no background in theoretical mathematics.