23,835 research outputs found
An Intelligent System For Arabic Text Categorization
Text Categorization (classification) is the process of classifying documents into a predefined set of categories based on their content. In this paper, an intelligent Arabic text categorization system is presented. Machine learning algorithms are used in this system. Many algorithms for stemming and feature selection are tried. Moreover, the document is represented using several term weighting schemes and finally the k-nearest neighbor and Rocchio classifiers are used for classification process. Experiments are performed over self collected data corpus and the results show that the suggested hybrid method of statistical and light stemmers is the most suitable stemming algorithm for Arabic language. The results also show that a hybrid approach of document frequency and information gain is the preferable feature selection criterion and normalized-tfidf is the best weighting scheme. Finally, Rocchio classifier has the advantage over k-nearest neighbor classifier in the classification process. The experimental results illustrate that the proposed model is an efficient method and gives generalization accuracy of about 98%
Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting in the manual
definition of a classifier by domain experts) are a very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.Comment: Accepted for publication on ACM Computing Survey
Link Graph Analysis for Adult Images Classification
In order to protect an image search engine's users from undesirable results
adult images' classifier should be built. The information about links from
websites to images is employed to create such a classifier. These links are
represented as a bipartite website-image graph. Each vertex is equipped with
scores of adultness and decentness. The scores for image vertexes are
initialized with zero, those for website vertexes are initialized according to
a text-based website classifier. An iterative algorithm that propagates scores
within a website-image graph is described. The scores obtained are used to
classify images by choosing an appropriate threshold. The experiments on
Internet-scale data have shown that the algorithm under consideration increases
classification recall by 17% in comparison with a simple algorithm which
classifies an image as adult if it is connected with at least one adult site
(at the same precision level).Comment: 7 pages. Young Scientists Conference, 4th Russian Summer School in
Information Retrieva
- âŠ