Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting in the manual
definition of a classifier by domain experts) are very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.
Comment: Accepted for publication in ACM Computing Surveys
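The three problems named in this abstract (document representation, classifier construction, and classifier evaluation) can be illustrated with a minimal sketch. It assumes a lowercase bag-of-words representation and a multinomial Naive Bayes classifier with add-one smoothing, chosen here only for brevity. The function names and the toy documents are hypothetical and are not taken from the survey.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Document representation: lowercase bag of words (an assumed, simplistic choice).
    return text.lower().split()


def train(labeled_docs):
    # Classifier construction: collect per-category word counts
    # from a set of preclassified documents.
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, category in labeled_docs:
        tokens = tokenize(text)
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(model, text):
    # Pick the category with the highest log-posterior under Naive Bayes.
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category in doc_counts:
        score = math.log(doc_counts[category] / total_docs)
        denom = sum(word_counts[category].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[category][tok] + 1) / denom)
        if score > best_score:
            best, best_score = category, score
    return best


def accuracy(model, labeled_docs):
    # Classifier evaluation: fraction of held-out documents labeled correctly.
    correct = sum(classify(model, text) == cat for text, cat in labeled_docs)
    return correct / len(labeled_docs)


# Toy data, invented for illustration only.
TRAIN = [
    ("stocks fall as markets slide", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins the final match", "sport"),
    ("striker scores in the cup match", "sport"),
]
TEST = [
    ("markets react to interest rates", "finance"),
    ("the team scores in the final", "sport"),
]

model = train(TRAIN)
print(accuracy(model, TEST))
```

Accuracy on a handful of documents says little by itself. The point of the sketch is only to show the three stages as separate functions, so that a different representation or learner can be swapped in without touching the evaluation.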
KACST Arabic Text Classification Project: Overview and Preliminary Results
Electronically formatted Arabic free text can be found in abundance on the World Wide Web, often linked to commercial enterprises and/or government organizations. A great deal of knowledge and many relations lie hidden within these texts, and they can be exploited once suitable intelligent tools have been identified and applied. For example, text mining may help with text classification and categorization. Text classification aims to automatically assign text to a predefined category based on identifiable linguistic features. The process has many useful applications, including email spam detection, web page content filtering, and automatic message routing. This paper presents an overview of the King Abdulaziz City for Science and Technology (KACST) Arabic Text Classification Project, along with some preliminary results. The project will contribute to a better understanding and elaboration of Arabic text classification techniques.
A Route Confidence Evaluation Method for Reliable Hierarchical Text Categorization
Hierarchical Text Categorization (HTC) is becoming increasingly important
with the rapidly growing amount of text data available on the World Wide Web.
Among the different strategies proposed to cope with HTC, the Local Classifier
per Node (LCN) approach attains good performance by mirroring the underlying
class hierarchy while enforcing a top-down strategy in the testing step.
However, the problem of embedding hierarchical information (parent-child
relationship) to improve the performance of HTC systems still remains open. A
confidence evaluation method for a selected route in the hierarchy is proposed
to evaluate the reliability of the final candidate labels in an HTC system. To
exploit the information embedded in the hierarchy, weight factors are used to
reflect the importance of each level. An
acceptance/rejection strategy in the top-down decision-making process is
proposed, which improves the overall categorization accuracy by rejecting a small
percentage of samples, i.e., those with a low reliability score. Experimental
results on the Reuters benchmark dataset (RCV1-v2) confirm the effectiveness
of the proposed method compared to other state-of-the-art HTC methods.
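The idea of scoring a selected route with level weights and then accepting or rejecting it can be sketched as follows. The abstract does not give the actual formula, so everything here is an assumption made for illustration: the route confidence is taken to be a weighted average of per-node classifier scores, and the hierarchy, scores, weights, and threshold are hypothetical values.

```python
def top_down_route(hierarchy, node_scores, root="root"):
    # Top-down testing step: at each level, follow the highest-scoring
    # child until a leaf (a node with no children) is reached.
    route, node = [], root
    while hierarchy.get(node):
        node = max(hierarchy[node], key=lambda child: node_scores[child])
        route.append(node)
    return route


def route_confidence(route, node_scores, level_weights):
    # Assumed combination rule: weighted average of the scores along the
    # route, with one weight per hierarchy level.
    weights = level_weights[:len(route)]
    weighted = sum(w * node_scores[n] for w, n in zip(weights, route))
    return weighted / sum(weights)


def classify_with_rejection(hierarchy, node_scores, level_weights, threshold):
    # Accept the route only if its confidence reaches the threshold;
    # otherwise reject the sample by returning None in place of labels.
    route = top_down_route(hierarchy, node_scores)
    conf = route_confidence(route, node_scores, level_weights)
    return (route if conf >= threshold else None), conf


# Hypothetical two-level hierarchy and local classifier scores for one document.
HIERARCHY = {
    "root": ["economics", "sports"],
    "economics": ["markets", "trade"],
    "sports": ["football", "tennis"],
}
SCORES = {
    "economics": 0.9, "sports": 0.1,
    "markets": 0.55, "trade": 0.45,
    "football": 0.5, "tennis": 0.5,
}
WEIGHTS = [0.4, 0.6]  # assumed: the deeper level counts more

labels, conf = classify_with_rejection(HIERARCHY, SCORES, WEIGHTS, threshold=0.6)
print(labels, conf)
```

Raising the threshold rejects more samples, trading coverage for reliability on the samples that remain. The sketch assumes a non-empty route and at least one weight per level reached.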