Mistake-Driven Learning in Text Categorization
Learning problems in the text processing domain often map the text to a space
whose dimensions are the measured features of the text, e.g., its words. Three
characteristic properties of this domain are (a) very high dimensionality, (b)
both the learned concepts and the instances reside very sparsely in the feature
space, and (c) a high variation in the number of active features in an
instance. In this work we study three mistake-driven learning algorithms for a
typical task of this nature -- text categorization. We argue that these
algorithms -- which categorize documents by learning a linear separator in the
feature space -- have a few properties that make them ideal for this domain. We
then show that a quantum leap in performance is achieved when we further modify
the algorithms to better address some of the specific characteristics of the
domain. In particular, we demonstrate (1) how variation in document length can
be tolerated by either normalizing feature weights or by using negative
weights, (2) the positive effect of applying a threshold range in training, (3)
alternatives in considering feature frequency, and (4) the benefits of
discarding features while training. Overall, we present an algorithm, a
variation of Littlestone's Winnow, which performs significantly better than any
other algorithm tested on this task using a similar feature set.
Comment: 9 pages, uses aclap.st
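As a rough illustration of the mistake-driven idea in this abstract, the following is a minimal sketch of a Winnow-style learner with a threshold range used during training. It is not the authors' exact algorithm: the function names, the multiplicative factors `promote` and `demote`, the `margin` value, and the toy corpus are all assumptions made for this example. Documents are represented sparsely, as lists of active feature indices.

```python
def score(weights, active):
    """Sum the weights of the features that are active in a document."""
    return sum(weights[i] for i in active)


def predict(weights, active, theta):
    """Label a document positive (1) when its score exceeds the threshold."""
    return 1 if score(weights, active) > theta else 0


def train_winnow(examples, n_features, theta, margin=0.5,
                 promote=2.0, demote=0.5, epochs=10):
    """Winnow-style training with a threshold range (all values are assumptions).

    examples: list of (active_feature_indices, label), label in {0, 1}.
    During training, a positive example counts as a mistake unless its score
    reaches theta + margin, and a negative one unless it stays at or below
    theta - margin. Only the active features of a mistaken example change.
    """
    weights = [1.0] * n_features
    for _ in range(epochs):
        updated = False
        for active, label in examples:
            s = score(weights, active)
            if label == 1 and s < theta + margin:
                factor = promote   # promotion step
            elif label == 0 and s > theta - margin:
                factor = demote    # demotion step
            else:
                continue           # outside the threshold range: no update
            for i in active:
                weights[i] *= factor
            updated = True
        if not updated:            # a full pass without updates: stop early
            break
    return weights


# Toy corpus (made up): feature 0 appears only in positive documents.
EXAMPLES = [([0, 1], 1), ([1, 2], 0), ([0, 3], 1), ([2, 3], 0)]
THETA = 2.0
weights = train_winnow(EXAMPLES, n_features=4, theta=THETA)
predictions = [predict(weights, active, THETA) for active, _ in EXAMPLES]
```

Because updates touch only the active features of a misclassified document, the cost of each step depends on document length rather than on the full dimensionality of the feature space, which is one reason such algorithms suit sparse text data.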
Email categorization using (2+1)-tier classification algorithms
In this paper we propose a spam filtering technique using a (2+1)-tier classification approach. The main focus of this paper is to reduce the false positive (FP) rate, which is considered an important research issue in spam filtering. In our approach, the email message is first classified by the first two tier classifiers, and their outputs are passed to the analyzer. The analyzer checks the labels of the output emails and, when the two predictions are identical, sends each email to the corresponding mailbox based on its label. If the first two tier classifiers produce conflicting labels, indicating a possible misclassification, the analyzer invokes the tier-3 classifier, which makes the final decision. This technique reduces the analysis complexity of our previous work. It is also shown that the proposed technique gives better performance in terms of a lower false positive rate as well as better accuracy.
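The analyzer logic described in this abstract can be sketched roughly as follows. This is only one reading of the abstract, under the assumption that the tier-3 classifier is consulted exactly when the first two tiers disagree. The function `classify_2plus1` and the three keyword-based toy classifiers are hypothetical stand-ins, not the classifiers used in the paper.

```python
from typing import Callable

# A classifier maps an email body to a label, "spam" or "ham".
Classifier = Callable[[str], str]


def classify_2plus1(message: str, tier1: Classifier, tier2: Classifier,
                    tier3: Classifier) -> str:
    """Return the agreed label of tiers 1 and 2, else defer to tier 3."""
    first = tier1(message)
    second = tier2(message)
    if first == second:
        return first           # identical predictions: accept the label
    return tier3(message)      # conflict: tier 3 makes the final decision


# Hypothetical toy classifiers, used only to exercise the control flow.
def tier1(message: str) -> str:
    return "spam" if "free" in message.lower() else "ham"


def tier2(message: str) -> str:
    return "spam" if "winner" in message.lower() else "ham"


def tier3(message: str) -> str:
    return "spam" if message.count("!") >= 2 else "ham"


agree_spam = classify_2plus1("Free prize, winner", tier1, tier2, tier3)
agree_ham = classify_2plus1("Meeting at noon", tier1, tier2, tier3)
conflict_ham = classify_2plus1("Free lunch on Friday", tier1, tier2, tier3)
conflict_spam = classify_2plus1("You are a winner!!!", tier1, tier2, tier3)
```

In this sketch the third classifier runs only on the messages where the first two conflict, so it adds cost only for the uncertain cases.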
Email classification using data reduction method
Correctly classifying user emails in the presence of spam is an important research issue for anti-spam researchers. This paper presents an effective and efficient email classification technique based on a data filtering method. In our testing we introduce an innovative filtering technique that uses an instance selection method (ISM) to remove uninformative data instances from the training model before classifying the test data. The objective of ISM is to identify which instances (examples, patterns) in email corpora should be selected as representatives of the entire dataset, without significant loss of information. We used the WEKA interface in our integrated classification model and tested diverse classification algorithms. Our empirical studies show strong performance in terms of classification accuracy, along with a reduction in false positive instances.
A Performance Evaluation of Classifiers Employ Language Dependent Tools for Indonesian Text
This paper evaluates the performance of Maximum
Entropy (MaxEnt), Support Vector Machine (SVM) and Naïve
Bayes (NB) techniques for Indonesian text classification. The performance
of the MaxEnt and SVM techniques is compared against
the baseline NB technique. We also investigate the effect that
language-dependent tools, such as Indonesian stemming and stop word
removal, have on the text classification performance of these techniques.
Up to now, there has been no experimental report on the
effect of an Indonesian stemmer on text classification accuracy.
From our experiments, we conclude that maximum entropy
performs better than the other classifiers in general.
Language-dependent tools such as stemming and stop word removal have
only a small effect on the accuracy of text classification. However,
the stemmed approach scored the highest average accuracy and, because
it reduces the dimension of the feature vectors used in classification,
is a viable step in the pre-processing stage.
Hybrid Approach Combining Machine Learning and a Rule-Based Expert System for Text Categorization
This paper discusses a novel hybrid approach for text categorization that combines a machine learning algorithm, which provides a base model trained with a labeled corpus, with a rule-based expert system, which is used to improve the results provided by the previous classifier, by filtering false positives and dealing with false negatives. The main advantage is that the system can be easily fine-tuned by adding specific rules for those noisy or conflicting categories that have not been successfully trained. We also describe an implementation based on k-Nearest Neighbor and a simple rule language to express lists of positive, negative and relevant (multiword) terms appearing in the input text. The system is evaluated in several scenarios, including the popular Reuters-21578 news corpus for comparison to other approaches, and categorization using IPTC metadata, EUROVOC thesaurus and others. Results show that this approach achieves a precision that is comparable to top ranked methods, with the added value that it does not require a demanding human expert workload to trai
Personalized Text Categorization Using a MultiAgent Architecture
In this paper, a system able to retrieve contents deemed
relevant for the users through a text categorization process,
is presented. The system is built exploiting a generic
multiagent architecture that supports the implementation
of applications aimed at (i) retrieving heterogeneous data
spread among different sources (e.g., generic html pages,
news, blogs, forums, and databases); (ii) filtering and organizing
them according to personal interests explicitly stated
by each user; (iii) providing adaptation techniques to improve
and refine the profile of each selected user over time.
In particular, the implemented multiagent system creates
personalized press reviews from online newspapers. Preliminary
results are encouraging and highlight the effectiveness
of the approach.