54 research outputs found

Mistake-Driven Learning in Text Categorization

Author: Dagan Ido
Karov Yael
Roth Dan
Publication date: 01/01/1997

Learning problems in the text processing domain often map the text to a space whose dimensions are the measured features of the text, e.g., its words. Three characteristic properties of this domain are (a) very high dimensionality, (b) both the learned concepts and the instances reside very sparsely in the feature space, and (c) a high variation in the number of active features in an instance. In this work we study three mistake-driven learning algorithms for a typical task of this nature -- text categorization. We argue that these algorithms -- which categorize documents by learning a linear separator in the feature space -- have a few properties that make them ideal for this domain. We then show that a quantum leap in performance is achieved when we further modify the algorithms to better address some of the specific characteristics of the domain. In particular, we demonstrate (1) how variation in document length can be tolerated by either normalizing feature weights or by using negative weights, (2) the positive effect of applying a threshold range in training, (3) alternatives in considering feature frequency, and (4) the benefits of discarding features while training. Overall, we present an algorithm, a variation of Littlestone's Winnow, which performs significantly better than any other algorithm tested on this task using a similar feature set.

Comment: 9 pages, uses aclap.st
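
As a rough illustration of the kind of mistake-driven linear learner this abstract describes, here is a minimal sketch of a positive Winnow-style classifier over sparse word counts. It is not the authors' algorithm: the class name `SparseWinnow`, the parameter values, the length normalization used, and the toy documents are all assumptions made for the example. It shows multiplicative promotion and demotion on mistakes, document-length normalization, and a threshold range applied during training only.

```python
class SparseWinnow:
    """Illustrative Winnow-style learner (hypothetical, not the paper's code)."""

    def __init__(self, theta=1.0, alpha=1.5, beta=0.5, margin=0.1):
        self.theta = theta      # decision threshold
        self.alpha = alpha      # promotion factor, greater than 1
        self.beta = beta        # demotion factor, between 0 and 1
        self.margin = margin    # half-width of the training threshold range
        self.weights = {}       # sparse: unseen features implicitly weigh 1.0

    @staticmethod
    def normalize(counts):
        # Scale feature strengths to sum to 1 so long documents do not
        # cross the threshold merely by having more active features.
        total = sum(counts.values())
        return {f: c / total for f, c in counts.items()} if total else {}

    def score(self, counts):
        x = self.normalize(counts)
        return sum(self.weights.get(f, 1.0) * v for f, v in x.items())

    def predict(self, counts):
        return self.score(counts) > self.theta

    def update(self, counts, label):
        # During training, a score inside the threshold range counts as a mistake.
        s = self.score(counts)
        if label and s <= self.theta + self.margin:
            factor = self.alpha          # promote active features
        elif not label and s >= self.theta - self.margin:
            factor = self.beta           # demote active features
        else:
            return False
        for f in counts:
            self.weights[f] = self.weights.get(f, 1.0) * factor
        return True

    def fit(self, examples, max_epochs=20):
        # Repeat passes over the data until an epoch triggers no update.
        for _ in range(max_epochs):
            if sum(self.update(c, y) for c, y in examples) == 0:
                break
        return self


# Toy data (made up for illustration): True marks the target category.
TRAIN = [
    ({"wheat": 1, "export": 1}, True),
    ({"wheat": 1, "grain": 1}, True),
    ({"stock": 1, "export": 1}, False),
    ({"bank": 1, "stock": 1}, False),
]
model = SparseWinnow().fit(TRAIN)
```

Because weights are stored only for features that took part in an update, memory grows with the number of mistakes rather than with the full vocabulary, which suits the sparse, high-dimensional setting the abstract describes.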

arXiv.org e-Print Archive

CiteSeerX

Dynamic feature selection for spam filtering using support vector machine

Author: Chowdhury Morshed
Islam Md. Rafiqul
Zhou Wanlei
Publication venue: International Association for Computer & Information Science
Publication date: 01/01/2008

Deakin Research Online

Email categorization using (2+1)-tier classification algorithms

Author: Chowdhury Morshed U.
Islam Md. Rafiqul
Zhou Wanlei
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2008

In this paper we propose a spam filtering technique using a (2+1)-tier classification approach. The main focus of this paper is to reduce the false positive (FP) rate, which is considered an important research issue in spam filtering. In our approach, the email message is first classified by the first two tier classifiers, and their outputs are passed to the analyzer. The analyzer checks the labels of the output emails and, when the two predictions are identical, sends each email to the corresponding mailbox based on its label. If the predictions of the first two tier classifiers are not identical, indicating a misclassification, the analyzer invokes the tier-3 classifier, which makes the final decision. This technique reduces the analysis complexity of our previous work. It is also shown that the proposed technique gives better performance in terms of reducing false positives as well as better accuracy.
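
The control flow described in this abstract can be sketched as follows. This is only a hedged illustration of the agree-or-escalate idea: the function name `classify_two_plus_one`, the three keyword-based toy classifiers, and the sample messages are hypothetical and are not taken from the paper, which does not specify its classifiers here.

```python
from typing import Callable

Classifier = Callable[[str], str]  # maps an email body to "spam" or "ham"


def classify_two_plus_one(email: str, tier1: Classifier, tier2: Classifier,
                          tier3: Classifier) -> str:
    """Analyzer step: accept identical tier-1/tier-2 labels, else ask tier 3."""
    first, second = tier1(email), tier2(email)
    if first == second:
        return first          # identical prediction: route by that label
    return tier3(email)       # disagreement: tier 3 takes the final decision


# Toy stand-in classifiers (assumptions for the example only).
def keyword_tier(email: str) -> str:
    return "spam" if "free" in email.lower() else "ham"


def punctuation_tier(email: str) -> str:
    return "spam" if email.count("!") >= 2 else "ham"


tier3_calls = []  # records which emails were escalated to tier 3


def arbiter_tier(email: str) -> str:
    tier3_calls.append(email)
    return "spam" if "winner" in email.lower() else "ham"


emails = [
    "Free prize!!! You are a winner",
    "Feel free to call me tomorrow",
    "Lunch at noon?",
]
labels = [
    classify_two_plus_one(e, keyword_tier, punctuation_tier, arbiter_tier)
    for e in emails
]
```

In this toy run only the second message reaches the third classifier, which mirrors the abstract's point that tier 3 is consulted only when the first two tiers do not agree.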

Deakin Research Online

Email classification using data reduction method

Author: Islam Rafiqul
Xiang Yang
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2010

Correctly classifying user emails in the face of spam is an important research issue for anti-spam researchers. This paper presents an effective and efficient email classification technique based on a data filtering method. In our testing we introduce an innovative filtering technique that uses an instance selection method (ISM) to remove pointless data instances from the training model and then classify the test data. The objective of ISM is to identify which instances (examples, patterns) in email corpora should be selected as representatives of the entire dataset, without significant loss of information. We used the WEKA interface in our integrated classification model and tested diverse classification algorithms. Our empirical studies show significant performance in terms of classification accuracy, with a reduction in false positive instances.

Deakin Research Online

A Performance Evaluation of Classifiers Employ Language Dependent Tools for Indonesian Text

Author: Arifin Agus Zainal
Hariadi Mochamad
Purnomo Mauridhi Hery
Sumpeno Surya
Publication date: 01/01/2010

This paper evaluates the performance of Maximum Entropy (MaxEnt), Support Vector Machine (SVM) and Naïve Bayes (NB) techniques for Indonesian text classification. The performance of the MaxEnt and SVM techniques is compared against the baseline NB technique. We also investigate the effect that language-dependent tools such as Indonesian stemming and stop-word removal can have on the text classification performance of these techniques. Up to now, there has been no experimental report on the effect of an Indonesian stemmer on text classification accuracy. From our experiments, we conclude that maximum entropy performs better than the other classifiers in general. Language-dependent tools such as stemming and stop-word removal have only a small effect on the accuracy of text classification. However, the stemmed approach scored the highest average accuracy and, because it reduces the dimension of the feature vectors used in classification, it is a viable step in the pre-processing stage.

ITS Repository

Hybrid Approach Combining Machine Learning and a Rule-Based Expert System for Text Categorization

Author: Collada Pérez Sonia
González Cristóbal José Carlos
Lana Serrano Sara
Villena Román Julio
Publication venue: E.U.I.T. Telecomunicación (UPM)
Publication date: 01/01/2011

This paper discusses a novel hybrid approach for text categorization that combines a machine learning algorithm, which provides a base model trained with a labeled corpus, with a rule-based expert system, which is used to improve the results provided by the previous classifier, by filtering false positives and dealing with false negatives. The main advantage is that the system can be easily fine-tuned by adding specific rules for those noisy or conflicting categories that have not been successfully trained. We also describe an implementation based on k-Nearest Neighbor and a simple rule language to express lists of positive, negative and relevant (multiword) terms appearing in the input text. The system is evaluated in several scenarios, including the popular Reuters-21578 news corpus for comparison to other approaches, and categorization using IPTC metadata, EUROVOC thesaurus and others. Results show that this approach achieves a precision that is comparable to top ranked methods, with the added value that it does not require a demanding human expert workload to trai

Archivo Digital UPM

Personalized Text Categorization Using a MultiAgent Architecture

Author: ADDIS A
ARMANO G
CHERCHI G
VARGIU E
Publication date: 01/01/2006

This paper presents a system able to retrieve contents deemed relevant for its users through a text categorization process. The system is built on a generic multiagent architecture that supports the implementation of applications aimed at (i) retrieving heterogeneous data spread among different sources (e.g., generic html pages, news, blogs, forums, and databases); (ii) filtering and organizing them according to personal interests explicitly stated by each user; (iii) providing adaptation techniques to improve and refine the profile of each selected user over time. In particular, the implemented multiagent system creates personalized press reviews from online newspapers. Preliminary results are encouraging and highlight the effectiveness of the approach.

Archivio istituzionale della ricerca - Università di Cagliari