Using online linear classifiers to filter spam Emails
The performance of two online linear classifiers, the Perceptron and Littlestone’s Winnow, is explored on two anti-spam filtering benchmark corpora, PU1 and Ling-Spam. We study the performance for varying numbers of features, along with three different feature selection methods: Information Gain (IG), Document Frequency (DF) and Odds Ratio. The size of the training set and the number of training iterations are also investigated for both classifiers. The experimental results show that both the Perceptron and Winnow perform much better when using IG or DF than when using Odds Ratio. It is further demonstrated that when using IG or DF, the classifiers are insensitive to the number of features and the number of training iterations, and not greatly sensitive to the size of the training set. Winnow is shown to slightly outperform the Perceptron. It is also demonstrated that both of these online classifiers perform much better than a standard Naïve Bayes method. The theoretical and implementation computational complexity of these two classifiers is very low, and they are very easily updated adaptively. They outperform most of the published results, while being significantly easier to train and adapt. The analysis and promising experimental results indicate that the Perceptron and Winnow are two very competitive classifiers for anti-spam filtering.
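The two update rules named in this abstract can be illustrated with a minimal sketch. It assumes binary bag-of-words features (each message is a list of active feature indices) and labels of +1 for spam and -1 for legitimate mail. The function names, the toy data, the promotion factor `alpha = 2` and the threshold `n / 2` are assumptions made for illustration, and the Winnow shown is the basic positive-weight form, which may differ from the variant evaluated in the paper.

```python
def train_perceptron(samples, n_features, epochs=5, lr=1.0):
    """Additive, mistake-driven updates. samples: (active_indices, label)."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for active, label in samples:
            score = b + sum(w[i] for i in active)
            pred = 1 if score > 0 else -1
            if pred != label:
                # Move the weights of the active features toward the label.
                for i in active:
                    w[i] += lr * label
                b += lr * label
    return w, b


def predict_perceptron(w, b, active):
    return 1 if b + sum(w[i] for i in active) > 0 else -1


def train_winnow(samples, n_features, epochs=5, alpha=2.0):
    """Multiplicative, mistake-driven updates with a fixed threshold."""
    w = [1.0] * n_features
    theta = n_features / 2.0  # assumed threshold for this sketch
    for _ in range(epochs):
        for active, label in samples:
            score = sum(w[i] for i in active)
            pred = 1 if score > theta else -1
            if pred != label:
                # Promote on a missed spam, demote on a false alarm.
                factor = alpha if label == 1 else 1.0 / alpha
                for i in active:
                    w[i] *= factor
    return w, theta


def predict_winnow(w, theta, active):
    return 1 if sum(w[i] for i in active) > theta else -1


# Hypothetical toy vocabulary: 0="free", 1="money", 2="meeting", 3="report".
samples = [
    ([0, 1], 1), ([2, 3], -1), ([0], 1),
    ([2], -1), ([1], 1), ([3], -1),
]

w_p, b_p = train_perceptron(samples, n_features=4)
w_w, theta_w = train_winnow(samples, n_features=4)

for active, label in samples:
    print(active, label,
          predict_perceptron(w_p, b_p, active),
          predict_winnow(w_w, theta_w, active))
```

Both learners touch only the weights of features present in a misclassified message, which is one way to read the abstract's remark that they are cheap to train and easy to update adaptively.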
Writer Identification Using Inexpensive Signal Processing Techniques
We propose to use novel and classical audio and text signal-processing and
other techniques for "inexpensive", fast writer identification on
scanned hand-written documents "visually". Here "inexpensive" refers to the
efficiency of the identification process in terms of CPU cycles while
preserving decent accuracy for preliminary identification. This is a
comparative study of multiple algorithm combinations in a pattern recognition
pipeline implemented in Java around an open-source Modular Audio Recognition
Framework (MARF) that can handle much more than audio. We present our
preliminary experimental findings in such an identification task. We simulate
"visual" identification by "looking" at the hand-written document as a whole
rather than trying to extract fine-grained features out of it prior to
classification.
Comment: 9 pages; 1 figure; presented at CISSE'09 at
http://conference.cisse2009.org/proceedings.aspx ; includes the
application source code; based on MARF described in arXiv:0905.123
Stacking classifiers for anti-spam filtering of e-mail
We evaluate empirically a scheme for combining classifiers, known as stacked
generalization, in the context of anti-spam filtering, a novel cost-sensitive
application of text categorization. Unsolicited commercial e-mail, or "spam",
floods mailboxes, causing frustration, wasting bandwidth, and exposing minors
to unsuitable content. Using a public corpus, we show that stacking can improve
the efficiency of automatically induced anti-spam filters, and that such
filters can be used in real-life applications.
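Stacked generalization, as named in this abstract, trains a meta-level classifier on the predictions of several base classifiers. The following is a minimal sketch of that idea, not the scheme evaluated in the paper: the two base learners (a token-vote rule and a 1-nearest-neighbour rule), the lookup-table meta-learner, the leave-one-out construction of the meta-level training data and the toy messages are all assumptions chosen to keep the example short.

```python
from collections import Counter, defaultdict


def fit_token_vote(train):
    """Base learner 1 (toy): each token votes by its spam-minus-ham count."""
    counts = Counter()
    for tokens, label in train:
        for t in set(tokens):
            counts[t] += label
    return lambda tokens: 1 if sum(counts[t] for t in set(tokens)) > 0 else -1


def fit_nearest_neighbour(train):
    """Base learner 2 (toy): label of the training message sharing most tokens."""
    def predict(tokens):
        best = max(train, key=lambda ex: len(set(ex[0]) & set(tokens)))
        return best[1]
    return predict


def fit_stack(train, base_fitters):
    """Train a meta-classifier on held-out predictions of the base learners."""
    votes = defaultdict(int)
    for i, (tokens, label) in enumerate(train):
        held_in = train[:i] + train[i + 1:]
        # Base predictions for a message the base learners did not see.
        preds = tuple(fit(held_in)(tokens) for fit in base_fitters)
        votes[preds] += label
    # Refit the base learners on all data for use at prediction time.
    base_models = [fit(train) for fit in base_fitters]

    def predict(tokens):
        preds = tuple(model(tokens) for model in base_models)
        # Assumption: unseen or tied combinations default to legitimate (-1),
        # since blocking legitimate mail is the costlier error.
        return 1 if votes.get(preds, 0) > 0 else -1
    return predict


# Hypothetical toy corpus: +1 = spam, -1 = legitimate.
train = [
    (["free", "money", "now"], 1),
    (["win", "free", "prize"], 1),
    (["cheap", "money", "offer"], 1),
    (["meeting", "agenda", "monday"], -1),
    (["project", "report", "monday"], -1),
    (["agenda", "report", "review"], -1),
]

stacked = fit_stack(train, [fit_token_vote, fit_nearest_neighbour])
print(stacked(["free", "prize", "money"]))
print(stacked(["monday", "report"]))
```

Building the meta-level data from held-out predictions is the step that distinguishes stacking from simple voting: the meta-classifier learns how far to trust each base learner on messages that learner has not seen.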
Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting of the manual
definition of a classifier by domain experts) are very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.
Comment: Accepted for publication in ACM Computing Surveys
Pigment Melanin: Pattern for Iris Recognition
Recognition of iris based on Visible Light (VL) imaging is a difficult
problem because of the light reflection from the cornea. Nonetheless, pigment
melanin provides a rich feature source in VL, unavailable in Near-Infrared
(NIR) imaging. This is due to biological spectroscopy of eumelanin, a chemical
not stimulated in NIR. In this case, a plausible solution to observe such
patterns may be provided by an adaptive procedure using a variational technique
on the image histogram. To describe the patterns, a shape analysis method is
used to derive feature-code for each subject. An important question is how much
the melanin patterns, extracted from VL, are independent of iris texture in
NIR. With this question in mind, the present investigation proposes fusion of
features extracted from NIR and VL to boost the recognition performance. We
have collected our own database (UTIRIS) consisting of both NIR and VL images
of 158 eyes of 79 individuals. This investigation demonstrates that the
proposed algorithm is highly sensitive to the patterns of chromophores and
improves the iris recognition rate.
Comment: To be published in the Special Issue on Biometrics, IEEE Transactions on
Instrumentation and Measurement, Volume 59, Issue number 4, April 201