27,047 research outputs found

    Feature extraction and classification of spam emails

    Get PDF

    Using online linear classifiers to filter spam Emails

    Get PDF
    The performance of two online linear classifiers - the Perceptron and Littlestone’s Winnow – is explored for two anti-spam filtering benchmark corpora - PU1 and Ling-Spam. We study the performance for varying numbers of features, along with three different feature selection methods: Information Gain (IG), Document Frequency (DF) and Odds Ratio. The size of the training set and the number of training iterations are also investigated for both classifiers. The experimental results show that both the Perceptron and Winnow perform much better when using IG or DF than using Odds Ratio. It is further demonstrated that when using IG or DF, the classifiers are insensitive to the number of features and the number of training iterations, and not greatly sensitive to the size of training set. Winnow is shown to slightly outperform the Perceptron. It is also demonstrated that both of these online classifiers perform much better than a standard Naïve Bayes method. The theoretical and implementation computational complexity of these two classifiers are very low, and they are very easily adaptively updated. They outperform most of the published results, while being significantly easier to train and adapt. The analysis and promising experimental results indicate that the Perceptron and Winnow are two very competitive classifiers for anti-spam filtering

    Writer Identification Using Inexpensive Signal Processing Techniques

    Full text link
    We propose to use novel and classical audio and text signal-processing and otherwise techniques for "inexpensive" fast writer identification tasks of scanned hand-written documents "visually". The "inexpensive" refers to the efficiency of the identification process in terms of CPU cycles while preserving decent accuracy for preliminary identification. This is a comparative study of multiple algorithm combinations in a pattern recognition pipeline implemented in Java around an open-source Modular Audio Recognition Framework (MARF) that can do a lot more beyond audio. We present our preliminary experimental findings in such an identification task. We simulate "visual" identification by "looking" at the hand-written document as a whole rather than trying to extract fine-grained features out of it prior classification.Comment: 9 pages; 1 figure; presented at CISSE'09 at http://conference.cisse2009.org/proceedings.aspx ; includes the the application source code; based on MARF described in arXiv:0905.123

    Stacking classifiers for anti-spam filtering of e-mail

    Full text link
    We evaluate empirically a scheme for combining classifiers, known as stacked generalization, in the context of anti-spam filtering, a novel cost-sensitive application of text categorization. Unsolicited commercial e-mail, or "spam", floods mailboxes, causing frustration, wasting bandwidth, and exposing minors to unsuitable content. Using a public corpus, we show that stacking can improve the efficiency of automatically induced anti-spam filters, and that such filters can be used in real-life applications

    Machine Learning in Automated Text Categorization

    Full text link
    The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.Comment: Accepted for publication on ACM Computing Survey

    Pigment Melanin: Pattern for Iris Recognition

    Full text link
    Recognition of iris based on Visible Light (VL) imaging is a difficult problem because of the light reflection from the cornea. Nonetheless, pigment melanin provides a rich feature source in VL, unavailable in Near-Infrared (NIR) imaging. This is due to biological spectroscopy of eumelanin, a chemical not stimulated in NIR. In this case, a plausible solution to observe such patterns may be provided by an adaptive procedure using a variational technique on the image histogram. To describe the patterns, a shape analysis method is used to derive feature-code for each subject. An important question is how much the melanin patterns, extracted from VL, are independent of iris texture in NIR. With this question in mind, the present investigation proposes fusion of features extracted from NIR and VL to boost the recognition performance. We have collected our own database (UTIRIS) consisting of both NIR and VL images of 158 eyes of 79 individuals. This investigation demonstrates that the proposed algorithm is highly sensitive to the patterns of cromophores and improves the iris recognition rate.Comment: To be Published on Special Issue on Biometrics, IEEE Transaction on Instruments and Measurements, Volume 59, Issue number 4, April 201
    corecore