Search CORE

7 research outputs found

Spam Filter Improvement Through Measurement

Author: Lynam Thomas Richard
Publication venue: 'University of Waterloo'
Publication date: 01/01/2009
Field of study

This work supports the thesis that sound quantitative evaluation for spam filters leads to substantial improvement in the classification of email. To this end, new laboratory testing methods and datasets are introduced, and evidence is presented that their adoption at Text REtrieval Conference (TREC)and elsewhere has led to an improvement in state of the art spam filtering. While many of these improvements have been discovered by others, the best-performing method known at this time -- spam filter fusion -- was demonstrated by the author. This work describes four principal dimensions of spam filter evaluation methodology and spam filter improvement. An initial study investigates the application of twelve open-source filter configurations in a laboratory environment, using a stream of 50,000 messages captured from a single recipient over eight months. The study measures the impact of user feedback and on-line learning on filter performance using methodology and measures which were released to the research community as the TREC Spam Filter Evaluation Toolkit. The toolkit was used as the basis of the TREC Spam Track, which the author co-founded with Cormack. The Spam Track, in addition to evaluating a new application (email spam), addressed the issue of testing systems on both private and public data. While streams of private messages are most realistic, they are not easy to come by and cannot be shared with the research community as archival benchmarks. Using the toolkit, participant filters were evaluated on both, and the differences found not to substantially confound evaluation; as a result, public corpora were validated as research tools. Over the course of TREC and similar evaluation efforts, a dozen or more archival benchmarks -- some private and some public -- have become available. The toolkit and methodology have spawned improvements in the state of the art every year since its deployment in 2005. In 2005, 2006, and 2007, the spam track yielded new best-performing systems based on sequential compression models, orthogonal sparse bigram features, logistic regression and support vector machines. Using the TREC participant filters, we develop and demonstrate methods for on-line filter fusion that outperform all other reported on-line personal spam filters

University of Waterloo's Institutional Repository

A review of spam email detection: analysis of spammer strategies and the dataset shift problem

Author: Alaiz Rodríguez Rocío
Alegre Gutiérrez Enrique
González Castro Víctor
Jáñez-Martino Francisco
López Fidalgo Eduardo
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 15/06/2022
Field of study

.Spam emails have been traditionally seen as just annoying and unsolicited emails containing advertisements, but they increasingly include scams, malware or phishing. In order to ensure the security and integrity for the users, organisations and researchers aim to develop robust filters for spam email detection. Recently, most spam filters based on machine learning algorithms published in academic journals report very high performance, but users are still reporting a rising number of frauds and attacks via spam emails. Two main challenges can be found in this field: (a) it is a very dynamic environment prone to the dataset shift problem and (b) it suffers from the presence of an adversarial figure, i.e. the spammer. Unlike classical spam email reviews, this one is particularly focused on the problems that this constantly changing environment poses. Moreover, we analyse the different spammer strategies used for contaminating the emails, and we review the state-of-the-art techniques to develop filters based on machine learning. Finally, we empirically evaluate and present the consequences of ignoring the matter of dataset shift in this practical field. Experimental results show that this shift may lead to severe degradation in the estimated generalisation performance, with error rates reaching values up to 48.81%.SIPublicación en abierto financiada por el Consorcio de Bibliotecas Universitarias de Castilla y León (BUCLE), con cargo al Programa Operativo 2014ES16RFOP009 FEDER 2014-2020 DE CASTILLA Y LEÓN, Actuación:20007-CL - Apoyo Consorcio BUCL

Leon University (Spain)

An automatic email mining approach using semantic non-parametric K-Means++ clustering

Author: Soni Gunjan
Publication venue: 'University of Windsor Leddy Library'
Publication date: 01/01/2013
Field of study

Email inboxes are now filled with huge varieties of voluminous messages and thus increasing the problem of email overload which places financial burden on companies and individuals. Email mining provides solution to email overload problem by automatically grouping emails into meaningful and similar groups based on email subjects and contents. Existing email mining systems such as Kernel-Selected clustering and BuzzTrack, do not consider the semantic similarity between email contents, also when large number of email messages are clustered to a single folder they retain the problem of email overload. This thesis proposes a system named AEMS for automatic folder and sub-folder creation, indexing of the created folders with link to each folder and sub-folder, also an Apriori-based folder summarization containing important keywords from the folder. Thesis aims at solving email overload problem through semantic re-structuring of emails. In AEMS model, a novel approach named Semantic Non-parametric K-Means++ clustering is proposed for folder creation, which avoids, (1) random seed selection by selecting the seed according to email weights, and (2) pre-defined number of clusters using the similarity between the email contents. Experiments show the effectiveness and efficiency of the proposed techniques using large volumes of email datasets. Keywords: Email Mining, Email Overload, Email Management, Data Mining, Clustering, Feature Selection, Folder Summarization

Scholarship at UWindsor

Recommended from our members

Toward practical and private online services

Author: Gupta Trinabh
Publication venue
Publication date: 31/01/2018
Field of study

Today's common online services (social networks, media streaming, messaging, email, etc.) bring convenience. However, these services are susceptible to privacy leaks. Certainly, email snooping by rogue employees, email server hacks, and accidental disclosures of user ratings for movies are some sources of private information leakage. This dissertation investigates the following question: Can we build systems that (a) provide strong privacy guarantees to the users, (b) are consistent with existing commercial and policy regimes, and (c) are affordable? Satisfying all three requirements simultaneously is challenging, as providing strong privacy guarantees usually necessitates either sacrificing functionality, incurring high resource costs, or both. Indeed, there are powerful cryptographic protocols---private information retrieval (PIR), and secure two-party computation (2PC)---that provide strong guarantees but are orders of magnitude more expensive than their non-private counterparts. This dissertation takes these protocols as a starting point and then substantially reduces their costs by tailoring them using application-specific properties. It presents two systems, Popcorn and Pretzel, built on this design ethos. Popcorn is a Netflix-like media delivery system, that provably hides, even from the content distributor (for example, Netflix), which movie a user is watching. Popcorn tailors PIR protocols to the media domain. It amortizes the server-side overhead of PIR by batching requests from the large number of concurrent users retrieving content at any given time; and, it forms large batches without introducing playback delays by leveraging the properties of media streaming. Popcorn is consistent with the prevailing commercial regime (copyrights, etc.), and its per-request dollar cost is 3.87 times that of a non-private system. The other system described in this dissertation, Pretzel, is an email system that encrypts emails end-to-end between senders and intended recipients, but allows the email service provider to perform content-based spam filtering and targeted advertising. Pretzel refines a 2PC protocol. It reduces the resource consumption of the protocol by replacing the underlying encryption scheme with a more efficient one, applying a packing technique to conserve invocations of the encryption algorithm, and pruning the inputs to the protocol. Pretzel's costs, versus a legacy non-private implementation, are estimated to be up to 5.4 times for the email provider, with additional but modest client-side requirements. Popcorn and Pretzel have fundamental connections. For instance, the cryptographic protocols in both systems securely compute vector-matrix products. However, we observe that differences in the vector and matrix dimensions lead to different system designs. Ultimately, both systems represent a potentially appealing compromise: sacrifice some functionality to build in strong privacy properties at affordable costs.Computer Science

Texas ScholarWorks

Statistical Analysis of Spherical Data: Clustering, Feature Selection and Applications

Author: Amayri Ola
Publication venue
Publication date: 28/09/2014
Field of study

In the light of interdisciplinary applications, data to be studied and analyzed have witnessed a growth in volume and change in their intrinsic structure and type. In other words, in practice the diversity of resources generating objects have imposed several challenges for decision maker to determine informative data in terms of time, model capability, scalability and knowledge discovery. Thus, it is highly desirable to be able to extract patterns of interest that support the decision of data management. Clustering, among other machine learning approaches, is an important data engineering technique that empowers the automatic discovery of similar object’s clusters and the consequent assignment of new unseen objects to appropriate clusters. In this context, the majority of current research does not completely address the true structure and nature of data for particular application at hand. In contrast to most previous research, our proposed work focuses on the modeling and classification of spherical data that are naturally generated in many data mining and knowledge discovery applications. Thus, in this thesis we propose several estimation and feature selection frameworks based on Langevin distribution which are devoted to spherical patterns in offline and online settings. In this thesis, we first formulate a unified probabilistic framework, where we build probabilistic kernels based on Fisher score and information divergences from finite Langevin mixture for Support Vector Machine. We are motivated by the fact that the blending of generative and discriminative approaches has prevailed by exploring and adopting distinct characteristic of each approach toward constructing a complementary system combining the best of both. Due to the high demand to construct compact and accurate statistical models that are automatically adjustable to dynamic changes, next in this thesis, we propose probabilistic frameworks for high-dimensional spherical data modeling based on finite Langevin mixtures that allow simultaneous clustering and feature selection in offline and online settings. To this end, we adopted finite mixture models which have long been heavily relied on deterministic learning approaches such as maximum likelihood estimation. Despite their successful utilization in wide spectrum of areas, these approaches have several drawbacks as we will discuss in this thesis. An alternative approach is the adoption of Bayesian inference that naturally addresses data uncertainty while ensuring good generalization. To address this issue, we also propose a Bayesian approach for finite Langevin mixture model estimation and selection. When data change dynamically and grow drastically, finite mixture is not always a feasible solution. In contrast with previous approaches, which suppose an unknown finite number of mixture components, we finally propose a nonparametric Bayesian approach which assumes an infinite number of components. We further enhance our model by simultaneously detecting informative features in the process of clustering. Through extensive empirical experiments, we demonstrate the merits of the proposed learning frameworks on diverse high dimensional datasets and challenging real-world applications

Concordia University Research Repository

1 SpamBayes A TREC along the Spam Track with SpamBayes

Author: Tony Andrew Meyer
Publication venue
Publication date
Field of study

This paper describes the SpamBayes submissions made to the Spam Track of the 2005 Text Retrieval Conference (TREC). SpamBayes is briefly introduced, but the paper focuses more on how the submissions differ from the standard installation. Unlike in the majority of earlier publications evaluating the effectiveness of SpamBayes, the fundamental ‘unsure ’ range is discussed, and the method of removing the range is outlined. Finally, an analysis of the results of the running the four submissions through the Spam Track ‘jig ’ with the three private corpora and one public corpus is made. SpamBayes [1] was born on August 19th 2002, soon after publication of A Plan for Spam [2]; Tim Peters and others involved with the Python development community developed code based on Graham’s ideas, with the initial aim of filtering python.org mailing-list traffic, although this quickly progressed to also filtering personal email streams (today, although most python.org mailing lists do use SpamBayes, the overwhelmingly most common use of SpamBayes is for personal email filtering). Although the project initially started with Graham’s original combining scheme, it currently uses the chi-squared combining scheme developed by Robinson. The SpamBayes distribution includes a plug-in for Microsoft ™ Outlook™, a generic POP3 proxy and IMAP4 filter, various command-line scripts, and a suite of tools for testing modifications of the tokenization/classification systems. The software is free 1, released under the Python Software Foundation Licence. A basic introduction to the distribution can be found in [3]. The testing outlined in this paper used version 1.1a1; the only modifications were adjustments of the client/server architecture to improve the speed of the TREC testing

CiteSeerX