MapReduce based RDF assisted distributed SVM for high throughput spam filtering
This thesis was submitted for the degree of Doctor of Philosophy and was awarded by Brunel University.
Electronic mail has become firmly embedded in our everyday lives. Billions of legitimate emails are sent on a daily basis. The widely established underlying infrastructure, its widespread availability, and its ease of use have all acted as catalysts for such pervasive proliferation. Unfortunately, the same can be said of unsolicited bulk email, or rather spam. Various methods, as well as enabling architectures, are available to try to mitigate spam permeation. In this respect, this dissertation complements existing survey work in this area by contributing an extensive literature review of traditional and emerging spam filtering approaches. Techniques, approaches and architectures employed for spam filtering are appraised, critically assessing their respective strengths and weaknesses.
Velocity, volume and variety are key characteristics of the spam challenge. MapReduce (M/R) has become increasingly popular as an Internet scale, data intensive processing platform. In the context of machine learning based spam filter training, support vector machine (SVM) based techniques have been proven effective. SVM training is, however, a computationally intensive process. In this dissertation, an M/R based distributed SVM algorithm for scalable spam filter training, designated MRSMO, is presented. By distributing and processing subsets of the training data across multiple participating computing nodes, the distributed SVM reduces spam filter training time significantly. To mitigate the accuracy degradation introduced by the adopted approach, a Resource Description Framework (RDF) based feedback loop is evaluated. Experimental results demonstrate that this improves the accuracy levels of the distributed SVM beyond the original sequential counterpart.
Effectively exploiting large scale, ‘Cloud’ based, heterogeneous processing capabilities for M/R in what can be considered a non-deterministic environment requires the consideration of a number of perspectives. In this work, gSched, a Hadoop M/R based, heterogeneity-aware task-to-node matching and allocation scheme, is designed. Using MRSMO as a baseline, experimental evaluation indicates that gSched improves on the performance of the out-of-the-box Hadoop counterpart in a typical Cloud based infrastructure.
The focal contribution to knowledge is a scalable, heterogeneous infrastructure and machine learning based spam filtering scheme, able to capitalize on collaborative accuracy improvements through RDF based, end user feedback.
Towards an Effective Organization-Wide Bulk Email System
Bulk email is widely used in organizations to communicate messages to
employees. It is an important tool in making employees aware of policies,
events, leadership updates, etc. However, in large organizations, the problem
of overwhelming communication is widespread. Ineffective organizational bulk
emails waste employees' time and organizations' money, and cause a lack of
awareness or compliance with organizations' missions and priorities. This
thesis focuses on improving organizational bulk email systems by 1) conducting
qualitative research to understand different stakeholders; 2) conducting field
studies to evaluate personalization's effects on getting employees to read bulk
messages; 3) designing tools to support communicators in evaluating bulk
emails. We performed these studies at the University of Minnesota, interviewing
25 employees (both senders and recipients), and including 317 participants in
total. We found that the university's current bulk email system is ineffective
as only 22% of the information communicated was retained by employees. To
encourage employees to read high-level information, we implemented a
multi-stakeholder personalization framework that mixed
important-to-organization messages with employee-preferred messages and
improved the studied bulk email's recognition rate by 20%. On the sender side,
we iteratively designed a prototype of a bulk email evaluation platform. In a
field evaluation, we found that bulk emails' message-level performance helped
communicators design bulk emails. We collected eye-tracking data and developed
a neural network technique that estimates how long each message is read from
recipients' browser interactions alone, which improved
the estimation accuracy to 73%. In summary, this work sheds light on how to
design organizational bulk email systems that communicate effectively and
respect different stakeholders' values.
Comment: PhD Thesis