1,104 research outputs found
Recommended from our members
MapReduce based RDF assisted distributed SVM for high throughput spam filtering
This thesis was submitted for the degree of Doctor of Philosophy and was awarded by Brunel UniversityElectronic mail has become cast and embedded in our everyday lives. Billions of legitimate emails are sent on a daily basis. The widely established underlying infrastructure, its widespread availability as well as its ease of use have all acted as catalysts to such pervasive proliferation. Unfortunately, the same can be alleged about unsolicited bulk email, or rather spam. Various methods, as well as enabling architectures are available to try to mitigate spam permeation. In this respect, this dissertation compliments existing survey work in this area by contributing an extensive literature review of traditional and emerging spam filtering approaches. Techniques, approaches and architectures employed for spam filtering are appraised, critically assessing respective strengths and weaknesses.
Velocity, volume and variety are key characteristics of the spam challenge. MapReduce (M/R) has become increasingly popular as an Internet scale, data intensive processing platform. In the context of machine learning based spam filter training, support vector machine (SVM) based techniques have been proven effective. SVM training is however a computationally intensive process. In this dissertation, a M/R based distributed SVM algorithm for scalable spam filter training, designated MRSMO, is presented. By distributing and processing subsets of the training data across multiple participating computing nodes, the distributed SVM reduces spam filter training time significantly. To mitigate the accuracy degradation introduced by the adopted approach, a Resource Description Framework (RDF) based feedback loop is evaluated. Experimental results demonstrate that this improves the accuracy levels of the distributed SVM beyond the original sequential counterpart.
Effectively exploiting large scale, ‘Cloud’ based, heterogeneous processing capabilities for M/R in what can be considered a non-deterministic environment requires the consideration of a number of perspectives. In this work, gSched, a Hadoop M/R based, heterogeneous aware task to node matching and allocation scheme is designed. Using MRSMO as a baseline, experimental evaluation indicates that gSched improves on the performance of the out-of-the box Hadoop counterpart in a typical Cloud based infrastructure.
The focal contribution to knowledge is a scalable, heterogeneous infrastructure and machine learning based spam filtering scheme, able to capitalize on collaborative accuracy improvements through RDF based, end user feedback. MapReduce based RDF Assisted Distributed SVM for High Throughput Spam Filterin
A reputation framework for behavioural history: developing and sharing reputations from behavioural history of network clients
The open architecture of the Internet has enabled its massive growth and success by facilitating easy connectivity between hosts. At the same time, the Internet has also opened itself up to abuse, e.g. arising out of unsolicited communication, both intentional and unintentional. It remains an open question as to how best servers should protect themselves from malicious clients whilst offering good service to innocent clients. There has been research on behavioural profiling and reputation of clients, mostly at the network level and also for email as an application, to detect malicious clients. However, this area continues to pose open research challenges. This thesis is motivated by the need for a generalised framework capable of aiding efficient detection of malicious clients while being able to reward clients with behaviour profiles conforming to the acceptable use and other relevant policies. The main contribution of this thesis is a novel, generalised, context-aware, policy independent, privacy preserving framework for developing and sharing client reputation based on behavioural history. The framework, augmenting existing protocols, allows fitting in of policies at various stages, thus keeping itself open and flexible to implementation. Locally recorded behavioural history of clients with known identities are translated to client reputations, which are then shared globally. The reputations enable privacy for clients by not exposing the details of their behaviour during interactions with the servers. The local and globally shared reputations facilitate servers in selecting service levels, including restricting access to malicious clients. We present results and analyses of simulations, with synthetic data and some proposed example policies, of client-server interactions and of attacks on our model. Suggestions presented for possible future extensions are drawn from our experiences with simulation
- …