618 research outputs found
"May I borrow Your Filter?" Exchanging Filters to Combat Spam in a Community
Leveraging social networks in computer systems can be effective in dealing with a number of trust and security issues. Spam is one such issue where the "wisdom of crowds" can be harnessed by mining the collective knowledge of ordinary individuals. In this paper, we present a mechanism through which members of a virtual community can exchange information to combat spam. Previous attempts at collaborative spam filtering have concentrated on digest-based indexing techniques to share digests or fingerprints of emails that are known to be spam. We take a different approach and allow users to share their spam filters instead, thus dramatically reducing the amount of traffic generated in the network. The resultant diversity in the filters and cooperation in a community allows it to respond to spam in an autonomic fashion. As a test case for exchanging filters we use the popular SpamAssassin spam filtering software and show that exchanging spam filters provides an alternative method to improve spam filtering performance
BlogForever: D2.5 Weblog Spam Filtering Report and Associated Methodology
This report is written as a first attempt to define the BlogForever spam detection strategy. It comprises a survey of weblog spam technology and approaches to its detection. While the report was written to help identify possible approaches to spam detection as a component within the BlogForever software, the discussion has been extended to include observations related to the historical, social and practical value of spam, and proposals of other ways of dealing with spam within the repository without necessarily removing it. It contains a general overview of spam types, ready-made anti-spam APIs available for weblogs, possible methods that have been suggested for preventing the introduction of spam into a blog, and research related to spam, focusing on work that appears in the weblog context. It concludes with a proposal for a spam detection workflow that might form the basis for the spam detection component of the BlogForever software.
MapReduce based RDF assisted distributed SVM for high throughput spam filtering
This thesis was submitted for the degree of Doctor of Philosophy and was awarded by Brunel University. Electronic mail has become cast and embedded in our everyday lives. Billions of legitimate emails are sent on a daily basis. The widely established underlying infrastructure, its widespread availability as well as its ease of use have all acted as catalysts to such pervasive proliferation. Unfortunately, the same can be alleged about unsolicited bulk email, or rather spam. Various methods, as well as enabling architectures, are available to try to mitigate spam permeation. In this respect, this dissertation complements existing survey work in this area by contributing an extensive literature review of traditional and emerging spam filtering approaches. Techniques, approaches and architectures employed for spam filtering are appraised, critically assessing respective strengths and weaknesses.
Velocity, volume and variety are key characteristics of the spam challenge. MapReduce (M/R) has become increasingly popular as an Internet scale, data intensive processing platform. In the context of machine learning based spam filter training, support vector machine (SVM) based techniques have been proven effective. SVM training is however a computationally intensive process. In this dissertation, a M/R based distributed SVM algorithm for scalable spam filter training, designated MRSMO, is presented. By distributing and processing subsets of the training data across multiple participating computing nodes, the distributed SVM reduces spam filter training time significantly. To mitigate the accuracy degradation introduced by the adopted approach, a Resource Description Framework (RDF) based feedback loop is evaluated. Experimental results demonstrate that this improves the accuracy levels of the distributed SVM beyond the original sequential counterpart.
Effectively exploiting large scale, ‘Cloud’ based, heterogeneous processing capabilities for M/R in what can be considered a non-deterministic environment requires the consideration of a number of perspectives. In this work, gSched, a Hadoop M/R based, heterogeneous aware task to node matching and allocation scheme is designed. Using MRSMO as a baseline, experimental evaluation indicates that gSched improves on the performance of the out-of-the box Hadoop counterpart in a typical Cloud based infrastructure.
The focal contribution to knowledge is a scalable, heterogeneous infrastructure and machine learning based spam filtering scheme, able to capitalize on collaborative accuracy improvements through RDF based, end user feedback.
System and Methods for Detecting Unwanted Voice Calls
Voice over IP (VoIP) is a key enabling technology for the migration of circuit-switched PSTN architectures to packet-based IP networks. However, this migration is successful only if the present problems in IP networks are addressed before deploying VoIP infrastructure on a large scale. One of the important issues that the present VoIP networks face is the problem of unwanted calls commonly referred to as SPIT (spam over Internet telephony). Mostly, these SPIT calls are from unknown callers who broadcast unwanted calls. There may be unwanted calls from legitimate and known people too. In this case, the unwantedness depends on social proximity of the communicating parties. For detecting these unwanted calls, I propose a framework that analyzes incoming calls for unwanted behavior. The framework includes a VoIP spam detector (VSD) that analyzes incoming VoIP calls for spam behavior using trust and reputation techniques. The framework also includes a nuisance detector (ND) that proactively infers the nuisance (or reluctance of the end user) to receive incoming calls. This inference is based on past mutual behavior between the calling and the called party (i.e., caller and callee), the callee's presence (mood or state of mind) and tolerance in receiving voice calls from the caller, and the social closeness between the caller and the callee. The VSD and ND learn the behavior of callers over time and estimate the possibility of the call to be unwanted based on predetermined thresholds configured by the callee (or the filter administrators). These threshold values have to be automatically updated for integrating dynamic behavioral changes of the communicating parties. For updating these threshold values, I propose an automatic calibration mechanism using receiver operating characteristics curves (ROC). The VSD and ND use this mechanism for dynamically updating thresholds for optimizing their accuracy of detection. 
In addition to unwanted calls to the callees in a VoIP network, there can be unwanted traffic coming into a VoIP network that attempts to compromise VoIP network devices. Intelligent hackers can create malicious VoIP traffic for disrupting network activities. Hence, there is a need to frequently monitor the risk levels of critical network infrastructure. Towards realizing this objective, I describe a network level risk management mechanism that prioritizes resources in a VoIP network. The prioritization scheme involves an adaptive re-computation model of risk levels using attack graphs and Bayesian inference techniques. All the above techniques collectively account for a domain-level VoIP security solution
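The ROC-based threshold calibration mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each past call has a spam score in [0, 1] and a callee-provided label (1 = unwanted), and it picks the threshold that maximizes TPR minus FPR (Youden's J). That selection rule, the function names and the sample data are assumptions made for illustration; the abstract does not state which criterion the VSD and ND use.

```python
def roc_points(scores, labels):
    """Return (threshold, TPR, FPR) for every distinct score used as a cut-off."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in sorted(set(scores)):
        # A call is flagged as unwanted when its score is at or above t.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, tp / positives, fp / negatives))
    return points


def calibrate_threshold(scores, labels):
    """Pick the threshold with the largest TPR - FPR (assumed criterion)."""
    best = max(roc_points(scores, labels), key=lambda p: p[1] - p[2])
    return best[0]


# Hypothetical call history: spam scores and feedback labels (1 = unwanted).
call_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
call_labels = [0, 0, 0, 0, 1, 0, 1, 1]

threshold = calibrate_threshold(call_scores, call_labels)
print(threshold)
```

Re-running `calibrate_threshold` as new feedback arrives is one simple way to keep the threshold in step with changing caller behavior.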
Tracking Concept Drift at Feature Selection Stage in SpamHunting: An Anti-spam Instance-Based Reasoning System
In this paper we propose a novel feature selection method able to handle concept drift problems in the spam filtering domain. The proposed technique is applied to a previous successful instance-based reasoning e-mail filtering system called SpamHunting. Our information criterion is based on several ideas extracted from the well-known information measure introduced by Shannon. We show how our previous system, in combination with the improved feature selection method, outperforms classical machine learning techniques and other well-known lazy learning approaches. In order to evaluate the performance of all the analysed models, we employ two different corpora and six well-known metrics in various scenarios.
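As a rough illustration of a Shannon-style selection criterion, the sketch below ranks terms by standard information gain over a small labelled set of messages. This is a generic stand-in, not the criterion used in SpamHunting, which the abstract does not define; the function names and toy messages are hypothetical.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * math.log2(p)
    return result


def information_gain(docs, labels, term):
    """Entropy reduction obtained by splitting messages on presence of term."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_terms(docs, labels, k):
    """Return the k terms with the highest information gain."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(vocabulary,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:k]


# Toy corpus: each message is a set of terms (hypothetical data).
messages = [{"free", "offer"}, {"free", "meeting"},
            {"meeting", "agenda"}, {"offer", "win"}]
classes = ["spam", "ham", "ham", "spam"]

print(select_terms(messages, classes, 2))
```

One plausible way to react to concept drift with such a criterion is to recompute the ranking over a sliding window of recent messages, so that terms which stop discriminating drop out of the selected set.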
Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting in the manual
definition of a classifier by domain experts) are a very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.
Comment: Accepted for publication in ACM Computing Surveys
Multi-dimensional clustering in user profiling
User profiling has attracted an enormous number of technological methods and
applications. With the increasing amount of products and services, user profiling
has created opportunities to catch the attention of the user as well as achieving
high user satisfaction. To provide the user what she/he wants, when and how,
depends largely on understanding them. The user profile is the representation of
the user and holds the information about the user. These profiles are the
outcome of the user profiling.
Personalization is the adaptation of the services to meet the user’s needs and
expectations. Therefore, the knowledge about the user leads to a personalized
user experience. In user profiling applications the major challenge is to build and
handle user profiles. In the literature there are two main user profiling methods,
collaborative and the content-based. Apart from these traditional profiling
methods, a number of classification and clustering algorithms have been used
to classify user related information to create user profiles. However, the profiling,
achieved through these works, is lacking in terms of accuracy. This is because,
all information within the profile has the same influence during the profiling even
though some are irrelevant user information.
In this thesis, a primary aim is to provide an insight into the concept of user
profiling. For this purpose a comprehensive background study of the literature
was conducted and summarized in this thesis. Furthermore, existing user
profiling methods as well as the classification and clustering algorithms were investigated. Being one of the objectives of this study, the use of these
algorithms for user profiling was examined. A number of classification and
clustering algorithms, such as Bayesian Networks (BN) and Decision Trees
(DTs) have been simulated using user profiles and their classification accuracy
performances were evaluated. Additionally, a novel clustering algorithm for the
user profiling, namely Multi-Dimensional Clustering (MDC), has been proposed.
The MDC is a modified version of the Instance Based Learner (IBL) algorithm.
In IBL every feature has an equal effect on the classification regardless of their
relevance. MDC differs from the IBL by assigning weights to feature values to
distinguish the effect of the features on clustering. Existing feature weighting
methods, for instance Cross Category Feature (CCF), have also been
investigated. In this thesis, three feature value weighting methods have been
proposed for the MDC. These methods are: MDC weight method by Cross
Clustering (MDC-CC), MDC weight method by Balanced Clustering (MDC-BC)
and MDC weight method by changing the Lower-limit to Zero (MDC-LZ). All of
these weighted MDC algorithms have been tested and evaluated. Additional
simulations were carried out with existing weighted and non-weighted IBL
algorithms (i.e. K-Star and Locally Weighted Learning (LWL)) in order to
demonstrate the performance of the proposed methods. Furthermore, a real life scenario is implemented to show how the MDC can be used for the user
profiling to improve personalized service provisioning in mobile environments.
The experiments presented in this thesis were conducted by using user profile
datasets that reflect the user’s personal information, preferences and interests.
The simulations with existing classification and clustering algorithms (e.g. Bayesian Networks (BN), Naïve Bayesian (NB), Lazy learning of Bayesian
Rules (LBR), Iterative Dichotomiser 3 (ID3)) were performed on the WEKA
(version 3.5.7) machine learning platform. WEKA serves as a workbench to
work with a collection of popular learning schemes implemented in JAVA. In
addition, the MDC-CC, MDC-BC and MDC-LZ have been implemented on
NetBeans IDE 6.1 Beta as a JAVA application and MATLAB. Finally, the real life
scenario is implemented as a Java Mobile Application (Java ME) on NetBeans
IDE 7.1. All simulation results were evaluated based on the error rate and
accuracy
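The difference between an unweighted instance-based learner and a feature-weighted one, as described in the abstract above, can be sketched as follows. This is a generic weighted nearest-neighbour classifier over categorical profile attributes, not the MDC algorithm itself: the MDC-CC, MDC-BC and MDC-LZ weighting formulas are not given in the abstract, so the weights, names and profiles below are hypothetical.

```python
def weighted_distance(a, b, weights):
    """Sum the weights of the features on which two profiles disagree."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def classify(query, instances, groups, weights):
    """Return the group of the stored profile closest to the query."""
    nearest = min(range(len(instances)),
                  key=lambda i: weighted_distance(query, instances[i], weights))
    return groups[nearest]


# Hypothetical profiles: (occupation, interest, device) and their groups.
profiles = [("student", "music", "android"),
            ("engineer", "sports", "ios"),
            ("student", "sports", "android")]
profile_groups = ["A", "B", "A"]
new_user = ("engineer", "music", "android")

equal_weights = [1.0, 1.0, 1.0]   # every feature counts the same (IBL-like)
tuned_weights = [3.0, 1.0, 0.2]   # occupation assumed most relevant

print(classify(new_user, profiles, profile_groups, equal_weights))  # A
print(classify(new_user, profiles, profile_groups, tuned_weights))  # B
```

With equal weights the new user is matched on interest and device, while the tuned weights let the assumed most relevant feature dominate, which changes the assigned group.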