12 research outputs found
A Pragmatic Approach to DHT Adoption
Despite the peer-to-peer community's obvious wish to have its systems adopted, specific mechanisms to facilitate incremental adoption have not yet received the same level of attention as the many other practical concerns associated with these systems. This paper argues that ease of adoption should be elevated to a first-class concern and accordingly presents HOLD, a front-end to existing DHTs that is optimized for incremental adoption. Specifically, HOLD is backwards-compatible: it leverages DNS to provide a key-based routing service to existing Internet hosts without requiring them to install any software. This paper also presents applications that could benefit from HOLD as well as the trade-offs that accompany HOLD. Early implementation experience suggests that HOLD is practical
On the use of Locality for Improving SVM-Based Spam Filtering
Recent growths in the use of email for communication and the corresponding growths in the volume of email received have made automatic processing of emails desirable. In tandem is the prevailing problem of Advance Fee fraud E-mails that pervades inboxes globally. These genres of e-mails solicit for financial transactions and funds transfers from unsuspecting users. Most modern mail-reading software packages provide some forms of programmable automatic filtering, typically in the form of sets of rules that file or otherwise dispose mails based on keywords detected in the headers or message body. Unfortunately programming these filters is an arcane and sometimes inefficient process. An adaptive mail system which can learn its users’ mail sorting preferences would therefore be more desirable. Premised on the work of Blanzieri & Bryl (2007), we proposes a framework dedicated to the phenomenon of locality in email data analysis of advance fee fraud e-mails which engages Support Vector Machines (SVM) classifier for building local decision rules into the classification process of the spam filter design for this genre of e-mails
Let Your CyberAlter Ego Share Information and Manage Spam
Almost all of us have multiple cyberspace identities, and these {\em
cyber}alter egos are networked together to form a vast cyberspace social
network. This network is distinct from the world-wide-web (WWW), which is being
queried and mined to the tune of billions of dollars everyday, and until
recently, has gone largely unexplored. Empirically, the cyberspace social
networks have been found to possess many of the same complex features that
characterize its real counterparts, including scale-free degree distributions,
low diameter, and extensive connectivity. We show that these topological
features make the latent networks particularly suitable for explorations and
management via local-only messaging protocols. {\em Cyber}alter egos can
communicate via their direct links (i.e., using only their own address books)
and set up a highly decentralized and scalable message passing network that can
allow large-scale sharing of information and data. As one particular example of
such collaborative systems, we provide a design of a spam filtering system, and
our large-scale simulations show that the system achieves a spam detection rate
close to 100%, while the false positive rate is kept around zero. This system
has several advantages over other recent proposals (i) It uses an already
existing network, created by the same social dynamics that govern our daily
lives, and no dedicated peer-to-peer (P2P) systems or centralized server-based
systems need be constructed; (ii) It utilizes a percolation search algorithm
that makes the query-generated traffic scalable; (iii) The network has a built
in trust system (just as in social networks) that can be used to thwart
malicious attacks; iv) It can be implemented right now as a plugin to popular
email programs, such as MS Outlook, Eudora, and Sendmail.Comment: 13 pages, 10 figure
Resolving FP-TP Conflict in Digest-Based Collaborative Spam Detection by Use of Negative Selection Algorithm
A well-known approach for collaborative spam filtering is to determine which emails belong to the same bulk, e.g. by exploiting their content similarity. This allows, after observing an initial portion of a bulk, for the bulkiness scores to be assigned to the remaining emails from the same bulk. This also allows the individual evidence of spamminess to be joined, if such evidence is generated by collaborating filters or users for some of the emails from an initial portion of the bulk. Usually a database of previously observed emails or email digests is formed and queried upon receiving new emails. Previous evaluations [2,10] of the approach based on the email digests that preserve email content similarity indicate and partially demonstrate that there are ways to make the approach robust to increased obfuscation efforts by spammers. However, for the settings of the parameters that provide good matching between the emails from the same bulk, the unwanted random matching between ham emails and unrelated ham and spam emails stays rather high. This directly translates into a need for use of higher bulkiness thresholds in order to ensure low false positive (FP) detection of ham, which implies that larger initial parts of spam bulks will not be filtered, i.e. true positive (TP) detection will not be very high (FP-TP conflict). In this paper we demonstrate how, by use of the negative selection algorithm, the unwanted random matching between unrelated emails may be decreased at least by an order of magnitude, while preserving the same good matching between the emails from the same bulk. We also show how this translates into an order of magnitude (at least) of less undetected bulky spam emails, under the same ham miss- detection requirements
Blacklight: Defending Black-Box Adversarial Attacks on Deep Neural Networks
The vulnerability of deep neural networks (DNNs) to adversarial examples is
well documented. Under the strong white-box threat model, where attackers have
full access to DNN internals, recent work has produced continual advancements
in defenses, often followed by more powerful attacks that break them.
Meanwhile, research on the more realistic black-box threat model has focused
almost entirely on reducing the query-cost of attacks, making them increasingly
practical for ML models already deployed today.
This paper proposes and evaluates Blacklight, a new defense against black-box
adversarial attacks. Blacklight targets a key property of black-box attacks: to
compute adversarial examples, they produce sequences of highly similar images
while trying to minimize the distance from some initial benign input. To detect
an attack, Blacklight computes for each query image a compact set of one-way
hash values that form a probabilistic fingerprint. Variants of an image produce
nearly identical fingerprints, and fingerprint generation is robust against
manipulation. We evaluate Blacklight on 5 state-of-the-art black-box attacks,
across a variety of models and classification tasks. While the most efficient
attacks take thousands or tens of thousands of queries to complete, Blacklight
identifies them all, often after only a handful of queries. Blacklight is also
robust against several powerful countermeasures, including an optimal black-box
attack that approximates white-box attacks in efficiency. Finally, Blacklight
significantly outperforms the only known alternative in both detection coverage
of attack queries and resistance against persistent attackers
Approximate Object Location and Spam Filtering on Peer-to-peer Systems
Recent work in P2P overlay networks allow for decentralized object location and routing (DOLR) across networks based on unique IDs. In this paper, we propose an extension to DOLR systems to publish objects using generic feature vectors instead of content-hashed GUIDs, which enables the systems to locate similar objects. We discuss the design of a distributed text similarity engine, named Approximate Text Addressing (ATA), built on top of this extension that locates objects by their text descriptions. We then outline the design and implementation of a motivating application on ATA, a decentralized spam-filtering service
Recommended from our members
MapReduce based RDF assisted distributed SVM for high throughput spam filtering
This thesis was submitted for the degree of Doctor of Philosophy and was awarded by Brunel UniversityElectronic mail has become cast and embedded in our everyday lives. Billions of legitimate emails are sent on a daily basis. The widely established underlying infrastructure, its widespread availability as well as its ease of use have all acted as catalysts to such pervasive proliferation. Unfortunately, the same can be alleged about unsolicited bulk email, or rather spam. Various methods, as well as enabling architectures are available to try to mitigate spam permeation. In this respect, this dissertation compliments existing survey work in this area by contributing an extensive literature review of traditional and emerging spam filtering approaches. Techniques, approaches and architectures employed for spam filtering are appraised, critically assessing respective strengths and weaknesses.
Velocity, volume and variety are key characteristics of the spam challenge. MapReduce (M/R) has become increasingly popular as an Internet scale, data intensive processing platform. In the context of machine learning based spam filter training, support vector machine (SVM) based techniques have been proven effective. SVM training is however a computationally intensive process. In this dissertation, a M/R based distributed SVM algorithm for scalable spam filter training, designated MRSMO, is presented. By distributing and processing subsets of the training data across multiple participating computing nodes, the distributed SVM reduces spam filter training time significantly. To mitigate the accuracy degradation introduced by the adopted approach, a Resource Description Framework (RDF) based feedback loop is evaluated. Experimental results demonstrate that this improves the accuracy levels of the distributed SVM beyond the original sequential counterpart.
Effectively exploiting large scale, ‘Cloud’ based, heterogeneous processing capabilities for M/R in what can be considered a non-deterministic environment requires the consideration of a number of perspectives. In this work, gSched, a Hadoop M/R based, heterogeneous aware task to node matching and allocation scheme is designed. Using MRSMO as a baseline, experimental evaluation indicates that gSched improves on the performance of the out-of-the box Hadoop counterpart in a typical Cloud based infrastructure.
The focal contribution to knowledge is a scalable, heterogeneous infrastructure and machine learning based spam filtering scheme, able to capitalize on collaborative accuracy improvements through RDF based, end user feedback. MapReduce based RDF Assisted Distributed SVM for High Throughput Spam Filterin
Designs and Analyses in Structured Peer-To-Peer Systems
Peer-to-Peer (P2P) computing is a recent hot topic in the areas of networking and distributed systems. Work on P2P computing was triggered by a number of ad-hoc systems that made the concept popular. Later, academic research efforts started to investigate P2P computing issues based on scientific principles. Some of that research produced a number of structured P2P systems that were collectively referred to by the term "Distributed Hash Tables" (DHTs). However, the research occurred in a diversified way leading to the appearance of similar concepts yet lacking a common perspective and not heavily analyzed. In this thesis we present a number of papers representing our research results in the area of structured P2P systems grouped as two sets labeled respectively "Designs" and "Analyses".
The contribution of the first set of papers is as follows. First, we present the princi- ple of distributed k-ary search and argue that it serves as a framework for most of the recent P2P systems known as DHTs. That is, given this framework, understanding existing DHT systems is done simply by seeing how they are instances of that frame- work. We argue that by perceiving systems as instances of that framework, one can optimize some of them. We illustrate that by applying the framework to the Chord system, one of the most established DHT systems. Second, we show how the frame- work helps in the design of P2P algorithms by two examples: (a) The DKS(n; k; f) system which is a system designed from the beginning on the principles of distributed k-ary search. (b) Two broadcast algorithms that take advantage of the distributed k-ary search tree.
The contribution of the second set of papers is as follows. We account for two approaches that we used to evaluate the performance of a particular class of DHTs, namely the one adopting periodic stabilization for topology maintenance. The first approach was of an intrinsic empirical nature. In this approach, we tried to perceive a DHT as a physical system and account for its properties in a size-independent manner. The second approach was of a more analytical nature. In this approach, we applied the technique of Master Equations, which is a widely used technique in the analysis of natural systems. The application of the technique lead to a highly accurate description of the behavior of structured overlays. Additionally, the thesis contains a primer on structured P2P systems that tries to capture the main ideas prevailing in the field