200 research outputs found
ReP-ETD: A Repetitive Preprocessing technique for Embedded Text Detection from images in spam emails
Email service proves to be a convenient and powerful communication tool. As the internet continues to grow, the type of information available to users has shifted from text-only to multimedia-enriched. Embedded text in multimedia content is one of the prevalent means of delivering messages to content viewers. With the increasing importance of email and the incursions of internet marketers, spam has become a major problem and has given rise to large volumes of unwanted mail. Spammers are continuously adopting new techniques to evade detection. Image spam is one such technique, wherein text embedded within images carries the main information of the spam message instead of text-based spam. Image spam is currently estimated to account for roughly 50% of all spam traffic and is still on the rise, making it a serious research issue. Filtering mail is one of the popular approaches used to block spam. This work proposes a new model, ReP-ETD (Repetitive Pre-processing technique for Embedded Text Detection), for efficiently and accurately detecting spam in email images. The performance of the proposed ReP-ETD model has been evaluated across the identified parameters and compared with other existing models. The simulation results demonstrate the effectiveness of the proposed model.
A review of spam email detection: analysis of spammer strategies and the dataset shift problem
Spam emails have traditionally been seen as just annoying, unsolicited emails containing advertisements, but they increasingly include scams, malware or phishing. In order to ensure security and integrity for users, organisations and researchers aim to develop robust filters for spam email detection. Recently, most spam filters based on machine learning algorithms published in academic journals report very high performance, but users are still reporting a rising number of frauds and attacks via spam emails. Two main challenges can be found in this field: (a) it is a very dynamic environment prone to the dataset shift problem, and (b) it suffers from the presence of an adversarial figure, i.e. the spammer. Unlike classical spam email reviews, this one is particularly focused on the problems that this constantly changing environment poses. Moreover, we analyse the different spammer strategies used for contaminating the emails, and we review the state-of-the-art techniques to develop filters based on machine learning. Finally, we empirically evaluate and present the consequences of ignoring the matter of dataset shift in this practical field. Experimental results show that this shift may lead to severe degradation in the estimated generalisation performance, with error rates reaching values up to 48.81%.
Open-access publication funded by the Consorcio de Bibliotecas Universitarias de Castilla y León (BUCLE), under Operational Programme 2014ES16RFOP009 FEDER 2014-2020 de Castilla y León, Action: 20007-CL - Apoyo Consorcio BUCL
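The dataset shift problem the review highlights can be illustrated with a small synthetic sketch (not from the paper; all feature distributions and numbers are invented): a filter trained on "old" spam looks accurate on held-out data from the same distribution, yet degrades sharply once the spammer drifts the feature distribution.

```python
# Illustrative sketch of dataset shift in spam filtering.
# Synthetic data only; the distributions below are assumptions, not the paper's.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_emails(n, spam_mean):
    """Ham features centred at 0, spam features centred at spam_mean."""
    ham = rng.normal(0.0, 1.0, size=(n, 5))
    spam = rng.normal(spam_mean, 1.0, size=(n, 5))
    X = np.vstack([ham, spam])
    y = np.array([0] * n + [1] * n)
    return X, y

# Train on "old" spam; the spammer then adapts, shifting the spam distribution.
X_train, y_train = make_emails(500, spam_mean=2.0)
X_iid, y_iid = make_emails(500, spam_mean=2.0)      # same distribution
X_shift, y_shift = make_emails(500, spam_mean=0.7)  # after spammer adaptation

clf = LogisticRegression().fit(X_train, y_train)
acc_iid = clf.score(X_iid, y_iid)
acc_shift = clf.score(X_shift, y_shift)
print(f"accuracy without shift: {acc_iid:.3f}, under shift: {acc_shift:.3f}")
```

The estimated generalisation performance (the i.i.d. score) overstates what the filter achieves once the environment changes, which is exactly the effect the review quantifies.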
Camouflages and Token Manipulations-The Changing Faces of the Nigerian Fraudulent 419 Spammers
The inefficiency of current spam filters against fraudulent (419) mails is not unrelated to spammers' use of good-word attacks, topic drift, parasitic spamming, and wrong categorization and recategorization of electronic mail by e-mail clients, and of course the fuzzy factors of greed and gullibility on the part of recipients who respond to fraudulent spam mail offers. In this paper, we establish that mail token manipulation remains, above any other tactic, the most potent tool used by Nigerian scammers to fool statistical spam filters. While we hope that uncovering this manipulative evidence will prove useful in future antispam research, our findings also alert spam filter developers to the need to build into their antispam architectures robust modules that can deal with the identified camouflages.
Experimental Approach Based on Ensemble and Frequent Itemsets Mining for Image Spam Filtering
Excessive amounts of image spam cause many problems for e-mail users. Since image spam is difficult to detect using conventional text-based spam approaches, various image processing techniques have been proposed. In this paper, we present an ensemble method using frequent itemset mining (FIM) for filtering image spam. Although FIM techniques are well established in data mining, they are not commonly used in ensemble methods. In order to obtain good filtering performance, a SIFT descriptor is used, since it is widely known as an effective image descriptor. K-means clustering is applied to the SIFT keypoints to produce a visual codebook. The bag-of-words (BOW) feature vector for each image is generated using a hard bag-of-features (HBOF) approach. FIM descriptors are obtained from the frequent itemsets of the BOW feature vectors. We combine BOW and FIM with three other feature selection methods, namely Information Gain (IG), Symmetrical Uncertainty (SU) and Chi Square (CS), together with a Spatial Pyramid, in an ensemble method. We have performed experiments on the Dredze and SpamArchive datasets. The results show that our ensemble using frequent itemset mining significantly outperforms the traditional BOW approach and the naive approach that combines all descriptors directly in a single very large input vector.
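The codebook step described above (k-means over SIFT keypoints, then hard assignment into a bag-of-words histogram) can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the descriptors are random stand-ins for SIFT output, and the codebook size is an arbitrary choice.

```python
# Hard bag-of-features (HBOF) sketch: cluster local descriptors into a visual
# codebook, then represent each image as a histogram of nearest codewords.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Stand-in for SIFT output: each image yields a variable number of
# 128-dimensional keypoint descriptors.
images = [rng.random((int(rng.integers(20, 40)), 128)) for _ in range(10)]

k = 16  # codebook size (a tuning parameter in practice)
codebook = KMeans(n_clusters=k, n_init=10, random_state=0)
codebook.fit(np.vstack(images))  # pool all descriptors to learn the codebook

def bow_vector(descriptors):
    """Hard assignment: count how many descriptors fall in each cluster."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()  # normalise so images of any size are comparable

bow = np.array([bow_vector(d) for d in images])
print(bow.shape)  # one k-dimensional BOW feature vector per image
```

The FIM descriptors in the paper would then be mined from frequent itemsets over these BOW vectors; the sketch stops at the BOW stage.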
MapReduce based RDF assisted distributed SVM for high throughput spam filtering
This thesis was submitted for the degree of Doctor of Philosophy and was awarded by Brunel University.
Electronic mail has become deeply embedded in our everyday lives. Billions of legitimate emails are sent on a daily basis. The widely established underlying infrastructure, its widespread availability, as well as its ease of use have all acted as catalysts to such pervasive proliferation. Unfortunately, the same can be alleged about unsolicited bulk email, or rather spam. Various methods, as well as enabling architectures, are available to try to mitigate spam permeation. In this respect, this dissertation complements existing survey work in this area by contributing an extensive literature review of traditional and emerging spam filtering approaches. Techniques, approaches and architectures employed for spam filtering are appraised, critically assessing their respective strengths and weaknesses.
Velocity, volume and variety are key characteristics of the spam challenge. MapReduce (M/R) has become increasingly popular as an Internet scale, data intensive processing platform. In the context of machine learning based spam filter training, support vector machine (SVM) based techniques have been proven effective. SVM training is however a computationally intensive process. In this dissertation, a M/R based distributed SVM algorithm for scalable spam filter training, designated MRSMO, is presented. By distributing and processing subsets of the training data across multiple participating computing nodes, the distributed SVM reduces spam filter training time significantly. To mitigate the accuracy degradation introduced by the adopted approach, a Resource Description Framework (RDF) based feedback loop is evaluated. Experimental results demonstrate that this improves the accuracy levels of the distributed SVM beyond the original sequential counterpart.
Effectively exploiting large scale, ‘Cloud’ based, heterogeneous processing capabilities for M/R in what can be considered a non-deterministic environment requires the consideration of a number of perspectives. In this work, gSched, a Hadoop M/R based, heterogeneous aware task to node matching and allocation scheme is designed. Using MRSMO as a baseline, experimental evaluation indicates that gSched improves on the performance of the out-of-the box Hadoop counterpart in a typical Cloud based infrastructure.
The focal contribution to knowledge is a scalable, heterogeneous infrastructure and machine learning based spam filtering scheme, able to capitalize on collaborative accuracy improvements through RDF based end-user feedback.
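The partition-and-combine idea behind distributed SVM training can be sketched in a map/reduce style. This is an illustrative sketch only, not the MRSMO algorithm (which distributes SMO itself and adds an RDF feedback loop); here each "node" trains a linear SVM on its partition and the reduce step simply averages the resulting weight vectors, a simplification that trades some accuracy for training time, as the abstract notes.

```python
# Map/reduce-style distributed linear SVM sketch (illustrative, not MRSMO).
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.normal(size=(1200, 20))
w_true = rng.normal(size=20)
y = (X @ w_true > 0).astype(int)  # synthetic linearly separable labels

def map_train(partition):
    """Map step: fit a linear SVM on one partition of the training data."""
    Xp, yp = partition
    clf = LinearSVC(dual=True, max_iter=5000).fit(Xp, yp)
    return clf.coef_.ravel(), clf.intercept_[0]

def reduce_average(models):
    """Reduce step: combine per-partition models by averaging parameters."""
    ws, bs = zip(*models)
    return np.mean(ws, axis=0), np.mean(bs)

partitions = [(X[i::4], y[i::4]) for i in range(4)]  # data split over 4 "nodes"
w, b = reduce_average([map_train(p) for p in partitions])

acc = float(np.mean(((X @ w + b) > 0).astype(int) == y))
print(f"combined model training accuracy: {acc:.3f}")
```

In a real Hadoop deployment the map calls would run on separate nodes over HDFS splits; the sketch keeps everything in one process to show the data flow.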
Explainable Artificial Intelligence Applications in Cyber Security: State-of-the-Art in Research
This survey presents a comprehensive review of current literature on Explainable Artificial Intelligence (XAI) methods for cyber security applications. Due to the rapid development of Internet-connected systems and Artificial Intelligence in recent years, Artificial Intelligence, including Machine Learning and Deep Learning, has been widely utilized in cyber security fields such as intrusion detection, malware detection, and spam filtering. However, although Artificial Intelligence-based approaches for the detection of and defense against cyber attacks and threats are more advanced and efficient than conventional signature-based and rule-based cyber security strategies, most Machine Learning-based and Deep Learning-based techniques are deployed in a “black-box” manner, meaning that security experts and customers are unable to explain how such procedures reach particular conclusions. The lack of transparency and interpretability of existing Artificial Intelligence techniques decreases human users’ confidence in the models utilized for defense against cyber attacks, especially now that cyber attacks are becoming increasingly diverse and complicated. Therefore, it is essential to apply XAI in the establishment of cyber security models to create more explainable models while maintaining high accuracy, allowing human users to comprehend, trust, and manage the next generation of cyber defense mechanisms. Although there are papers reviewing Artificial Intelligence applications in cyber security and a vast literature on applying XAI in many fields including healthcare, financial services, and criminal justice, there are surprisingly no survey research articles that concentrate on XAI applications in cyber security.
Therefore, the motivation behind this survey is to bridge the research gap by presenting a detailed and up-to-date survey of XAI approaches applicable to issues in the cyber security field. Our work is the first to propose a clear roadmap for navigating the XAI literature in the context of applications in cyber security.
MAXIMUM PHISH BAIT: TOWARDS FEATURE BASED DETECTION OF PHISHING USING MAXIMUM ENTROPY CLASSIFICATION TECHNIQUE
Several antiphishing methods have been employed with the primary task of automatically apprehending and ruling out or preventing phishing e-mail from entering users’ mail streams. Phishing attacks pose a great threat to internet users, and the extent can be enormous if left unchecked. Two major categories of techniques that have been shown to be useful for classifying e-mail messages automatically are the rule-based method, which classifies email using a set of heuristic rules, and the statistical approach, which models e-mails statistically, usually under a machine learning framework. The statistical methods have been found in the literature to outperform the rule-based method.
This study proposes the use of the Maximum Entropy Model, a generative model, and shows how it can be used in antiphishing tasks. The model-based features proposed by Bergholz et al. (2008) will also be adopted; these have been found to outperform the basic features proposed in previous studies. An experimental comparison of our approach with other generative and non-generative classifiers is also proposed. This approach is expected to perform comparably better than other methods, especially in the elimination of false positives.
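A maximum-entropy classifier over binary e-mail features can be prototyped compactly, since maximum-entropy classification coincides with logistic regression. This sketch is illustrative only: the feature names and toy labels are invented, not the Bergholz et al. model-based features.

```python
# Maximum-entropy (logistic regression) phishing classifier sketch.
# Features and labels below are invented toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary features per email:
# [has_suspicious_link, asks_for_password, spoofed_sender, urgent_language]
X = np.array([
    [1, 1, 1, 1],  # phishing
    [1, 1, 0, 1],  # phishing
    [0, 1, 1, 0],  # phishing
    [0, 0, 0, 0],  # legitimate
    [1, 0, 0, 0],  # legitimate
    [0, 0, 0, 1],  # legitimate
])
y = np.array([1, 1, 1, 0, 0, 0])

maxent = LogisticRegression().fit(X, y)
# P(phishing) for an obviously suspicious email vs. a benign one.
probs = maxent.predict_proba([[1, 1, 1, 0], [0, 0, 0, 0]])[:, 1]
print(probs)
```

Tuning the decision threshold on the predicted probability is one way to trade recall for the low false-positive rate the study emphasises.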
Hypersparse Neural Network Analysis of Large-Scale Internet Traffic
The Internet is transforming our society, necessitating a quantitative understanding of Internet traffic. Our team collects and curates the largest publicly available Internet traffic data, containing 50 billion packets. Utilizing a novel hypersparse neural network analysis of "video" streams of this traffic using 10,000 processors in the MIT SuperCloud reveals a new phenomenon: the importance of otherwise unseen leaf nodes and isolated links in Internet traffic. Our neural network approach further shows that a two-parameter modified Zipf-Mandelbrot distribution accurately describes a wide variety of source/destination statistics on moving sample windows ranging from 100,000 to 100,000,000 packets over collections that span years and continents. The inferred model parameters distinguish different network streams, and the model leaf parameter strongly correlates with the fraction of the traffic in different underlying network topologies. The hypersparse neural network pipeline is highly adaptable, and different network statistics and training models can be incorporated with simple changes to the image filter functions.
Comment: 11 pages, 10 figures, 3 tables, 60 citations; to appear in IEEE High Performance Extreme Computing (HPEC) 201
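The two-parameter Zipf-Mandelbrot form the abstract refers to is p(k) ∝ 1/(k + β)^α over ranks k = 1…n. The sketch below evaluates a normalised version; the parameter values are illustrative, whereas the paper infers α and β per network stream.

```python
# Two-parameter Zipf-Mandelbrot rank distribution sketch.
# alpha and beta values here are illustrative, not the paper's inferred values.
import numpy as np

def zipf_mandelbrot(n, alpha, beta):
    """Normalised probabilities p(k) ∝ 1/(k + beta)^alpha for ranks 1..n."""
    k = np.arange(1, n + 1)
    p = 1.0 / (k + beta) ** alpha
    return p / p.sum()

p = zipf_mandelbrot(n=1000, alpha=1.2, beta=2.0)
print(p[:3], p.sum())  # heavy-tailed head, normalised to 1
```

Increasing β flattens the head of the distribution while α controls the tail decay, which is how the fitted parameters can distinguish different source/destination statistics.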