910 research outputs found

    On the use of Locality for Improving SVM-Based Spam Filtering

    Recent growth in the use of email for communication, and the corresponding growth in the volume of email received, has made automatic processing of email desirable. In tandem is the prevailing problem of advance fee fraud e-mails that pervade inboxes globally. This genre of e-mail solicits financial transactions and funds transfers from unsuspecting users. Most modern mail-reading software packages provide some form of programmable automatic filtering, typically as sets of rules that file or otherwise dispose of mail based on keywords detected in the headers or message body. Unfortunately, programming these filters is an arcane and sometimes inefficient process. An adaptive mail system that can learn its users’ mail-sorting preferences would therefore be more desirable. Premised on the work of Blanzieri & Bryl (2007), we propose a framework dedicated to the phenomenon of locality in the analysis of advance fee fraud e-mails, which uses a Support Vector Machine (SVM) classifier to build local decision rules into the classification process of a spam filter designed for this genre of e-mail.
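The abstract does not spell out how the local decision rules are built. One common way to realise locality, assumed here purely for illustration, is to train a small SVM on only the k nearest neighbours of each incoming message. The sketch below uses plain numeric feature vectors, labels of +1 (spam) and -1 (legitimate), and a simple sub-gradient trainer; the function names and hyperparameters are hypothetical and not taken from the paper.

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Batch sub-gradient descent on the L2-regularised hinge loss.

    Labels must be +1 or -1. Hyperparameters are illustrative defaults.
    """
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Only samples inside the margin contribute to the hinge gradient.
            if yi * (dot(w, xi) + b) < 1:
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def local_svm_predict(X_train, y_train, query, k=5):
    """Classify `query` with an SVM trained only on its k nearest neighbours."""
    order = sorted(range(len(X_train)), key=lambda i: distance(X_train[i], query))
    nearest = order[:k]
    local_X = [X_train[i] for i in nearest]
    local_y = [y_train[i] for i in nearest]
    # If the neighbourhood is pure, no local model is needed.
    if len(set(local_y)) == 1:
        return local_y[0]
    w, b = train_linear_svm(local_X, local_y)
    return 1 if dot(w, query) + b >= 0 else -1
```

Training one model per query is expensive, so a practical filter would likely cache or approximate the neighbourhood search; that trade-off is outside the scope of this sketch.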

    Evolutionary symbiotic feature selection for email spam detection

    This work presents a symbiotic filtering approach enabling the exchange of relevant word features among different users in order to improve local anti-spam filters. The local spam filtering is based on a Content-Based Filtering strategy, where word frequencies are fed into a Naive Bayes learner. Several Evolutionary Algorithms are explored for feature selection, including the proposed symbiotic exchange of the most relevant features among different users. The experiments were conducted using a novel corpus based on the well-known Enron datasets mixed with recent spam. The obtained results show that the symbiotic approach is competitive. Fundação para a Ciência e a Tecnologia (FCT) - FCOMP-01-0124-FEDER-022674COMPET
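As a rough illustration of the local filter described above, where word frequencies are fed into a Naive Bayes learner, the following is a minimal multinomial Naive Bayes sketch with add-one smoothing. The whitespace tokenizer and all function names are assumptions made for this example; the evolutionary feature selection and the symbiotic exchange between users are not modelled.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()

def train_naive_bayes(messages, labels):
    """Collect per-class word frequencies and class document counts."""
    word_counts = {}
    vocab = set()
    for text, label in zip(messages, labels):
        tokens = tokenize(text)
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return {"word_counts": word_counts,
            "doc_counts": Counter(labels),
            "vocab": vocab}

def predict(model, text):
    """Return the class with the highest log posterior for `text`."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # class prior
        for token in tokenize(text):
            if token not in model["vocab"]:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

In a setting like the one the abstract describes, feature selection would restrict `vocab` to a chosen subset of words before prediction; here every training word is kept.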

    Email spam detection : a symbiotic feature selection approach fostered by evolutionary computation

    Post-print version (prior to journal publication). Electronic mail (email) is nowadays an essential communication service, widely used by most Internet users. One of the main problems affecting this service is the proliferation of unsolicited messages (usually called spam) which, despite the efforts of the research community, remains an inherent problem of this Internet service. From this perspective, this work proposes and explores a novel symbiotic feature selection approach that allows the exchange of relevant features among distinct collaborating users in order to improve the behavior of anti-spam filters. For this purpose, several Evolutionary Algorithms (EA) are explored as optimization engines able to enhance feature selection strategies within the anti-spam area. The proposed mechanisms are tested using a realistic incremental retraining evaluation procedure and a novel corpus based on the well-known Enron datasets mixed with recent spam data. The obtained results show that the proposed symbiotic approach is competitive, with the added advantage of preserving end users' privacy. The work of P. Cortez and P. Sousa was funded by FEDER, through the program COMPETE and the Portuguese Foundation for Science and Technology (FCT), within the project FCOMP-01-0124-FEDER-022674

    Detection of Offensive Tweets: A Comparative Study

    With its growing popularity, Twitter has become a major platform for posting views via tweets. Tweets contain useful and relevant content, but offensive content as well. More than a decade of research has resulted in numerous techniques and models to detect offensive content. However, little is known about lexically offensive and contextually offensive content. In this research paper, lexically offensive content is identified using two techniques: Rule-Based Naive Bayes (RNB) and a collaborative model of LDA with Naïve Bayes (LDANB). LDANB provides better results than RNB for lexically offensive tweet detection. Further, contextually offensive content is detected using a newly devised adjective-based approach, which gives better results than a cosine-similarity-based approach. To validate the results of the applied offensive tweet detection techniques, three performance metrics are used: precision, accuracy and recall.
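For reference, the three metrics named above can be computed from a confusion matrix as in the sketch below. The label strings and function names are hypothetical, and the data in the usage note is invented for illustration, not taken from the paper.

```python
def confusion_counts(y_true, y_pred, positive="offensive"):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def safe_div(num, den):
    """Return 0.0 instead of raising when a denominator is empty."""
    return num / den if den else 0.0

def evaluate(y_true, y_pred, positive="offensive"):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "precision": safe_div(tp, tp + fp),          # flagged tweets that were offensive
        "recall": safe_div(tp, tp + fn),             # offensive tweets that were flagged
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
    }
```

With five tweets of which three are truly offensive, and a detector that flags two of those three plus one clean tweet, `evaluate` gives precision 2/3, recall 2/3 and accuracy 3/5.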

    Investigation into the Application of Personality Insights and Language Tone Analysis in Spam Classification

    Due to its persistence, spam remains one of the biggest problems facing users and suppliers of email communication services. Machine learning techniques have been very successful at preventing many spam mails from arriving in user mailboxes; however, spam still accounts for over 50% of all emails sent. Despite this relative success, the economic cost of spam has been estimated as high as $50 billion in 2005 and more recently at $20 billion, so spam can still be considered a considerable problem. In essence, a spam email is a commercial communication trying to entice the receiver to take some positive action. This project uses the text from emails to create personality insight and language tone scores through IBM Watson’s Tone Analyzer API. Those scores are used to investigate whether the language used in emails can be transformed into useful features for correctly classifying them as spam or genuine emails. During the course of this investigation, a range of machine learning techniques is applied. The experiment found that a model using only the personality insight and language tone features showed some promising results on one dataset. However, across all datasets the results of this model were inconclusive. Furthermore, a model that combined these features with a normalised term-frequency feature set showed no real improvement in classification performance.

    Single-Class Learning for Spam Filtering: An Ensemble Approach

    Spam, also known as Unsolicited Commercial Email (UCE), has been an increasingly annoying problem for individuals and organizations. Most prior research formulated spam filtering as a classical text categorization task, in which training examples must include both spam emails (positive examples) and legitimate emails (negative examples). However, in many spam filtering scenarios, obtaining legitimate emails for training purposes is more difficult than collecting spam and unclassified emails. Hence, it would be more appropriate to construct a classification model for spam filtering from positive (i.e., spam emails) and unlabeled instances only; i.e., to train a spam filter without any legitimate emails as negative training examples. Several single-class learning techniques, including PNB and PEBL, have been proposed in the literature. However, they incur fundamental limitations when applied to spam filtering. In this study, we propose and develop an ensemble approach, referred to as E2, to address the limitations of PNB and PEBL. Specifically, we follow the two-stage framework of PEBL and extend each stage with an ensemble strategy. Our empirical evaluation results on two spam-filtering corpora suggest that the proposed E2 technique exhibits more stable and reliable performance than its benchmark techniques (i.e., PNB and PEBL).

    Spam Detection Using Machine Learning and Deep Learning

    Text messages are essential these days; however, spam texts have undermined the success of this communication mode. The compromised authenticity of such messages has given rise to several security breaches. Using spam messages, malicious links have been sent to either harm the system or obtain information detrimental to the user. Spam SMS messages as well as emails have been used as media for attacks such as masquerading and smishing (a phishing attack through text messaging), and this has threatened both users and service providers. Therefore, given the waves of attacks, it is important to identify and remove these spam messages. This dissertation explores the process of text classification from data input, to the embedded representation of words in vector form, and finally to the classification process. We have applied different embedding methods to capture both the linguistic and semantic meanings of words. The static embedding methods used include Word to Vector (Word2Vec) and Global Vectors (GloVe), while for dynamic embedding, transfer learning with Bidirectional Encoder Representations from Transformers (BERT) was employed. For classification, both machine learning and deep learning techniques were used to build an efficient and sensitive classification model with good accuracy and a low false positive rate. Our results established that the combination of BERT for embedding and machine learning for classification produced better classification results than other combinations. With these results, we developed models that combined the self-feature extraction advantage of deep learning and the effective classification of machine learning. These models were tested on four different datasets, namely the SMS Spam dataset, the Ling dataset, the Spam Assassin dataset and the Enron dataset. BERT+SVC (hybrid model) produced the result with the highest accuracy and lowest false positive rate.