420 research outputs found

    A systematic literature review on spam content detection and classification

    Get PDF
    The presence of spam content in social media is tremendously increasing, and therefore the detection of spam has become vital. The spam contents increase as people extensively use social media, i.e ., Facebook, Twitter, YouTube, and E-mail. The time spent by people using social media is overgrowing, especially in the time of the pandemic. Users get a lot of text messages through social media, and they cannot recognize the spam content in these messages. Spam messages contain malicious links, apps, fake accounts, fake news, reviews, rumors, etc. To improve social media security, the detection and control of spam text are essential. This paper presents a detailed survey on the latest developments in spam text detection and classification in social media. The various techniques involved in spam detection and classification involving Machine Learning, Deep Learning, and text-based approaches are discussed in this paper. We also present the challenges encountered in the identification of spam with its control mechanisms and datasets used in existing works involving spam detection

    Email phishing detection with BLSTM and word embeddings

    Get PDF
    The paper presents the email phishing detection method that uses BLSTM as a deep learning model. For feature extraction word embeddings ahs been used. Presented results demonstrate high accuracy and precision

    A COMPARISON OF MACHINE LEARNING TECHNIQUES: E-MAIL SPAM FILTERING FROM COMBINED SWAHILI AND ENGLISH EMAIL MESSAGES

    Get PDF
    The speed of technology change is faster now compared to the past ten to fifteen years. It changes the way people live and force them to use the latest devices to match with the speed. In communication perspectives nowadays, use of electronic mail (e-mail) for people who want to communicate with friends, companies or even the universities cannot be avoided. This makes it to be the most targeted by the spammer and hackers and other bad people who want to get the benefit by sending spam emails. The report shows that the amount of emails sent through the internet in a day can be more than 10 billion among these 45% are spams. The amount is not constant as sometimes it goes higher than what is noted here. This indicates clearly the magnitude of the problem and calls for the need for more efforts to be applied to reduce this amount and also minimize the effects from the spam messages. Various measures have been taken to eliminate this problem. Once people used social methods, that is legislative means of control and now they are using technological methods which are more effective and timely in catching spams as these work by analyzing the messages content. In this paper we compare the performance of machine learning algorithms by doing the experiment for testing English language dataset, Swahili language dataset individual and combined two dataset to form one, and results from combined dataset compared them with the Gmail classifier. The classifiers which the researcher used are Naïve Bayes (NB), Sequential Minimal Optimization (SMO) and k-Nearest Neighbour (k-NN). The results for combined dataset shows that SMO classifier lead the others by achieve 98.60% of accuracy, followed by k-NN classifier which has 97.20% accuracy, and Naïve Bayes classifier has 92.89% accuracy. From this result the researcher concludes that SMO classifier can work better in dataset that combined English and Swahili languages. In English dataset shows that SMO classifier leads other algorism, it achieved 97.51% of accuracy, followed by k-NN with average accuracy of 93.52% and the last but also good accuracy is Naïve Bayes that come with 87.78%. Swahili dataset Naïve Bayes lead others by getting 99.12% accuracy followed by SMO which has 98.69% and the last was k-NN which has 98.47%

    A Comparative Study of Word Embedding Techniques for SMS Spam Detection

    Get PDF
    The file attached to this record is the author's final peer reviewed version. The Publisher's final version can be found by following the DOI link.E-mail and SMS are the most popular communication tools used by businesses, organizations and educational institutions. Every day, people receive hundreds of messages which could be either spam or ham. Spam is any form of unsolicited, unwanted digital communication, usually sent out in bulk. Spam emails and SMS waste resources by unnecessarily flooding network lines and consuming storage space. Therefore, it is important to develop high accuracy spam detection models to effectively block spam messages, so as to optimize resources and protect users. Various word-embedding techniques such as Bag of Words (BOW), N-grams, TF-IDF, Word2Vec and Doc2Vec have been widely applied to NLP problems, however a comparative study of these techniques for SMS spam detection is currently lacking. Hence, in this paper, we provide a comparative analysis of these popular word embedding techniques for SMS spam detection by evaluating their performance on a publicly available ham and spam dataset. We investigate the performance of the word embedding techniques using 5 different machine learning classifiers i.e. Multinomial Naive Bayes (MNB), KNN, SVM, Random Forest and Extra Trees. Based on the dataset employed in the study, N-gram, BOW and TF-IDF with oversampling recorded the highest F1 scores of 0.99 for ham and 0.94 for spam

    Improving spam email classification accuracy using ensemble techniques: a stacking approach

    Get PDF
    Spam emails pose a substantial cybersecurity danger, necessitating accurate classification to reduce unwanted messages and mitigate risks. This study focuses on enhancing spam email classification accuracy using stacking ensemble machine learning techniques.We trained and tested five classifiers: logistic regression, decision tree, K-nearest neighbors (KNN), Gaussian naive Bayes and AdaBoost. To address overfitting, two distinct datasets of spam emails were aggregated and balanced. Evaluating individual classifiers based on recall, precision and F1 score metrics revealed AdaBoost as the top performer. Considering evolving spam technology and new message types challenging traditional approaches, we propose a stacking method. By combining predictions from multiple base models, the stacking method aims to improve classification accuracy. The results demonstrate superior performance of the stacking method with the highest accuracy (98.8%), recall (98.8%) and F1 score (98.9%) among tested methods. Additional experiments validated our approach by varying dataset sizes and testing different classifier combinations. Our study presents an innovative combination of classifiers that significantly improves accuracy, contributing to the growing body of research on stacking techniques. Moreover, we compare classifier performances using a unique combination of two datasets, highlighting the potential of ensemble techniques, specifically stacking, in enhancing spam email classification accuracy. The implications extend beyond spam classification systems, offering insights applicable to other classification tasks. Continued research on emerging spam techniques is vital to ensure long-term effectiveness

    A Visual Framework for Graph and Text Analytics in Email Investigation

    Get PDF
    The aim of this work is to build a framework which can benefit from data analysis techniques to explore and mine important information stored in an email collection archive. The analysis of email data could be accomplished from different perspectives, we mainly focused our approach on two different aspects: social behaviors and the textual content of the emails body. We will present a review on the past techniques and features adopted to handle this type of analysis, and evaluate them in real tools. This background will motivate our choices and proposed approach, and help us build a final visual framework which can analyze and show social graph networks along with other data visualization elements that assist users in understanding and dynamically elaborating the email data uploaded. We will present the architecture and logical structure of the framework, and show the flexibility nature of the system for future integrations and improvements. The functional aspects of our approach will be tested using the ‘enron dataset’, and by applying real key actors involved in the ‘enron case’ scandal

    Text Mining for Information Systems Researchers: An Annotated Topic Modeling Tutorial

    Get PDF
    Analysts have estimated that more than 80 percent of today’s data is stored in unstructured form (e.g., text, audio, image, video)—much of it expressed in rich and ambiguous natural language. Traditionally, to analyze natural language, one has used qualitative data-analysis approaches, such as manual coding. Yet, the size of text data sets obtained from the Internet makes manual analysis virtually impossible. In this tutorial, we discuss the challenges encountered when applying automated text-mining techniques in information systems research. In particular, we showcase how to use probabilistic topic modeling via Latent Dirichlet allocation, an unsupervised text-mining technique, with a LASSO multinomial logistic regression to explain user satisfaction with an IT artifact by automatically analyzing more than 12,000 online customer reviews. For fellow information systems researchers, this tutorial provides guidance for conducting text-mining studies on their own and for evaluating the quality of others
    • …
    corecore