    A Survey of Existing E-mail Spam Filtering Methods Considering Machine Learning Techniques

    E-mail is one of the most secure media for online communication and for transferring data or messages over the web. As its popularity has grown, the volume of unsolicited messages has also increased rapidly. Different filtering approaches exist that automatically detect and remove these unwanted messages. There are several e-mail spam filtering techniques, such as knowledge-based techniques, clustering techniques, learning-based techniques, heuristic processes and so on. This paper surveys existing e-mail spam filtering systems based on Machine Learning Techniques (MLT) such as Naive Bayes, SVM, K-Nearest Neighbor, Bayes Additive Regression, KNN Tree, and rules. We present the classification, evaluation and comparison of different e-mail spam filtering systems and summarize the overall scenario regarding the accuracy rate of the different existing approaches.
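
As an illustration of the first technique the survey names, here is a minimal sketch of a multinomial Naive Bayes spam filter with add-one (Laplace) smoothing. It is not taken from the paper: the class name, the whitespace tokenisation, and the four training messages are assumptions made up for this example.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Toy multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)  # number of documents per class
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        # Vocabulary is the union of all words seen in training.
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_logp = None, float("-inf")
        for c in self.classes:
            # log prior + sum of smoothed log likelihoods
            logp = math.log(self.doc_counts[c] / total_docs)
            total_words = sum(self.word_counts[c].values())
            for word in text.lower().split():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[c][word]
                logp += math.log((count + 1) / (total_words + len(self.vocab)))
            if logp > best_logp:
                best_label, best_logp = c, logp
        return best_label


# Hypothetical training data, invented for this sketch.
train_texts = [
    "win free prize now",
    "free money click here",
    "meeting agenda for monday",
    "lunch with the project team",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayesSpamFilter().fit(train_texts, train_labels)
print(model.predict("free prize"))
print(model.predict("project meeting"))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied together.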

    An architectural-based approach to detecting spim in electronic means of communication

    Spam is something users and developers should be aware of in all Internet-based communication tools (such as e-mail, websites, Social Networking Sites (SNS), instant messengers and so on). This is because spammers have not stopped using these platforms to deceive and lure users into releasing vital and sensitive information (especially financial details). This paper develops an architecture-based technique for SPIM (Instant Message Spam or IM SPAM) detection using classification. The classification was done using the C4.5 classifier with a dataset of messages obtained from an instant messaging environment. The dataset served as the input to the classification algorithm, which was able to distinguish spam from non-spam messages. The resulting classifier is depicted in tree form to show its usefulness. The results show that its precision, recall and accuracy rate met standard recommendations, with a commendable error rate. The proposed technique will find implication in the reduction of the number of Internet users. Keywords: Social Networking sites, spammers, Instant message spam, C4.5 Classifiers, e-mails
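
The precision, recall, accuracy and error rate mentioned in this abstract can be computed from a confusion matrix. The sketch below shows the standard definitions only; the function name and the example labels are hypothetical and do not reproduce the paper's data or results.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Compute precision, recall, accuracy and error rate for one positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # spam correctly flagged
        elif pred == positive:
            fp += 1  # legitimate message wrongly flagged
        elif truth == positive:
            fn += 1  # spam that slipped through
        else:
            tn += 1  # legitimate message correctly passed
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy if total else 0.0,
    }


# Hypothetical labels, invented for this sketch.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```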

    A pipeline and comparative study of 12 machine learning models for text classification

    Text-based communication is highly favoured as a communication method, especially in business environments. As a result, it is often abused by sending malicious messages, e.g., spam emails, to deceive users into relaying personal information, including online accounts credentials or banking details. For this reason, many machine learning methods for text classification have been proposed and incorporated into the services of most email providers. However, optimising text classification algorithms and finding the right tradeoff on their aggressiveness is still a major research problem. We present an updated survey of 12 machine learning text classifiers applied to a public spam corpus. A new pipeline is proposed to optimise hyperparameter selection and improve the models' performance by applying specific methods (based on natural language processing) in the preprocessing stage. Our study aims to provide a new methodology to investigate and optimise the effect of different feature sizes and hyperparameters in machine learning classifiers that are widely used in text classification problems. The classifiers are tested and evaluated on different metrics including F-score (accuracy), precision, recall, and run time. By analysing all these aspects, we show how the proposed pipeline can be used to achieve a good accuracy towards spam filtering on the Enron dataset, a widely used public email corpus. Statistical tests and explainability techniques are applied to provide a robust analysis of the proposed pipeline and interpret the classification outcomes of the 12 machine learning models, also identifying words that drive the classification results. Our analysis shows that it is possible to identify an effective machine learning model to classify the Enron dataset with an F-score of 94%. Comment: This article has been accepted for publication in Expert Systems with Applications, April 2022. Published by Elsevier.
All data, models, and code used in this work are available on GitHub at https://github.com/Angione-Lab/12-machine-learning-models-for-text-classificatio
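
To make the idea of optimising feature size and hyperparameters against an F-score concrete, here is a toy grid search. It is a sketch under stated assumptions and not the authors' pipeline: the keyword-count classifier, the ranked word list `SPAM_WORDS`, the validation messages, and the parameter names `n_features` and `threshold` are all invented for illustration.

```python
from itertools import product


def f1_score(y_true, y_pred):
    """F1 is the harmonic mean of precision and recall (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def grid_search(evaluate, grid):
    """Try every parameter combination; return (best_params, best_score)."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical ranked feature list and validation set.
SPAM_WORDS = ["free", "win", "prize", "click"]
VALIDATION = [
    ("win a free prize now", 1),
    ("click to win", 1),
    ("free lunch at the office today", 0),
    ("meeting notes attached", 0),
]


def evaluate(n_features, threshold):
    """Flag a message as spam if it has at least `threshold` tokens from the top `n_features` words."""
    vocab = set(SPAM_WORDS[:n_features])
    y_true = [label for _, label in VALIDATION]
    y_pred = [
        int(sum(word in vocab for word in text.split()) >= threshold)
        for text, _ in VALIDATION
    ]
    return f1_score(y_true, y_pred)


best_params, best_score = grid_search(
    evaluate, {"n_features": [1, 2, 4], "threshold": [1, 2]}
)
print(best_params, best_score)
```

In practice the score would be estimated with cross-validation on held-out data rather than on a single small validation set, to limit overfitting of the selected hyperparameters.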

    Using word and phrase abbreviation patterns to extract age from Twitter microtexts

    The wealth of texts available publicly online for analysis is ever increasing. Much work in computational linguistics focuses on syntactic, contextual, morphological and phonetic analysis on written documents, vocal recordings, or texts on the internet. Twitter messages present a unique challenge for computational linguistic analysis due to their constrained size. The constraint of 140 characters often prompts users to abbreviate words and phrases. Additionally, as an informal writing medium, messages are not expected to adhere to grammatically or orthographically standard English. As such, Twitter messages are noisy and do not necessarily conform to standard writing conventions of linguistic corpora, often requiring special pre-processing before advanced analysis can be done. In the area of computational linguistics, there is an interest in determining latent attributes of an author. Attributes such as author gender can be determined with some amount of success from many sources, using various methods, such as analysis of shallow linguistic patterns or topic. Author age is more difficult to determine, but previous research has been somewhat successful at classifying age as a binary (e.g. over or under 30), ternary, or even as a continuous variable using various techniques. Twitter messages present a difficult problem for latent user attribute analysis, due to the pre-processing necessary for many computational linguistics analysis tasks. An added logistical challenge is that very few latent attributes are explicitly defined by users on Twitter. Twitter messages are a part of an enormous data set, but the data set must be independently annotated for latent writer attributes not defined through the Twitter API before any classification on such attributes can be done. The actual classification problem is another particular challenge due to restrictions on tweet length. 
Previous work has shown that word and phrase abbreviation patterns used on Twitter can be indicative of some latent user attributes, such as geographic region or the Twitter client (iPhone, Android, Twitter website, etc.) used to make posts. Language change has generally been posited as being driven by women. This study explores whether there are age-related patterns, or changes in those patterns over time, evident in Twitter posts from a variety of English authors. This work presents a growable data set annotated by Twitter users themselves for age and other useful attributes. The study also presents an extension of prior work on Twitter abbreviation patterns which shows that word and phrase abbreviation patterns can be used toward determining user age. Notable results include classification accuracy of up to 83%, which was 63% above the relative majority-class baseline (ZeroR in Weka) when classifying user ages into 6 equally sized age bins using a multilayer perceptron network classifier.
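
The majority-class baseline referred to above (ZeroR) ignores all features and always predicts the most frequent training label. The following sketch reimplements that idea in plain Python rather than calling Weka; the function names and the age-bin labels are hypothetical and unrelated to the study's data.

```python
from collections import Counter


def zero_r_fit(train_labels):
    """Return the most frequent label in the training set."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical age-bin labels, invented for this sketch.
train_labels = ["18-24", "18-24", "18-24", "25-34", "25-34", "35-44"]
test_labels = ["18-24", "25-34", "18-24", "35-44"]

majority = zero_r_fit(train_labels)
baseline_predictions = [majority] * len(test_labels)
baseline_accuracy = accuracy(test_labels, baseline_predictions)
print(majority, baseline_accuracy)
```

Any learned classifier should beat this number to show that it extracts signal from the features.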

    SMS Spam Detection in a Real-World Platform using Machine Learning

    Spam detection techniques have made our lives easier by unclogging our inboxes and keeping unsafe messages from being opened. With the automation of text messaging solutions and the increase in telecommunication companies and message providers, the volume of text messages has been on the rise. With this growth came malicious traffic over which users had little control. In this thesis, we present an implementation of a spam detection system in a real-world text messaging platform. Using well-established machine learning algorithms, we make an in-depth analysis of the performance of the models using two different datasets: one publicly available (N=5,574) and the other gathered from actual traffic of the platform (N=1,477). Making use of the empirical results, we outline the models and hyperparameters which can be used in the platform and in which scenarios they produce optimal performance. The results indicate that our dataset poses a great challenge to accurate classification, most likely due to the small sample size and unbalanced dataset, along with nuances in the dataset. Nevertheless, some models were found to have good all-around performance, and they can be trained and used in the platform.

    Comparative performance of machine learning methods for classification on phishing attack detection

    The development of computer networks has accelerated rapidly. This is reflected in the fact that computer users around the world need to connect their computers to the Internet. Internet use is therefore very important, for example for access to social media accounts such as Instagram, Facebook, and Twitter. However, despite this extensive use, the Internet does not necessarily keep accounts on mobile phones or computers secure. When a network system has a low level of security, it is easy for scammers to hack a victim’s computer system and retrieve the victim’s important information for their own benefit. Scammers use many methods to obtain important information, of which the phishing attack is the simplest and best known. Therefore, this study was conducted to develop an anti-phishing method to detect phishing attacks. Machine learning was proposed as a suitable approach for detecting phishing attacks. In this paper, several machine learning methods were studied and applied to phishing attack detection. Experiments were conducted to investigate which method performed better. Two benchmark datasets were used to assess the ability of the methods to detect phishing attacks. The results show the performance of each method on all datasets.