10 research outputs found

    An Improved Transformer-based Model for Detecting Phishing, Spam, and Ham: A Large Language Model Approach

    Phishing and spam detection is a long-standing challenge that has been the subject of much academic research. Large Language Models (LLMs) have vast potential to transform society and provide new and innovative approaches to solving well-established challenges. Phishing and spam have caused financial hardship and lost time and resources for email users all over the world, and frequently serve as an entry point for ransomware threat actors. While detection approaches exist, especially heuristic-based approaches, LLMs offer the potential to venture into a new, unexplored area for understanding and solving this challenge. LLMs have rapidly altered the landscape for businesses, consumers, and academia, and demonstrate transformational potential for society. Based on this, applying these new and innovative approaches to email detection is a rational next step in academic research. In this work, we present IPSDM, our model based on fine-tuning the BERT family of models to specifically detect phishing and spam email. We demonstrate that our fine-tuned version, IPSDM, is better able to classify emails in both unbalanced and balanced datasets. This work serves as an important first step towards employing LLMs to improve the security of our information systems.

    A Semi-Supervised Learning Approach for Tackling Twitter Spam Drift

    Twitter has changed the way people get information by allowing them to express their opinions and comment on daily tweets. Unfortunately, due to its high popularity, Twitter has become very attractive to spammers. Unlike other types of spam, Twitter spam has become a serious issue in the last few years. The large number of users and the high volume of information being shared on Twitter play an important role in accelerating the spread of spam. In order to protect users, Twitter and the research community have been developing different spam detection systems by applying different machine-learning techniques. However, a recent study showed that current machine learning-based detection systems are not able to detect spam accurately because the characteristics of spam tweets vary over time. This issue is called "Twitter Spam Drift". In this paper, a semi-supervised learning approach (SSLA) is proposed to tackle this issue. The new approach uses unlabeled data to learn the structure of the domain. Experiments were performed on English and Arabic datasets to test and evaluate the proposed approach, and the results show that the proposed SSLA can reduce the effect of Twitter spam drift and outperform existing techniques.
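
The abstract does not spell out how SSLA works, so the following is only a generic self-training sketch of the broader idea of letting unlabeled data reshape a classifier. The nearest-centroid model, the `margin` threshold, and the toy points are all assumptions made for illustration and are not taken from the paper.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def self_train(labeled, unlabeled, rounds=5, margin=2.0):
    """Generic self-training with a nearest-centroid classifier.

    labeled:   list of (point, label) pairs with labels 0 (ham) or 1 (spam)
    unlabeled: list of points
    A point is pseudo-labeled only when one centroid is at least `margin`
    times closer (in squared distance) than the other.
    Returns the final (ham_centroid, spam_centroid).
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        c0 = centroid([p for p, y in labeled if y == 0])
        c1 = centroid([p for p, y in labeled if y == 1])
        confident, remaining = [], []
        for p in pool:
            d0, d1 = sq_dist(p, c0), sq_dist(p, c1)
            if d0 * margin <= d1:
                confident.append((p, 0))
            elif d1 * margin <= d0:
                confident.append((p, 1))
            else:
                remaining.append(p)
        if not confident:
            break  # nothing new was learned in this round
        labeled += confident
        pool = remaining
    return (centroid([p for p, y in labeled if y == 0]),
            centroid([p for p, y in labeled if y == 1]))


# Hypothetical 2-D feature vectors; the unlabeled points are slightly shifted.
seed = [((0, 0), 0), ((1, 0), 0), ((5, 5), 1), ((6, 5), 1)]
drifted = [(1, 1), (2, 1), (7, 6), (8, 7)]
ham_centroid, spam_centroid = self_train(seed, drifted)
print(ham_centroid, spam_centroid)
```

In this toy run the centroids move toward the shifted unlabeled points, which is the kind of adaptation a drift-aware detector would need; a real system would use a stronger base classifier and a calibrated confidence measure.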

    A Fake Profile Detection Model Using Multistage Stacked Ensemble Classification

    Fake profile identification on social media platforms is essential for preserving a reliable online community. Previous studies have primarily used conventional classifiers for fake account identification on social networking sites, neglecting feature selection and class balancing to enhance performance. This study introduces a novel multistage stacked ensemble classification model to enhance fake profile detection accuracy, especially in imbalanced datasets. The model comprises three phases: feature selection, base learning, and meta-learning for classification. The novelty of the work lies in utilizing chi-squared feature-class association-based feature selection, combined with stacked ensemble and cost-sensitive learning. The research findings indicate that the proposed model significantly enhances fake profile detection efficiency. Employing cost-sensitive learning enhances accuracy on the Facebook, Instagram, and Twitter spam datasets, with 95%, 98.20%, and 81% precision, respectively, outperforming conventional and advanced classifiers. It is demonstrated that the proposed model has the potential to enhance the security and reliability of online social networks, compared with existing models.
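
To make the first phase concrete, here is a minimal sketch of chi-squared feature-class association scoring for binary features. It is not the authors' implementation: the feature names, the toy data, and the `select_top_k` helper are hypothetical, and the stacked ensemble and cost-sensitive stages are not shown.

```python
def chi2_binary(feature, labels):
    """Pearson chi-squared statistic for a binary feature vs. a binary class."""
    n = len(labels)
    # Observed counts of the 2x2 contingency table, keyed by (feature, label).
    observed = {(f, c): 0 for f in (0, 1) for c in (0, 1)}
    for f, c in zip(feature, labels):
        observed[(f, c)] += 1
    score = 0.0
    for (f, c), obs in observed.items():
        row = observed[(f, 0)] + observed[(f, 1)]  # samples with this feature value
        col = observed[(0, c)] + observed[(1, c)]  # samples with this class
        expected = row * col / n
        if expected > 0:
            score += (obs - expected) ** 2 / expected
    return score


def select_top_k(columns, labels, k):
    """Return the names of the k features with the highest chi-squared score."""
    scores = {name: chi2_binary(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (hypothetical): 1 = fake profile, 0 = genuine profile.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "no_profile_photo": [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly tracks the class
    "has_bio":          [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the class
}
print(select_top_k(columns, labels, k=1))
```

A feature that is independent of the class scores zero, so ranking by this statistic keeps the features most associated with the fake/genuine label before any classifier is trained.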

    Une exploration des messages Twitter émis par les gouvernements en temps de COVID-19

    Governments are increasingly turning to social media platforms such as Twitter to disseminate public health information to the public, as evidenced during the COVID-19 pandemic. The purpose of this paper is to gain a better understanding of Canadian government and public health officials' use of Twitter as a dissemination platform during the pandemic, and to explore the public's engagement with and sentiment towards these messages. We examined the account data of 93 Canadian public health and government officials during the first wave of the pandemic (December 31, 2019 – August 31, 2020). Our objectives were to: 1) determine the engagement rates of the public with Canadian federal and provincial/territorial governments' and public health officials' Twitter posts, 2) illustrate the evolution of the Canadian public discourse during the pandemic's first wave through hashtag trends, and 3) provide insights into the public's reaction to the Canadian authorities' tweets through sentiment analysis. To address these objectives, we extracted Twitter posts, replies, and associated metadata available during the study period in both English and French. Our results suggest that members of the public demonstrated increased engagement with federal officials' Twitter accounts as compared to provincial/territorial Twitter accounts. Hashtag trend analyses illustrated the topic shift in the Canadian public discourse, which initially focused on COVID-19 mitigation strategies and evolved to address emerging issues such as the mental health effects of COVID-19. Additionally, we identified 11 sentiments in response to officials' COVID-19-related posts. This study illustrates the potential to leverage social media to understand public discourse during a pandemic. We suggest that routine analyses of such data can provide real-time recommendations to government and public health officials on public sentiments during a public health emergency and can provide useful insights into the accounts/actors with which members of the public are most engaged, which can be leveraged to disseminate key messages.

    A pipeline and comparative study of 12 machine learning models for text classification

    Text-based communication is highly favoured as a communication method, especially in business environments. As a result, it is often abused by sending malicious messages, e.g., spam emails, to deceive users into relaying personal information, including online account credentials or banking details. For this reason, many machine learning methods for text classification have been proposed and incorporated into the services of most email providers. However, optimising text classification algorithms and finding the right tradeoff on their aggressiveness is still a major research problem. We present an updated survey of 12 machine learning text classifiers applied to a public spam corpus. A new pipeline is proposed to optimise hyperparameter selection and improve the models' performance by applying specific methods (based on natural language processing) in the preprocessing stage. Our study aims to provide a new methodology to investigate and optimise the effect of different feature sizes and hyperparameters in machine learning classifiers that are widely used in text classification problems. The classifiers are tested and evaluated on different metrics including F-score (accuracy), precision, recall, and run time. By analysing all these aspects, we show how the proposed pipeline can be used to achieve good accuracy in spam filtering on the Enron dataset, a widely used public email corpus. Statistical tests and explainability techniques are applied to provide a robust analysis of the proposed pipeline and interpret the classification outcomes of the 12 machine learning models, also identifying words that drive the classification results. Our analysis shows that it is possible to identify an effective machine learning model to classify the Enron dataset with an F-score of 94%. Comment: This article has been accepted for publication in Expert Systems with Applications, April 2022. Published by Elsevier.
All data, models, and code used in this work are available on GitHub at https://github.com/Angione-Lab/12-machine-learning-models-for-text-classificatio
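
For readers unfamiliar with the evaluation metrics named in this abstract, the sketch below computes precision, recall, and the F-score (taken here as F1, the harmonic mean of precision and recall, which is an assumption about the variant used). The labels and predictions are made-up toy values, not results from the paper.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the `positive` class (here: spam)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 1 = spam, 0 = ham (hypothetical labels and predictions).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision penalises ham flagged as spam, while recall penalises spam that slips through, so the balance between them reflects the "aggressiveness" tradeoff the abstract mentions.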

    Applications in security and evasions in machine learning : a survey

    In recent years, machine learning (ML) has become an important means of providing security and privacy in various applications. ML is used to address serious issues such as real-time attack detection, data leakage vulnerability assessment, and many more. ML extensively supports the demanding requirements of current security and privacy scenarios across a range of areas such as real-time decision-making, big data processing, reduced cycle time for learning, cost-efficiency, and error-free processing. Therefore, in this paper, we review the state-of-the-art approaches where ML can be applied more effectively to fulfill current real-world requirements in security. We examine different security application perspectives where ML models play an essential role and compare their accuracy results along different possible dimensions. Analyzing ML algorithms in security applications provides a blueprint for an interdisciplinary research area. Even with the use of current sophisticated technology and tools, attackers can evade ML models by committing adversarial attacks. Therefore, the need arises to assess the vulnerability of ML models to adversarial attacks at development time. Accordingly, as a supplement to this point, we also analyze the different types of adversarial attacks on ML models. To give a proper visualization of security properties, we present the threat model and defense strategies against adversarial attack methods. Moreover, we illustrate the adversarial attacks based on the attackers' knowledge of the model and address the points of the model at which possible attacks may be committed. Finally, we also investigate different types of properties of adversarial attacks.

    A performance evaluation of machine learning-based streaming spam tweets detection

    The popularity of Twitter attracts more and more spammers. Spammers send unwanted tweets to Twitter users to promote websites or services, which are harmful to normal users. In order to stop spammers, researchers have proposed a number of mechanisms. Recent work focuses on applying machine learning techniques to Twitter spam detection. However, tweets are retrieved in a streaming way, and Twitter provides the Streaming API for developers and researchers to access public tweets in real time. A performance evaluation of existing machine learning-based streaming spam detection methods has been lacking. In this paper, we bridge the gap by carrying out a performance evaluation from three different aspects: data, feature, and model. A large ground-truth dataset of over 600 million public tweets was created by using a commercial URL-based security tool. For real-time spam detection, we further extracted 12 lightweight features for tweet representation. Spam detection was then transformed into a binary classification problem in the feature space, which can be solved by conventional machine learning algorithms. We evaluated the impact of different factors on spam detection performance, including the spam-to-nonspam ratio, feature discretization, training data size, data sampling, time-related data, and machine learning algorithms. The results show that streaming spam tweet detection is still a big challenge and that a robust detection technique should take into account the three aspects of data, feature, and model.
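
The abstract does not list its 12 lightweight features, so the sketch below only illustrates the general idea of turning a tweet into a small numeric vector cheaply enough for streaming use. The six content-based features chosen here, and the sample tweet, are hypothetical and should not be read as the paper's feature set.

```python
import re


def tweet_features(text):
    """Map a tweet's text to a small numeric feature vector (illustrative only)."""
    return [
        len(text),                               # number of characters
        len(text.split()),                       # number of whitespace-separated words
        len(re.findall(r"#\w+", text)),          # number of hashtags
        len(re.findall(r"@\w+", text)),          # number of user mentions
        len(re.findall(r"https?://\S+", text)),  # number of URLs
        sum(ch.isdigit() for ch in text),        # number of digit characters
    ]


sample = "Win a FREE phone now!!! http://example.com #free #win @you 100"
features = tweet_features(sample)
print(features)
```

Each vector of this kind can then be passed to any conventional binary classifier, which is the reduction to "a binary classification problem in the feature space" that the abstract describes.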

    Online Social Deception and Its Countermeasures for Trustworthy Cyberspace: A Survey

    We are living in an era when online communication over social network services (SNSs) has become an indispensable part of people's everyday lives. As a consequence, online social deception (OSD) in SNSs has emerged as a serious threat in cyberspace, particularly for users vulnerable to such cyberattacks. Cyber attackers have exploited the sophisticated features of SNSs to carry out harmful OSD activities, such as financial fraud, privacy threats, or sexual/labor exploitation. Therefore, it is critical to understand OSD and develop effective countermeasures against OSD for building trustworthy SNSs. In this paper, we conducted an extensive survey, covering (i) the multidisciplinary concepts of social deception; (ii) types of OSD attacks and their unique characteristics compared to other social network attacks and cybercrimes; (iii) comprehensive defense mechanisms embracing prevention, detection, and response (or mitigation) against OSD attacks along with their pros and cons; (iv) datasets/metrics used for validation and verification; and (v) legal and ethical concerns related to OSD research. Based on this survey, we provide insights into the effectiveness of countermeasures and the lessons from existing literature. We conclude this survey paper with an in-depth discussion of the limitations of the state of the art and recommend future research directions in this area. Comment: 35 pages, 8 figures, submitted to ACM Computing Survey