106 research outputs found

    A Semi-Supervised Learning Approach for Tackling Twitter Spam Drift

    Get PDF
    Twitter has changed the way people get information by allowing them to express their opinion and comments on the daily tweets. Unfortunately, due to the high popularity of Twitter, it has become very attractive to spammers. Unlike other types of spam, Twitter spam has become a serious issue in the last few years. The large number of users and the high amount of information being shared on Twitter play an important role in accelerating the spread of spam. In order to protect the users, Twitter and the research community have been developing different spam detection systems by applying different machine-learning techniques. However, a recent study showed that the current machine learning-based detection systems are not able to detect spam accurately because spam tweet characteristics vary over time. This issue is called “Twitter Spam Drift”. In this paper, a semi-supervised learning approach (SSLA) has been proposed to tackle this. The new approach uses the unlabeled data to learn the structure of the domain. Different experiments were performed on English and Arabic datasets to test and evaluate the proposed approach and the results show that the proposed SSLA can reduce the effect of Twitter spam drift and outperform the existing techniques

    Statistical Features-Based Real-Time Detection of Drifted Twitter Spam

    Get PDF
    AcceptedThis is the author accepted manuscript. The final version is available from the publisher via the DOI in this record.Twitter spam has become a critical problem nowadays. Recent works focus on applying machine learning techniques for Twitter spam detection, which make use of the statistical features of tweets. In our labeled tweets data set, however, we observe that the statistical properties of spam tweets vary over time, and thus, the performance of existing machine learning-based classifiers decreases. This issue is referred to as “Twitter Spam Drift”. In order to tackle this problem, we first carry out a deep analysis on the statistical features of one million spam tweets and one million non-spam tweets, and then propose a novel Lfun scheme. The proposed scheme can discover “changed” spam tweets from unlabeled tweets and incorporate them into classifier’s training process. A number of experiments are performed to evaluate the proposed scheme. The results show that our proposed Lfun scheme can significantly improve the spam detection accuracy in real-world scenarios.This work was supported by the ARC Linkage Project under Grant LP120200266. The work of J. Zhang was supported by the National Natural Science Foundation of China under Grant 61401371

    Enhanced Spam Detection System for Twitter Social Networking Platform

    Get PDF
    Twitter social site is one of the most popular Online Social Networking Site (OSN) used by popular people such as Ministers, businessman, large companies, actors to share their information. In this site, around 500 million of tweets are posted monthly by the total 313 million Twitter active users. The widespread of Twitter has drawn the interest of spammers. These malicious actors exploit the platform for various nefarious purposes, including monitoring authentic users, disseminating harmful software, and promoting their agendas through URLs embedded in tweets. They engage in tactics like secret following and unfollowing legitimate users, all with the intent of gathering sensitive information.To resolve this problem, a secure spam detection based on machine learning approach is designed. The designed used stop word removal, word to vector model to refined and dimensionally reduced the data. To enhance the quality of the data Cosine similarity is also been applied to measure the similarity score among the tweets and based upon that Artificial Neural Network (ANN) is trained. Later on, it is used to test the efficiency by examining the performance parameters in terms of precision, recall and F-measure. Also, the comparative analysis has been performed to present the efficiency of the work. The average precision, recall and F measure of proposed spam detection model of 0.9252, 0.6107 and 0.734 are obtained

    Analysing and detecting twitter spam

    Full text link
    Through in-depth data-drive analysis, we provide insights on deceptive information in Twitter spam, spammers\u27 behaviours and emerging spamming strategies. We also firstly identify and solve the &quot;spam drift&quot; problem. Online social network providers can adopt our findings and proposed scheme to re-design their detection system to improve its efficiency and accuracy.<br /

    Cybersecurity and safety analysis in online social networks

    Full text link
    The research work deal with the security and safety issues related to the use of online social networks and it successfully presented AI-based solutions to address these issues in online social networks

    Information quality in online social media and big data collection: an example of Twitter spam detection

    Get PDF
    La popularité des médias sociaux en ligne (Online Social Media - OSM) est fortement liée à la qualité du contenu généré par l'utilisateur (User Generated Content - UGC) et la protection de la vie privée des utilisateurs. En se basant sur la définition de la qualité de l'information, comme son aptitude à être exploitée, la facilité d'utilisation des OSM soulève de nombreux problèmes en termes de la qualité de l'information ce qui impacte les performances des applications exploitant ces OSM. Ces problèmes sont causés par des individus mal intentionnés (nommés spammeurs) qui utilisent les OSM pour disséminer des fausses informations et/ou des informations indésirables telles que les contenus commerciaux illégaux. La propagation et la diffusion de telle information, dit spam, entraînent d'énormes problèmes affectant la qualité de services proposés par les OSM. La majorité des OSM (comme Facebook, Twitter, etc.) sont quotidiennement attaquées par un énorme nombre d'utilisateurs mal intentionnés. Cependant, les techniques de filtrage adoptées par les OSM se sont avérées inefficaces dans le traitement de ce type d'information bruitée, nécessitant plusieurs semaines ou voir plusieurs mois pour filtrer l'information spam. En effet, plusieurs défis doivent être surmontées pour réaliser une méthode de filtrage de l'information bruitée . Les défis majeurs sous-jacents à cette problématique peuvent être résumés par : (i) données de masse ; (ii) vie privée et sécurité ; (iii) hétérogénéité des structures dans les réseaux sociaux ; (iv) diversité des formats du UGC ; (v) subjectivité et objectivité. Notre travail s'inscrit dans le cadre de l'amélioration de la qualité des contenus en termes de messages partagés (contenu spam) et de profils des utilisateurs (spammeurs) sur les OSM en abordant en détail les défis susmentionnés. Comme le spam social est le problème le plus récurant qui apparaît sur les OSM, nous proposons deux approches génériques pour détecter et filtrer le contenu spam : i) La première approche consiste à détecter le contenu spam (par exemple, les tweets spam) dans un flux en temps réel. ii) La seconde approche est dédiée au traitement d'un grand volume des données relatives aux profils utilisateurs des spammeurs (par exemple, les comptes Twitter). Pour filtrer le contenu spam en temps réel, nous introduisons une approche d'apprentissage non supervisée qui permet le filtrage en temps réel des tweets spams dans laquelle la fonction de classification est adaptée automatiquement. La fonction de classification est entraîné de manière itérative et ne requière pas une collection de données annotées manuellement. Dans la deuxième approche, nous traitons le problème de classification des profils utilisateurs dans le contexte d'une collection de données à grande échelle. Nous proposons de faire une recherche dans un espace réduit de profils utilisateurs (une communauté d'utilisateurs) au lieu de traiter chaque profil d'utilisateur à part. Ensuite, chaque profil qui appartient à cet espace réduit est analysé pour prédire sa classe à l'aide d'un modèle de classification binaire. Les expériences menées sur Twitter ont montré que le modèle de classification collective non supervisé proposé est capable de générer une fonction efficace de classification binaire en temps réel des tweets qui s'adapte avec l'évolution des stratégies des spammeurs sociaux sur Twitter. L'approche proposée surpasse les performances de deux méthodes de l'état de l'art de détection de spam en temps réel. Les résultats de la deuxième approche ont démontré que l'extraction des métadonnées des spams et leur exploitation dans le processus de recherche de profils de spammeurs est réalisable dans le contexte de grandes collections de profils Twitter. L'approche proposée est une alternative au traitement de tous les profils existants dans le OSM.The popularity of OSM is mainly conditioned by the integrity and the quality of UGC as well as the protection of users' privacy. Based on the definition of information quality as fitness for use, the high usability and accessibility of OSM have exposed many information quality (IQ) problems which consequently decrease the performance of OSM dependent applications. Such problems are caused by ill-intentioned individuals who misuse OSM services to spread different kinds of noisy information, including fake information, illegal commercial content, drug sales, mal- ware downloads, and phishing links. The propagation and spreading of noisy information cause enormous drawbacks related to resources consumptions, decreasing quality of service of OSM-based applications, and spending human efforts. The majority of popular social networks (e.g., Facebook, Twitter, etc) over the Web 2.0 is daily attacked by an enormous number of ill-intentioned users. However, those popular social networks are ineffective in handling the noisy information, requiring several weeks or months to detect them. Moreover, different challenges stand in front of building a complete OSM-based noisy information filtering methods that can overcome the shortcomings of OSM information filters. These challenges are summarized in: (i) big data; (ii) privacy and security; (iii) structure heterogeneity; (iv) UGC format diversity; (v) subjectivity and objectivity; (vi) and service limitations In this thesis, we focus on increasing the quality of social UGC that are published and publicly accessible in forms of posts and profiles over OSNs through addressing in-depth the stated serious challenges. As the social spam is the most common IQ problem appearing over the OSM, we introduce a design of two generic approaches for detecting and filtering out the spam content. The first approach is for detecting the spam posts (e.g., spam tweets) in a real-time stream, while the other approach is dedicated for handling a big data collection of social profiles (e.g., Twitter accounts). For filtering the spam content in real-time, we introduce an unsupervised collective-based framework that automatically adapts a supervised spam tweet classification function in order to have an updated real-time classifier without requiring manual annotated data-sets. In the second approach, we treat the big data collections through minimizing the search space of profiles that needs advanced analysis, instead of processing every user's profile existing in the collections. Then, each profile falling in the reduced search space is further analyzed in an advanced way to produce an accurate decision using a binary classification model. The experiments conducted on Twitter online social network have shown that the unsupervised collective-based framework is able to produce updated and effective real- time binary tweet-based classification function that adapts the high evolution of social spammer's strategies on Twitter, outperforming the performance of two existing real- time spam detection methods. On the other hand, the results of the second approach have demonstrated that performing a preprocessing step for extracting spammy meta-data values and leveraging them in the retrieval process is a feasible solution for handling a large collections of Twitter profiles, as an alternative solution for processing all profiles existing in the input data collection. The introduced approaches open different opportunities for information science researcher to leverage our solutions in other information filtering problems and applications. Our long term perspective consists of (i) developing a generic platform covering most common OSM for instantly checking the quality of a given piece of information where the forms of the input information could be profiles, website links, posts, and plain texts; (ii) and transforming and adapting our methods to handle additional IQ problems such as rumors and information overloading

    A review of ensemble learning and data augmentation models for class imbalanced problems: combination, implementation and evaluation

    Full text link
    Class imbalance (CI) in classification problems arises when the number of observations belonging to one class is lower than the other. Ensemble learning combines multiple models to obtain a robust model and has been prominently used with data augmentation methods to address class imbalance problems. In the last decade, a number of strategies have been added to enhance ensemble learning and data augmentation methods, along with new methods such as generative adversarial networks (GANs). A combination of these has been applied in many studies, and the evaluation of different combinations would enable a better understanding and guidance for different application domains. In this paper, we present a computational study to evaluate data augmentation and ensemble learning methods used to address prominent benchmark CI problems. We present a general framework that evaluates 9 data augmentation and 9 ensemble learning methods for CI problems. Our objective is to identify the most effective combination for improving classification performance on imbalanced datasets. The results indicate that combinations of data augmentation methods with ensemble learning can significantly improve classification performance on imbalanced datasets. We find that traditional data augmentation methods such as the synthetic minority oversampling technique (SMOTE) and random oversampling (ROS) are not only better in performance for selected CI problems, but also computationally less expensive than GANs. Our study is vital for the development of novel models for handling imbalanced datasets

    Deep neural networks in the cloud: Review, applications, challenges and research directions

    Get PDF
    Deep neural networks (DNNs) are currently being deployed as machine learning technology in a wide range of important real-world applications. DNNs consist of a huge number of parameters that require millions of floating-point operations (FLOPs) to be executed both in learning and prediction modes. A more effective method is to implement DNNs in a cloud computing system equipped with centralized servers and data storage sub-systems with high-speed and high-performance computing capabilities. This paper presents an up-to-date survey on current state-of-the-art deployed DNNs for cloud computing. Various DNN complexities associated with different architectures are presented and discussed alongside the necessities of using cloud computing. We also present an extensive overview of different cloud computing platforms for the deployment of DNNs and discuss them in detail. Moreover, DNN applications already deployed in cloud computing systems are reviewed to demonstrate the advantages of using cloud computing for DNNs. The paper emphasizes the challenges of deploying DNNs in cloud computing systems and provides guidance on enhancing current and new deployments.The EGIA project (KK-2022/00119The Consolidated Research Group MATHMODE (IT1456-22
    • …
    corecore