106 research outputs found
A Semi-Supervised Learning Approach for Tackling Twitter Spam Drift
Twitter has changed the way people get information by allowing them to express their opinions and comments in daily tweets. Unfortunately, its high popularity has made Twitter very attractive to spammers. Unlike other types of spam, Twitter spam has become a serious issue in the last few years. The large number of users and the high volume of information shared on Twitter play an important role in accelerating the spread of spam. To protect users, Twitter and the research community have been developing spam detection systems based on different machine-learning techniques. However, a recent study showed that current machine learning-based detection systems cannot detect spam accurately because the characteristics of spam tweets vary over time. This issue is called “Twitter Spam Drift”. In this paper, a semi-supervised learning approach (SSLA) is proposed to tackle this issue. The new approach uses unlabeled data to learn the structure of the domain. Experiments on English and Arabic datasets show that the proposed SSLA can reduce the effect of Twitter spam drift and outperform existing techniques.
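The abstract does not describe how SSLA works internally, so the following is only a generic illustration of semi-supervised learning under drift, not the authors' method. It is a minimal self-training sketch with a nearest-centroid classifier: confident predictions on unlabeled tweets are added to the training set as pseudo-labels, which lets the spam centroid follow drifted spam. The two features (URL count, hashtag count), the `margin` threshold, and all function names are hypothetical.

```python
from math import dist  # Python 3.8+

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def fit_centroids(labeled):
    """One centroid per class from (vector, label) pairs."""
    return {lab: centroid([x for x, y in labeled if y == lab])
            for lab in ("spam", "ham")}

def predict(cents, x):
    """Label of the nearest centroid."""
    return min(cents, key=lambda lab: dist(x, cents[lab]))

def self_train(labeled, unlabeled, margin=0.5, max_rounds=10):
    """Self-training: repeatedly pseudo-label the unlabeled vectors whose
    distances to the two centroids differ by at least `margin`
    (a hypothetical confidence threshold), then refit the centroids."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        cents = fit_centroids(labeled)
        confident, rest = [], []
        for x in pool:
            gap = abs(dist(x, cents["spam"]) - dist(x, cents["ham"]))
            if gap >= margin:
                confident.append((x, predict(cents, x)))
            else:
                rest.append(x)
        if not confident:  # nothing new to learn from
            break
        labeled.extend(confident)
        pool = rest
    return fit_centroids(labeled)

# Hypothetical features: (URL count, hashtag count) per tweet.
labeled = [((3, 1), "spam"), ((4, 2), "spam"), ((0, 1), "ham"), ((1, 0), "ham")]
unlabeled = [(5, 4), (6, 5), (7, 6), (0, 0), (1, 1)]  # includes drifted spam
cents = self_train(labeled, unlabeled)
```

In this toy data the spam centroid moves from (3.5, 1.5) toward the drifted tweets, which is the general effect a semi-supervised scheme relies on; a real system would use a stronger base classifier and guard against reinforcing wrong pseudo-labels.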
Statistical Features-Based Real-Time Detection of Drifted Twitter Spam
This is the author accepted manuscript. The final version is available from the publisher via the DOI in this record.
Twitter spam has become a critical problem nowadays. Recent works focus on applying machine learning techniques for Twitter spam detection, which make use of the statistical features of tweets. In our labeled tweets data set, however, we observe that the statistical properties of spam tweets vary over time, and thus, the performance of existing machine learning-based classifiers decreases. This issue is referred to as “Twitter Spam Drift”. In order to tackle this problem, we first carry out a deep analysis on the statistical features of one million spam tweets and one million non-spam tweets, and then propose a novel Lfun scheme. The proposed scheme can discover “changed” spam tweets from unlabeled tweets and incorporate them into classifier’s training process. A number of experiments are performed to evaluate the proposed scheme. The results show that our proposed Lfun scheme can significantly improve the spam detection accuracy in real-world scenarios.
This work was supported by the ARC Linkage Project under Grant LP120200266. The work of J. Zhang was supported by the National Natural Science Foundation of China under Grant 61401371.
Enhanced Spam Detection System for Twitter Social Networking Platform
Twitter is one of the most popular online social networking (OSN) sites, used by prominent people and organizations such as ministers, business people, large companies, and actors to share information. Around 500 million tweets are posted monthly by Twitter's 313 million active users. This wide reach has drawn the interest of spammers. These malicious actors exploit the platform for various nefarious purposes, including monitoring authentic users, disseminating harmful software, and promoting their agendas through URLs embedded in tweets. They engage in tactics such as secretly following and unfollowing legitimate users, all with the intent of gathering sensitive information. To address this problem, a secure spam detection system based on a machine learning approach is designed. The design uses stop-word removal and a word-to-vector model to refine the data and reduce its dimensionality. To enhance the quality of the data, cosine similarity is also applied to measure the similarity score among tweets, and an Artificial Neural Network (ANN) is trained on that basis. The trained network is then tested by examining precision, recall, and F-measure. A comparative analysis is also performed to demonstrate the effectiveness of the work. The proposed spam detection model obtains an average precision, recall, and F-measure of 0.9252, 0.6107, and 0.734, respectively.
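The preprocessing steps named in the abstract (stop-word removal followed by a cosine similarity score between tweets) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's pipeline: it swaps the word-to-vector model for plain word-count vectors, and the stop-word list and function names are hypothetical.

```python
from collections import Counter
from math import sqrt

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "to", "and", "of", "in", "for", "is", "at"}

def tokens(tweet):
    """Lower-case, split on whitespace, and drop stop words."""
    return [w for w in tweet.lower().split() if w not in STOP_WORDS]

def cosine_similarity(tweet_a, tweet_b):
    """Cosine of the angle between the word-count vectors of two tweets."""
    va, vb = Counter(tokens(tweet_a)), Counter(tokens(tweet_b))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (sqrt(sum(c * c for c in va.values()))
            * sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0  # empty tweets score 0 by convention

near_duplicate = cosine_similarity("win a free phone", "free phone to win")
unrelated = cosine_similarity("win a free phone", "meeting at noon")
```

Near-duplicate tweets that differ only in word order or stop words score close to 1, while tweets with no shared content words score 0; such scores could then serve as inputs or filters before training a classifier.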
Analysing and detecting twitter spam
Through in-depth data-driven analysis, we provide insights into deceptive information in Twitter spam, spammers' behaviours and emerging spamming strategies. We are also the first to identify and solve the "spam drift" problem. Online social network providers can adopt our findings and proposed scheme to redesign their detection systems and improve their efficiency and accuracy.
Cybersecurity and safety analysis in online social networks
This research deals with the security and safety issues related to the use of online social networks and successfully presents AI-based solutions to address them.
Information quality in online social media and big data collection: an example of Twitter spam detection
The popularity of online social media (OSM) is mainly conditioned by the integrity and quality of user-generated content (UGC) as well as the protection of users' privacy. Based on the definition of information quality as fitness for use, the high usability and accessibility of OSM have exposed many information quality (IQ) problems, which consequently decrease the performance of OSM-dependent applications. Such problems are caused by ill-intentioned individuals who misuse OSM services to spread different kinds of noisy information, including fake information, illegal commercial content, drug sales, malware downloads, and phishing links. The propagation of noisy information causes serious drawbacks: resource consumption, lower quality of service in OSM-based applications, and wasted human effort.
The majority of popular social networks on the Web 2.0 (e.g., Facebook, Twitter) are attacked daily by an enormous number of ill-intentioned users. However, these networks handle noisy information ineffectively, requiring several weeks or months to detect it. Moreover, several challenges stand in the way of building a complete OSM-based noisy-information filtering method that overcomes the shortcomings of existing OSM information filters. These challenges are: (i) big data; (ii) privacy and security; (iii) structure heterogeneity; (iv) UGC format diversity; (v) subjectivity and objectivity; and (vi) service limitations.
In this thesis, we focus on increasing the quality of social UGC that is published and publicly accessible in the form of posts and profiles on online social networks (OSNs) by addressing the stated challenges in depth. As social spam is the most common IQ problem on OSM, we introduce two generic approaches for detecting and filtering out spam content. The first approach detects spam posts (e.g., spam tweets) in a real-time stream, while the second handles a big data collection of social profiles (e.g., Twitter accounts). For filtering spam content in real time, we introduce an unsupervised collective-based framework that automatically adapts a supervised spam tweet classification function, yielding an updated real-time classifier without requiring manually annotated datasets. In the second approach, we treat big data collections by minimizing the search space of profiles that need advanced analysis, instead of processing every user profile in the collection. Each profile falling in the reduced search space is then analyzed further to produce an accurate decision using a binary classification model.
The experiments conducted on Twitter have shown that the unsupervised collective-based framework is able to produce an updated and effective real-time binary tweet-based classification function that adapts to the rapid evolution of social spammers' strategies on Twitter, outperforming two existing real-time spam detection methods. The results of the second approach have demonstrated that a preprocessing step that extracts spammy metadata values and leverages them in the retrieval process is a feasible solution for handling large collections of Twitter profiles, as an alternative to processing all profiles in the input data collection.
The introduced approaches open opportunities for information science researchers to leverage our solutions in other information filtering problems and applications. Our long-term perspective consists of (i) developing a generic platform covering the most common OSM for instantly checking the quality of a given piece of information, where the input could be profiles, website links, posts, or plain text; and (ii) transforming and adapting our methods to handle additional IQ problems such as rumors and information overload.
A review of ensemble learning and data augmentation models for class imbalanced problems: combination, implementation and evaluation
Class imbalance (CI) in classification problems arises when the number of
observations belonging to one class is lower than the other. Ensemble learning
combines multiple models to obtain a robust model and has been prominently used
with data augmentation methods to address class imbalance problems. In the last
decade, a number of strategies have been added to enhance ensemble learning and
data augmentation methods, along with new methods such as generative
adversarial networks (GANs). A combination of these has been applied in many
studies, and the evaluation of different combinations would enable a better
understanding and guidance for different application domains. In this paper, we
present a computational study to evaluate data augmentation and ensemble
learning methods used to address prominent benchmark CI problems. We present a
general framework that evaluates 9 data augmentation and 9 ensemble learning
methods for CI problems. Our objective is to identify the most effective
combination for improving classification performance on imbalanced datasets.
The results indicate that combinations of data augmentation methods with
ensemble learning can significantly improve classification performance on
imbalanced datasets. We find that traditional data augmentation methods such as
the synthetic minority oversampling technique (SMOTE) and random oversampling
(ROS) are not only better in performance for selected CI problems, but also
computationally less expensive than GANs. Our study is vital for the
development of novel models for handling imbalanced datasets.
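For readers unfamiliar with the two traditional augmentation methods named in the abstract above, the sketch below shows the core idea of each: random oversampling (ROS) duplicates minority samples, while SMOTE creates synthetic points by interpolating between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not the framework evaluated in the paper; the function names, the parameter `k`, and the toy data are assumptions.

```python
import random
from math import dist

def random_oversample(minority, target_size, rng):
    """ROS: duplicate randomly chosen minority samples up to target_size."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return list(minority) + extra

def smote_like(minority, target_size, k, rng):
    """Simplified SMOTE: each synthetic point lies on the segment between a
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples."""
    synthetic = []
    while len(minority) + len(synthetic) < target_size:
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: dist(x, p))[:k]
        nb = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + t * (ni - xi) for xi, ni in zip(x, nb)))
    return list(minority) + synthetic

rng = random.Random(0)  # fixed seed so the sketch is reproducible
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
ros = random_oversample(minority, 10, rng)
smote = smote_like(minority, 10, k=2, rng=rng)
```

Both routines only touch the minority class and involve no model training, which is consistent with the abstract's observation that such methods are computationally cheaper than GAN-based augmentation.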
Deep neural networks in the cloud: Review, applications, challenges and research directions
Deep neural networks (DNNs) are currently being deployed as machine learning technology in a wide
range of important real-world applications. DNNs consist of a huge number of parameters that require
millions of floating-point operations (FLOPs) to be executed both in learning and prediction modes. A
more effective method is to implement DNNs in a cloud computing system equipped with centralized
servers and data storage sub-systems with high-speed and high-performance computing capabilities.
This paper presents an up-to-date survey on current state-of-the-art deployed DNNs for cloud computing.
Various DNN complexities associated with different architectures are presented and discussed alongside
the necessities of using cloud computing. We also present an extensive overview of different cloud
computing platforms for the deployment of DNNs and discuss them in detail. Moreover, DNN applications
already deployed in cloud computing systems are reviewed to demonstrate the advantages of using
cloud computing for DNNs. The paper emphasizes the challenges of deploying DNNs in cloud computing
systems and provides guidance on enhancing current and new deployments.
The EGIA project (KK-2022/00119
The Consolidated Research Group MATHMODE (IT1456-22
- …
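To give a sense of the parameter and FLOP counts mentioned in the DNN survey abstract above, the following sketch counts both for a small fully connected network. It assumes the common convention that one multiply-accumulate equals two FLOPs and ignores bias additions and activation functions; the layer sizes and function names are hypothetical and are not taken from the paper.

```python
def dense_layer_cost(n_in, n_out):
    """Parameters and forward-pass FLOPs of one fully connected layer."""
    params = n_in * n_out + n_out  # weight matrix plus bias vector
    flops = 2 * n_in * n_out       # one multiply and one add per weight
    return params, flops

def mlp_cost(layer_sizes):
    """Total parameters and FLOPs for a stack of dense layers."""
    total_params = total_flops = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        params, flops = dense_layer_cost(n_in, n_out)
        total_params += params
        total_flops += flops
    return total_params, total_flops

# Hypothetical 784-512-10 network (e.g. a small image classifier).
print(mlp_cost([784, 512, 10]))  # (407050, 813056)
```

Even this small network needs roughly 0.8 million FLOPs per prediction, and the cost grows with the product of adjacent layer widths, which helps explain why larger models are commonly offloaded to high-performance cloud servers.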