Information quality in online social media and big data collection: an example of Twitter spam detection
The popularity of online social media (OSM) is largely conditioned by the integrity and quality of user-generated content (UGC) and by the protection of users' privacy. Given the definition of information quality as fitness for use, the high usability and accessibility of OSM expose many information quality (IQ) problems, which in turn degrade the performance of applications that depend on OSM. Such problems are caused by ill-intentioned individuals who misuse OSM services to spread various kinds of noisy information, including fake information, illegal commercial content, drug sales, malware downloads, and phishing links. The propagation and spreading of this noisy information, known as spam, cause enormous drawbacks: wasted resources, wasted human effort, and a decreasing quality of service for OSM-based applications.
The majority of popular social networks (e.g., Facebook, Twitter) are attacked daily by an enormous number of ill-intentioned users, yet the filtering techniques these platforms adopt have proved ineffective at handling such noisy information, often requiring weeks or even months to detect it. Moreover, several challenges stand in the way of building a complete OSM-based noisy-information filtering method that overcomes the shortcomings of existing OSM information filters. These challenges can be summarized as: (i) big data; (ii) privacy and security; (iii) structure heterogeneity; (iv) UGC format diversity; (v) subjectivity and objectivity; and (vi) service limitations.
In this thesis, we focus on increasing the quality of social UGC that is published and publicly accessible in the form of posts and profiles on OSM, addressing the stated challenges in depth. As social spam is the most common IQ problem on OSM, we introduce two generic approaches for detecting and filtering out spam content. The first approach detects spam posts (e.g., spam tweets) in a real-time stream, while the second is dedicated to handling big data collections of social profiles (e.g., Twitter accounts). To filter spam content in real time, we introduce an unsupervised collective-based framework that automatically adapts a supervised spam tweet classification function, yielding an up-to-date real-time classifier without requiring manually annotated data sets. In the second approach, we handle big data collections by minimizing the search space of profiles that need advanced analysis, instead of processing every user profile in the collection. Each profile falling in the reduced search space is then analyzed in depth to produce an accurate decision using a binary classification model.
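The core idea of the first approach, adapting a classifier from an unlabeled stream, can be illustrated with a minimal sketch. This is not the thesis implementation: the feature extraction, seed words, weights, and thresholds below are all assumptions chosen only to show how confident predictions can serve as pseudo-labels in place of manually annotated data.

```python
# Sketch: a spam-tweet classifier that adapts itself from an unlabeled stream.
# All names, seed words, and thresholds are illustrative assumptions.

def features(tweet):
    """Toy feature extractor: bag of lowercase tokens."""
    return set(tweet.lower().split())

class SelfAdaptingClassifier:
    def __init__(self, seed_spam_words):
        # Seed knowledge stands in for a manually annotated data set.
        self.spam_weights = {w: 1.0 for w in seed_spam_words}

    def score(self, tweet):
        """Average spam weight of the tweet's tokens (0 = clean)."""
        toks = features(tweet)
        return sum(self.spam_weights.get(t, 0.0) for t in toks) / max(len(toks), 1)

    def update(self, stream, hi=0.5):
        """One adaptation pass: tweets scored above `hi` are pseudo-labeled
        as spam and their tokens reinforce the model, so the classifier
        tracks evolving spammer vocabulary without manual labels."""
        for tweet in stream:
            if self.score(tweet) >= hi:
                for t in features(tweet):
                    self.spam_weights[t] = self.spam_weights.get(t, 0.0) + 0.1

clf = SelfAdaptingClassifier({"free", "win", "click"})
clf.update(["click here to win free money", "lunch with friends today"])
```

After the pass, "money" has acquired weight from the pseudo-labeled tweet, so future tweets reusing that vocabulary score higher than clean chatter, which is the adaptive behavior the abstract describes, reduced to its simplest form.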
The experiments conducted on the Twitter online social network show that the unsupervised collective-based framework produces an up-to-date and effective real-time binary tweet classification function that adapts to the rapid evolution of social spammers' strategies on Twitter, outperforming two existing real-time spam detection methods. The results of the second approach demonstrate that a preprocessing step that extracts spammy metadata values and leverages them in the retrieval process is a feasible way to handle a large collection of Twitter profiles, as an alternative to processing every profile in the input data collection.
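The search-space reduction described for the second approach can be sketched as a two-stage pipeline: a cheap metadata filter followed by the expensive per-profile classifier. This is an illustration under assumptions, not the thesis pipeline; the metadata values, profile fields, and the trivial stand-in classifier are all made up.

```python
# Sketch: shrink the search space with spammy metadata before running the
# expensive binary classifier. Field names and rules are assumed for
# illustration only.

SPAMMY_SOURCES = {"autopost-tool", "bulk-client"}  # assumed spam-linked metadata

def reduce_search_space(profiles):
    """Cheap stage: keep only profiles whose posting source matches
    previously extracted spammy metadata values."""
    return [p for p in profiles if p["source"] in SPAMMY_SOURCES]

def classify(profile):
    """Stand-in for the advanced binary model: a trivial follow-ratio rule."""
    if profile["following"] > 10 * max(profile["followers"], 1):
        return "spammer"
    return "legitimate"

profiles = [
    {"name": "a", "source": "autopost-tool", "following": 5000, "followers": 3},
    {"name": "b", "source": "official-app", "following": 200, "followers": 180},
    {"name": "c", "source": "bulk-client", "following": 40, "followers": 900},
]

candidates = reduce_search_space(profiles)  # only "a" and "c" reach stage two
labels = {p["name"]: classify(p) for p in candidates}
```

The point of the design is that profile "b" never pays the cost of the advanced analysis, which is how a retrieval step makes very large profile collections tractable.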
The introduced approaches open up opportunities for information science researchers to leverage our solutions in other information filtering problems and applications. Our long-term perspective consists of (i) developing a generic platform covering the most common OSM for instantly checking the quality of a given piece of information, where the input may be a profile, website link, post, or plain text; and (ii) transforming and adapting our methods to handle additional IQ problems such as rumors and information overload.
Misinformation Detection in Social Media
abstract: The pervasive use of social media gives it a crucial role in helping the public access reliable information. Meanwhile, the openness and timeliness of social networking sites also allow the rapid creation and dissemination of misinformation. It is becoming increasingly difficult for online users to find accurate and trustworthy information. As witnessed in recent incidents, misinformation escalates quickly, can impact social media users with undesirable consequences, and can wreak havoc instantaneously. Unlike existing research on misinformation in psychology and the social sciences, social media platforms pose unprecedented challenges for misinformation detection. First, intentional spreaders of misinformation actively disguise themselves. Second, the content of misinformation may be manipulated to avoid detection, while abundant contextual information may play a vital role in detecting it. Third, not only the accuracy but also the earliness of a detection method matters in containing misinformation before it goes viral. Fourth, social media platforms serve as a fundamental data source for various disciplines, and research in those disciplines may have been conducted in the presence of misinformation. To tackle these challenges, we focus on developing machine learning algorithms that are robust to adversarial manipulation and data scarcity.
The main objective of this dissertation is to provide a systematic study of misinformation detection in social media. To tackle the challenge of adversarial attacks, I propose adaptive detection algorithms that deal with the active manipulations of misinformation spreaders via content and networks. To facilitate content-based approaches, I analyze the contextual data of misinformation and propose to incorporate the specific contextual patterns of misinformation into a principled detection framework. Considering its rapidly growing nature, I study how misinformation can be detected at an early stage. In particular, I focus on the challenge of data scarcity and propose a novel framework that enables historical data to be utilized for emerging incidents that are seemingly irrelevant. With misinformation going viral, applications that rely on social media data face the challenge of corrupted data. To this end, I present robust statistical relational learning and personalization algorithms that minimize the negative effects of misinformation.
Doctoral Dissertation, Computer Science, 201
Cashtag piggybacking: uncovering spam and bot activity in stock microblogs on Twitter
Microblogs are increasingly exploited for predicting prices and traded
volumes of stocks in financial markets. However, it has been demonstrated that
much of the content shared in microblogging platforms is created and publicized
by bots and spammers. Yet, the presence (or lack thereof) and the impact of
fake stock microblogs has never systematically been investigated before. Here,
we study 9M tweets related to stocks of the 5 main financial markets in the US.
By comparing tweets with financial data from Google Finance, we highlight
important characteristics of Twitter stock microblogs. More importantly, we
uncover a malicious practice - referred to as cashtag piggybacking -
perpetrated by coordinated groups of bots and likely aimed at promoting
low-value stocks by exploiting the popularity of high-value ones. Among the
findings of our study is that as much as 71% of the authors of suspicious
financial tweets are classified as bots by a state-of-the-art spambot detection
algorithm. Furthermore, 37% of them were suspended by Twitter a few months
after our investigation. Our results call for the adoption of spam and bot
detection techniques in all studies and applications that exploit
user-generated content for predicting the stock market.
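The piggybacking pattern the paper uncovers, low-value stocks promoted by riding the visibility of high-value ones, lends itself to a simple heuristic sketch. This is not the paper's detection method; the tickers, market capitalizations, and cut-off below are invented for illustration.

```python
# Sketch: flag tweets that pair a low-capitalization cashtag with a
# high-capitalization one (possible piggybacking). All data is made up.

import re

MARKET_CAP = {"AAPL": 2.8e12, "MSFT": 2.6e12, "XYZQ": 4.0e7, "PUMP": 1.2e7}
LOW_CAP = 1e9  # assumed cut-off between low- and high-value stocks

def cashtags(tweet):
    """Extract $TICKER mentions from a tweet."""
    return set(re.findall(r"\$([A-Z]{1,5})\b", tweet))

def is_piggybacking(tweet):
    """True if the tweet mixes at least one low-cap and one high-cap ticker."""
    caps = [MARKET_CAP[t] for t in cashtags(tweet) if t in MARKET_CAP]
    return any(c < LOW_CAP for c in caps) and any(c >= LOW_CAP for c in caps)

print(is_piggybacking("$AAPL to the moon, also check $PUMP!"))  # True
print(is_piggybacking("$AAPL earnings beat estimates"))         # False
```

A real system would, as the abstract suggests, combine such co-mention signals with bot detection on the posting accounts rather than rely on the tweet text alone.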
Non-Hierarchical Networks for Censorship-Resistant Personal Communication.
The Internet promises widespread access to the world's collective information and fast communication among people, but common government censorship and spying undermine this potential. This censorship is facilitated by the Internet's hierarchical structure: most traffic flows through routers owned by a small number of ISPs, who can be secretly coerced into aiding such efforts. Traditional cryptographic defenses are confusing to common users. This thesis instead advocates direct removal of the underlying hierarchical infrastructure, replacing it with non-hierarchical networks. These networks lack such chokepoints, requiring would-be censors to control a substantial fraction of the participating devices, an expensive proposition. We take four steps towards the development of practical non-hierarchical networks. (1) We first describe Whisper, a non-hierarchical mobile ad hoc network (MANET) architecture for personal communication among friends and family that resists censorship and surveillance. At its core are two novel techniques: an efficient routing scheme based on the predictability of human locations, and a variant of onion routing suitable for decentralized MANETs. (2) We describe the design and implementation of Shout, a MANET architecture for censorship-resistant, Twitter-like public microblogging. (3) We describe the Mason test, a method used to detect Sybil attacks in ad hoc networks in which trusted authorities are not available. (4) We characterize and model the aggregate behavior of Twitter users to enable simulation-based study of systems like Shout. We use our characterization of the retweet graph to analyze a novel spammer detection technique for Shout.
PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/107314/1/drbild_1.pd
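Routing on the predictability of human locations, as Whisper's first technique is described, can be illustrated with a toy mobility model. This is an assumption-laden sketch, not Whisper's actual scheme: each node keeps a first-order Markov model of its own movements, and a message is handed to whichever neighbor is most likely to visit the target location next.

```python
# Sketch: pick a relay by predicted mobility. The Markov model, node API,
# and scenario below are illustrative assumptions, not Whisper's design.

from collections import Counter, defaultdict

class Node:
    def __init__(self, name):
        self.name = name
        self.transitions = defaultdict(Counter)  # location -> next-location counts
        self.current = None

    def observe(self, location):
        """Record one movement step of this node."""
        if self.current is not None:
            self.transitions[self.current][location] += 1
        self.current = location

    def prob_next(self, location):
        """Estimated probability this node moves to `location` next."""
        counts = self.transitions[self.current]
        total = sum(counts.values())
        return counts[location] / total if total else 0.0

def best_relay(neighbors, target_location):
    """Hand the message to the neighbor most likely to reach the target."""
    return max(neighbors, key=lambda n: n.prob_next(target_location))

alice, bob = Node("alice"), Node("bob")
for loc in ["home", "cafe", "home", "cafe", "home"]:
    alice.observe(loc)  # alice oscillates home <-> cafe
for loc in ["home", "office", "home", "office"]:
    bob.observe(loc)    # bob commutes home <-> office

relay = best_relay([alice, bob], "cafe")  # alice is the better carrier
```

The design point is that regular human routines make such predictions useful for delivery without any central infrastructure, which is what lets the network avoid hierarchical chokepoints.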
False News On Social Media: A Data-Driven Survey
In the past few years, the research community has dedicated growing interest
to the issue of false news circulating on social networks. The widespread
attention on detecting and characterizing false news has been motivated by
considerable real-world backlash against this threat. As a matter of
fact, social media platforms exhibit peculiar characteristics, with respect to
traditional news outlets, which have been particularly favorable to the
proliferation of deceptive information. They also present unique challenges for
all kinds of potential interventions on the subject. As this issue becomes of
global concern, it is also gaining more attention in academia. The aim of this
survey is to offer a comprehensive study on the recent advances in terms of
detection, characterization and mitigation of false news that propagate on
social media, as well as the challenges and open questions that await
future research in the field. We use a data-driven approach, focusing on a
classification of the features used in each study to characterize
false information and on the datasets used for training classification
methods. At the end of the survey, we highlight emerging approaches that look
most promising for addressing false news.