61 research outputs found

    Information quality in online social media and big data collection: an example of Twitter spam detection

    Get PDF
    The popularity of online social media (OSM) is largely conditioned by the integrity and quality of user-generated content (UGC) and by the protection of users' privacy. Based on the definition of information quality as fitness for use, the high usability and accessibility of OSM expose many information quality (IQ) problems, which in turn degrade the performance of applications that depend on OSM. These problems are caused by ill-intentioned individuals (spammers) who misuse OSM services to spread various kinds of noisy information, including fake information, illegal commercial content, drug sales, malware downloads, and phishing links. The propagation and spreading of such noisy information, known as spam, wastes resources, degrades the quality of service of OSM-based applications, and consumes human effort. The most popular social networks (e.g., Facebook, Twitter) are attacked daily by an enormous number of ill-intentioned users, yet the filtering techniques adopted by these platforms remain ineffective against this kind of noisy information, often requiring several weeks or months to detect it. Moreover, several challenges stand in the way of building a complete OSM-based noisy-information filtering method that overcomes the shortcomings of existing OSM information filters: (i) big data; (ii) privacy and security; (iii) structural heterogeneity of social networks; (iv) diversity of UGC formats; (v) subjectivity and objectivity; and (vi) service limitations. In this thesis, we focus on improving the quality of social UGC that is published and publicly accessible in the form of posts and profiles over OSM by addressing these challenges in depth. Since social spam is the most common IQ problem on OSM, we introduce two generic approaches for detecting and filtering out spam content. The first approach detects spam posts (e.g., spam tweets) in a real-time stream, while the second is dedicated to handling large collections of social profiles (e.g., Twitter accounts). For filtering spam content in real time, we introduce an unsupervised, collective-based framework that automatically adapts a supervised spam-tweet classification function, yielding an up-to-date real-time classifier; the classification function is trained iteratively and does not require manually annotated datasets. In the second approach, we handle large data collections by minimizing the search space to a reduced set of user profiles (a community of users) that needs advanced analysis, instead of processing every user profile in the collection. Each profile falling within the reduced search space is then analyzed to produce an accurate decision using a binary classification model. Experiments conducted on Twitter show that the proposed unsupervised collective-based framework produces an up-to-date and effective real-time binary tweet classification function that adapts to the rapid evolution of social spammers' strategies on Twitter, outperforming two state-of-the-art real-time spam detection methods.
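To make the first approach concrete, the sketch below shows one plausible reading of an iteratively self-adapting, real-time spam-tweet classifier: a seed model labels each incoming batch and then re-trains on its own high-confidence predictions, so no further manual annotation is needed. This is a minimal illustration using scikit-learn, not the thesis's actual framework; `seed_texts`, `seed_labels`, and the batch inputs are hypothetical placeholders.

```python
# Minimal sketch of a self-adapting real-time spam-tweet classifier.
# Assumed inputs: a small bootstrap set (seed_texts, seed_labels) and
# successive batches of tweet texts arriving from a stream.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = SGDClassifier(loss="log_loss")  # online logistic regression

def bootstrap(seed_texts, seed_labels):
    """Initialize the classifier on a small seed set (the only labeled data)."""
    clf.partial_fit(vectorizer.transform(seed_texts), seed_labels, classes=[0, 1])

def process_batch(tweet_texts, confidence=0.9):
    """Classify a batch in real time, then adapt the model on confident predictions."""
    X = vectorizer.transform(tweet_texts)
    proba = clf.predict_proba(X)
    labels = proba.argmax(axis=1)
    # Self-training step: keep only predictions the current model is sure about.
    sure = proba.max(axis=1) >= confidence
    if sure.any():
        clf.partial_fit(X[sure], labels[sure])
    return labels  # 1 = spam, 0 = legitimate
```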
The results of the second approach, in turn, demonstrate that extracting spammy metadata values in a preprocessing step and leveraging them in the retrieval process is a feasible way to handle large collections of Twitter profiles, as an alternative to processing every profile in the input collection. The introduced approaches open opportunities for information science researchers to apply our solutions to other information filtering problems and applications. Our long-term perspective consists of (i) developing a generic platform covering the most common OSM for instantly checking the quality of a given piece of information, where the input may take the form of profiles, website links, posts, or plain text; and (ii) transforming and adapting our methods to handle additional IQ problems such as rumors and information overload.
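The second approach can similarly be pictured as a two-stage pipeline: collect metadata values that recur across known spam profiles, use them to retrieve a small candidate set from the full collection, and only then apply a binary classifier. The sketch below is a hypothetical illustration of that idea, not the thesis's exact pipeline; `known_spammers`, `all_profiles`, the metadata field names, and the `model`/`featurize` arguments are assumptions.

```python
# Hypothetical sketch: reduce a large profile collection to a candidate set by
# matching metadata values seen on known spammers, then classify only the candidates.

from collections import Counter

META_FIELDS = ("profile_description", "location", "url_domain")  # assumed fields

def spammy_metadata(known_spammers, min_count=5):
    """Collect metadata values that recur across known spam profiles."""
    counts = Counter(
        (field, profile.get(field))
        for profile in known_spammers
        for field in META_FIELDS
        if profile.get(field)
    )
    return {key for key, n in counts.items() if n >= min_count}

def reduce_search_space(all_profiles, spam_values):
    """Keep only profiles sharing at least one spammy metadata value."""
    return [
        p for p in all_profiles
        if any((field, p.get(field)) in spam_values for field in META_FIELDS)
    ]

def classify_candidates(candidates, model, featurize):
    """Run a separately trained binary classifier on the reduced set only."""
    return {p["id"]: model.predict([featurize(p)])[0] for p in candidates}
```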

    Misinformation Detection in Social Media

    Get PDF
    The pervasive use of social media gives it a crucial role in helping the public access reliable information. Meanwhile, the openness and timeliness of social networking sites also allow for the rapid creation and dissemination of misinformation, and it becomes increasingly difficult for online users to find accurate and trustworthy information. As witnessed in recent incidents, misinformation escalates quickly, can affect social media users with undesirable consequences, and can wreak havoc instantaneously. Unlike existing research on misinformation in psychology and the social sciences, social media platforms pose unprecedented challenges for misinformation detection. First, intentional spreaders of misinformation actively disguise themselves. Second, the content of misinformation may be manipulated to avoid detection, while abundant contextual information may play a vital role in detecting it. Third, not only the accuracy but also the earliness of a detection method matters for keeping misinformation from going viral. Fourth, social media platforms serve as a fundamental data source for various disciplines, and that research may have been conducted in the presence of misinformation. To tackle these challenges, we focus on developing machine learning algorithms that are robust to adversarial manipulation and data scarcity. The main objective of this dissertation is to provide a systematic study of misinformation detection in social media. To address adversarial attacks, I propose adaptive detection algorithms that deal with the active manipulations of misinformation spreaders via content and networks. To facilitate content-based approaches, I analyze the contextual data of misinformation and propose to incorporate its specific contextual patterns into a principled detection framework. Considering its rapidly growing nature, I study how misinformation can be detected at an early stage; in particular, I focus on the challenge of data scarcity and propose a novel framework that enables historical data to be utilized for emerging incidents that are seemingly irrelevant. With misinformation going viral, applications that rely on social media data face the challenge of corrupted data. To this end, I present robust statistical relational learning and personalization algorithms that minimize the negative effect of misinformation.
    Dissertation/Thesis: Doctoral Dissertation, Computer Science, 201
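As a hedged illustration of one theme above, combining post content with contextual signals for detection, the toy sketch below pairs a TF-IDF representation of the text with two simple context features. The column names and toy data are invented for this example; it is not the dissertation's model.

```python
# Toy content + context misinformation detector (illustrative only).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical fields: post text plus two contextual signals.
posts = pd.DataFrame({
    "text": ["Miracle cure doctors hate!", "City council meets Tuesday."],
    "replies": [150, 3],
    "account_age_days": [2, 900],
})
labels = [1, 0]  # 1 = misinformation, 0 = reliable (toy labels)

features = ColumnTransformer([
    ("content", TfidfVectorizer(), "text"),                      # what the post says
    ("context", "passthrough", ["replies", "account_age_days"]),  # how/where it appears
])

detector = Pipeline([("features", features),
                     ("clf", LogisticRegression(max_iter=1000))])
detector.fit(posts, labels)
print(detector.predict(posts))
```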

    Cashtag piggybacking: uncovering spam and bot activity in stock microblogs on Twitter

    Full text link
    Microblogs are increasingly exploited for predicting prices and traded volumes of stocks in financial markets. However, it has been demonstrated that much of the content shared on microblogging platforms is created and publicized by bots and spammers, yet the presence (or lack thereof) and the impact of fake stock microblogs have never been systematically investigated. Here, we study 9M tweets related to stocks of the 5 main financial markets in the US. By comparing tweets with financial data from Google Finance, we highlight important characteristics of Twitter stock microblogs. More importantly, we uncover a malicious practice, referred to as cashtag piggybacking, perpetrated by coordinated groups of bots and likely aimed at promoting low-value stocks by exploiting the popularity of high-value ones. Among the findings of our study is that as much as 71% of the authors of suspicious financial tweets are classified as bots by a state-of-the-art spambot detection algorithm; furthermore, 37% of them were suspended by Twitter a few months after our investigation. Our results call for the adoption of spam and bot detection techniques in all studies and applications that exploit user-generated content for predicting the stock market.
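A rough sketch of the intuition behind cashtag piggybacking, purely for illustration (this is not the paper's methodology): score each low-capitalization ticker by the share of its mentions that co-occur with a high-capitalization ticker in the same tweet. The `tweets` list and `market_cap` mapping are assumed inputs.

```python
# Illustrative cashtag piggybacking score: how often a small-cap ticker rides
# along with a large-cap ticker in the same tweet, relative to all its mentions.
import re
from collections import Counter

CASHTAG = re.compile(r"\$([A-Za-z]{1,6})\b")  # matches e.g. "$AAPL" -> "AAPL"

def piggyback_scores(tweets, market_cap, large_cap=10e9):
    mentions, co_with_large = Counter(), Counter()
    for text in tweets:
        tags = {t.upper() for t in CASHTAG.findall(text)}
        small = {t for t in tags if market_cap.get(t, 0) < large_cap}
        large = tags - small
        mentions.update(small)
        if large:                      # small caps co-mentioned with a large cap
            co_with_large.update(small)
    return {t: co_with_large[t] / mentions[t] for t in co_with_large}
```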

    Non-Hierarchical Networks for Censorship-Resistant Personal Communication.

    Full text link
    The Internet promises widespread access to the world’s collective information and fast communication among people, but common government censorship and spying undermine this potential. This censorship is facilitated by the Internet’s hierarchical structure: most traffic flows through routers owned by a small number of ISPs, who can be secretly coerced into aiding such efforts. Traditional cryptographic defenses are confusing to common users. This thesis instead advocates direct removal of the underlying hierarchical infrastructure, replacing it with non-hierarchical networks. Such networks lack these chokepoints, instead requiring would-be censors to control a substantial fraction of the participating devices, an expensive proposition. We take four steps towards the development of practical non-hierarchical networks. (1) We first describe Whisper, a non-hierarchical mobile ad hoc network (MANET) architecture for personal communication among friends and family that resists censorship and surveillance. At its core are two novel techniques: an efficient routing scheme based on the predictability of human locations, and a variant of onion routing suitable for decentralized MANETs. (2) We describe the design and implementation of Shout, a MANET architecture for censorship-resistant, Twitter-like public microblogging. (3) We describe the Mason test, a method used to detect Sybil attacks in ad hoc networks in which trusted authorities are not available. (4) We characterize and model the aggregate behavior of Twitter users to enable simulation-based study of systems like Shout, and we use our characterization of the retweet graph to analyze a novel spammer detection technique for Shout.
    PhD. Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/107314/1/drbild_1.pd
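As a toy example of the kind of structural analysis a retweet graph enables (this is neither the Mason test nor the thesis's spammer detector), the sketch below builds a weighted retweet graph with networkx and flags accounts that retweet heavily but from only a handful of sources. The `retweets` edge list is a hypothetical input.

```python
# Toy retweet-graph heuristic: flag accounts with many retweets of very few authors.
import networkx as nx

def flag_suspicious(retweets, max_sources=2, min_retweets=50):
    """`retweets` is an assumed list of (retweeter, original_author) pairs."""
    g = nx.DiGraph()
    for retweeter, author in retweets:
        if g.has_edge(retweeter, author):
            g[retweeter][author]["weight"] += 1
        else:
            g.add_edge(retweeter, author, weight=1)
    suspicious = []
    for user in g.nodes:
        total = sum(w for _, _, w in g.out_edges(user, data="weight"))
        distinct = g.out_degree(user)  # number of distinct authors retweeted
        if total >= min_retweets and distinct <= max_sources:
            suspicious.append(user)
    return suspicious
```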

    False News On Social Media: A Data-Driven Survey

    Full text link
    In the past few years, the research community has dedicated growing interest to the issue of false news circulating on social networks. The widespread attention to detecting and characterizing false news has been motivated by considerable real-world backlashes of this threat. As a matter of fact, social media platforms exhibit peculiar characteristics, with respect to traditional news outlets, that have been particularly favorable to the proliferation of deceptive information, and they also present unique challenges for all kinds of potential interventions on the subject. As this issue becomes of global concern, it is also gaining more attention in academia. The aim of this survey is to offer a comprehensive study of the recent advances in the detection, characterization, and mitigation of false news that propagates on social media, as well as the challenges and open questions that await future research in the field. We use a data-driven approach, focusing on a classification of the features used in each study to characterize false information and on the datasets used to train the classification methods. At the end of the survey, we highlight emerging approaches that look most promising for addressing false news.
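To make the feature-centric view of such surveys tangible, the sketch below groups a few example detection features into the content, user, and propagation categories commonly discussed; the specific fields are illustrative choices made here, not the survey's own taxonomy.

```python
# Illustrative grouping of false-news detection features by category.
def extract_features(post):
    """`post` is an assumed dict with raw fields; returns grouped detection features."""
    return {
        "content": {
            "length": len(post["text"]),
            "exclamations": post["text"].count("!"),
            "has_url": "http" in post["text"],
        },
        "user": {
            "followers": post["followers"],
            "account_age_days": post["account_age_days"],
            "verified": post["verified"],
        },
        "propagation": {
            "shares": post["shares"],
            "cascade_depth": post["cascade_depth"],  # how far the reshare chain goes
        },
    }
```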