11 research outputs found

    Identifying Misinformation on YouTube through Transcript Contextual Analysis with Transformer Models

    Full text available
    Misinformation on YouTube is a significant concern, necessitating robust detection strategies. In this paper, we introduce a novel methodology for video classification, focusing on the veracity of the content. We convert the conventional video classification task into a text classification task by leveraging the textual content derived from the video transcripts. We employ advanced machine learning techniques such as transfer learning to solve the classification challenge. Our approach incorporates two forms of transfer learning: (a) fine-tuning base transformer models such as BERT, RoBERTa, and ELECTRA, and (b) few-shot learning using the sentence-transformers MPNet and RoBERTa-large. We apply the trained models to three datasets: (a) a YouTube vaccine-misinformation video dataset, (b) a YouTube pseudoscience video dataset, and (c) a Fake-News dataset (a collection of articles). Including the Fake-News dataset extended the evaluation of our approach beyond YouTube videos. Using these datasets, we evaluated the models' ability to distinguish valid information from misinformation. The fine-tuned models yielded a Matthews Correlation Coefficient > 0.81, accuracy > 0.90, and F1 score > 0.90 on two of the three datasets. Interestingly, the few-shot models outperformed the fine-tuned ones by 20% in both accuracy and F1 score on the YouTube pseudoscience dataset, highlighting the potential utility of this approach, especially in the context of limited training data.
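
    As an illustration of the fine-tuning route described above, here is a minimal sketch of binary text classification over transcript snippets, assuming the Hugging Face transformers and datasets libraries; the checkpoint, toy examples, and hyperparameters are illustrative and not taken from the paper.

        from datasets import Dataset
        from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                                  Trainer, TrainingArguments)

        # Hypothetical toy data: transcript text paired with a binary veracity label
        # (0 = valid information, 1 = misinformation).
        data = Dataset.from_dict({
            "text": ["example transcript claiming a miracle cure ...",
                     "example transcript summarizing official guidance ..."],
            "label": [1, 0],
        })

        model_name = "roberta-base"  # any BERT/RoBERTa/ELECTRA-style checkpoint
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

        def tokenize(batch):
            return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

        data = data.map(tokenize, batched=True)

        trainer = Trainer(
            model=model,
            args=TrainingArguments(output_dir="out", num_train_epochs=3,
                                   per_device_train_batch_size=8),
            train_dataset=data,
        )
        trainer.train()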

    A graph exploration method for identifying influential spreaders in complex networks

    No full text
    The problem of identifying influential spreaders - the important nodes - in a real-world network is of high importance due to its theoretical interest as well as its practical applications, such as accelerating information diffusion, controlling the spread of a disease, and improving the resilience of networks to external attacks. In this paper, we propose a graph exploration sampling method that accurately identifies the influential spreaders in a complex network, without any prior knowledge of the original graph apart from the collected samples/subgraphs. The method explores the graph following a deterministic selection rule and outputs a graph sample - the set of edges that have been crossed. The proposed method is based on a version of the Rank Degree graph sampling algorithm. We conduct extensive experiments on eight real-world networks by simulating the susceptible-infected-recovered (SIR) and susceptible-infected-susceptible (SIS) epidemic models, which serve as ground-truth identifiers of nodes' spreading efficiency. Experimentally, we show that by exploring only 20% of the network and using degree centrality as well as the k-core measure, we are able to identify the influential spreaders with at least the same accuracy as in the full-information case, namely, the case where we have access to the original graph and compute the centrality measures on it. Finally, and more importantly, we present strong evidence that degree centrality - the degree of nodes in the collected samples - is almost as accurate as the k-core values obtained from the original graph.
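
    The deterministic exploration idea can be pictured with a short sketch: starting from a seed node, repeatedly expand the highest-degree frontier node, record the crossed edges, and rank nodes by their degree in the collected sample. This is a simplified stand-in for the Rank Degree-based method, not the authors' exact algorithm; the graph, seed, and budget are hypothetical, and networkx is assumed.

        import networkx as nx

        def explore_sample(G, seed, budget):
            """Deterministically explore G from `seed`, always expanding the
            highest-degree frontier node, until `budget` edges have been crossed.
            (A simplification of Rank Degree-style exploration, for illustration.)"""
            sampled_edges, visited, frontier = set(), {seed}, {seed}
            while frontier and len(sampled_edges) < budget:
                u = max(frontier, key=G.degree)      # deterministic selection rule
                frontier.remove(u)
                for v in G.neighbors(u):
                    sampled_edges.add(tuple(sorted((u, v))))
                    if v not in visited:
                        visited.add(v)
                        frontier.add(v)
                    if len(sampled_edges) >= budget:
                        break
            return sampled_edges

        G = nx.barabasi_albert_graph(1000, 3)                  # hypothetical network
        budget = int(0.2 * G.number_of_edges())                # explore ~20% of the edges
        sample = nx.from_edgelist(explore_sample(G, seed=0, budget=budget))

        # Rank nodes by their degree *in the sample* as a proxy for spreading efficiency.
        top_spreaders = sorted(sample.degree, key=lambda kv: kv[1], reverse=True)[:10]
        print(top_spreaders)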

    Privacy-Preserving Online Content Moderation with Federated Learning

    No full text
    Users are exposed to a large volume of harmful content that appears daily on various social network platforms. One solution for protecting users is to develop online moderation tools that use Machine Learning (ML) techniques for automatic detection or content filtering. On the other hand, the processing of user data requires compliance with privacy policies. This paper proposes a privacy-preserving Federated Learning (FL) framework for online content moderation that incorporates Central Differential Privacy (CDP). We simulate the FL training of a classifier for detecting tweets with harmful content, and we show that the performance of the FL framework can be close to that of the centralized approach. Moreover, it maintains high performance even if only a small number of clients (each with a small number of tweets) is available for the FL training. When reducing the number of clients (from fifty to ten) or the tweets per client (from 1K to 100), the classifier can still achieve a high AUC. Furthermore, we extend the evaluation to four other Twitter datasets that capture different types of user misbehavior and still obtain a promising performance (61% - 80% AUC).
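
    To make the simulated FL setup concrete, below is a minimal federated averaging (FedAvg) sketch with synthetic clients, assuming PyTorch; the linear model, feature size, and client data are placeholders rather than the paper's classifier, and the CDP noise step is omitted here (a sketch of it follows the next entry).

        import torch
        from torch import nn

        def local_update(global_state, data, epochs=1, lr=0.1):
            """Train a copy of the global model on one client's private (features, labels)."""
            model = nn.Linear(16, 2)
            model.load_state_dict(global_state)
            opt = torch.optim.SGD(model.parameters(), lr=lr)
            loss_fn = nn.CrossEntropyLoss()
            X, y = data
            for _ in range(epochs):
                opt.zero_grad()
                loss_fn(model(X), y).backward()
                opt.step()
            return model.state_dict()

        # Hypothetical clients: each holds a small set of tweet feature vectors and labels.
        clients = [(torch.randn(100, 16), torch.randint(0, 2, (100,))) for _ in range(10)]

        global_model = nn.Linear(16, 2)
        for rnd in range(5):                                   # federated rounds
            updates = [local_update(global_model.state_dict(), c) for c in clients]
            avg = {k: torch.stack([u[k] for u in updates]).mean(dim=0)
                   for k in updates[0]}                        # FedAvg aggregation
            global_model.load_state_dict(avg)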

    Privacy-Preserving Online Content Moderation: A Federated Learning Use Case

    No full text
    Users are exposed to a large volume of harmful content that appears daily on various social network platforms. One solution for protecting users is to develop online moderation tools that use Machine Learning (ML) techniques for automatic detection or content filtering. On the other hand, the processing of user data requires compliance with privacy policies. In this paper, we propose a framework for developing content moderation tools in a privacy-preserving manner, where sensitive information stays on the users' devices. For this purpose, we apply Differentially Private Federated Learning (DP-FL), where the training of ML models is performed locally on the users' devices and only the model updates are shared with a central entity. To demonstrate the utility of our approach, we simulate harmful text classification on Twitter data in a distributed FL fashion, but the overall concept can be generalized to other types of misbehavior, data, and platforms. We show that the performance of the proposed FL framework can be close to that of the centralized approach, for both DP-FL and non-DP FL. Moreover, it maintains high performance even if only a small number of clients (each with a small number of tweets) is available for the FL training. When reducing the number of clients (from fifty to ten) or the tweets per client (from 1K to 100), the classifier can still achieve a high AUC. Furthermore, we extend the evaluation to four other Twitter datasets that capture different types of user misbehavior and still obtain a promising performance (61% - 80% AUC). Finally, we explore the overhead on the users' devices during the FL training phase and show that the local training does not introduce excessive CPU utilization or memory consumption overhead.
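
    Building on the FedAvg sketch above, one common way to realize central differential privacy is for the server to clip each client's contribution and add Gaussian noise to the aggregate. The sketch below illustrates that aggregation step only, assuming PyTorch; the clip norm and noise multiplier are placeholders, privacy accounting is omitted, and this is not claimed to be the paper's exact mechanism.

        import torch

        def dp_aggregate(updates, clip_norm=1.0, noise_multiplier=1.0):
            """Clip each client's (state_dict-shaped) update and add Gaussian noise to the
            average - a simplified central-DP aggregation step. In practice the updates
            would be deltas from the global model, and the noise scale would come from a
            privacy accountant."""
            clipped = []
            for u in updates:
                flat = torch.cat([p.flatten() for p in u.values()])
                scale = min(1.0, clip_norm / (flat.norm().item() + 1e-12))
                clipped.append({k: p * scale for k, p in u.items()})
            n = len(updates)
            noisy = {}
            for k in clipped[0]:
                mean = torch.stack([c[k] for c in clipped]).mean(dim=0)
                noise = torch.normal(0.0, noise_multiplier * clip_norm / n, mean.shape)
                noisy[k] = mean + noise
            return noisy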

    Did State-Sponsored Trolls Shape the 2016 US Presidential Election Discourse? Quantifying Influence on Twitter

    No full text
    It is a widely accepted fact that state-sponsored Twitter accounts operated during the 2016 US presidential election, spreading millions of tweets with misinformation and inflammatory political content. Whether these social media campaigns of the so-called “troll” accounts were able to manipulate public opinion is still in question. Here, we quantify the influence of troll accounts on Twitter by analyzing 152.5 million tweets (by 9.9 million users) from that period. The data contain original tweets from 822 troll accounts identified as such by Twitter. We construct and analyze a very large interaction graph of 9.3 million nodes and 169.9 million edges using graph analysis techniques and a game-theoretic centrality measure. Then, we quantify the influence of all Twitter accounts on the overall information exchange as defined by the retweet cascades. We provide a global influence ranking of all Twitter accounts and find that one troll account appears in the top-100 and four in the top-1000. This, combined with other findings presented in this paper, constitutes evidence that the driving force of virality and influence in the network came from regular users - users who have not been classified as trolls by Twitter. On the other hand, we find that, on average, troll accounts were tens of times more influential than regular users. Moreover, 23% and 22% of regular accounts in the top-100 and top-1000, respectively, have since been suspended by Twitter. This raises questions about their authenticity and practices during the 2016 US presidential election.
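
    As a toy illustration of ranking accounts in a retweet interaction graph, the sketch below uses one classic game-theoretic centrality - the Shapley value of the coverage game v(S) = |S ∪ N(S)|, which has a simple closed form - on a handful of invented retweet edges. It is not necessarily the centrality measure used in the paper, the usernames are hypothetical, and networkx is assumed.

        import networkx as nx

        def shapley_coverage_centrality(G):
            """Shapley value of the coverage game v(S) = |S ∪ N(S)| (Michalak et al.),
            computable in closed form: SV(i) = sum over u in {i} ∪ N(i) of 1/(1 + deg(u))."""
            return {i: sum(1.0 / (1 + G.degree(u)) for u in set(G[i]) | {i}) for i in G}

        # Hypothetical interaction graph: an edge (a, b) when account a retweeted account b;
        # the real graph would be built from the 152.5M-tweet collection.
        retweets = [("alice", "troll_1"), ("bob", "troll_1"), ("carol", "alice"),
                    ("dave", "alice"), ("erin", "bob")]
        G = nx.Graph(retweets)

        # Global influence ranking over all accounts in the toy graph.
        ranking = sorted(shapley_coverage_centrality(G).items(),
                         key=lambda kv: kv[1], reverse=True)
        print(ranking)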

    A Unified Graph-Based Approach to Disinformation Detection using Contextual and Semantic Relations

    No full text
    As recent events have demonstrated, disinformation spread through social networks can have dire political, economic, and social consequences. Detecting disinformation must inevitably rely on the structure of the network, on users' particularities, and on event occurrence patterns. We present a graph data structure, which we denote as a meta-graph, that combines underlying users' relational event information as well as semantic and topical modeling. We detail the construction of an example meta-graph using Twitter data covering the 2016 US election campaign and then compare the detection of disinformation at the cascade level, using well-known graph neural network algorithms, to the same algorithms applied to the meta-graph nodes. The comparison shows a consistent 3%-4% improvement in accuracy when using the meta-graph, over all considered algorithms, compared to basic cascade classification, and a further 1% increase when topic modeling and sentiment analysis are considered. We carry out the same experiment on two other datasets, HealthRelease and HealthStory, part of the FakeHealth dataset repository, with consistent results. Finally, we discuss further advantages of our approach, such as the ability to augment the graph structure using external data sources and the ease with which multiple meta-graphs can be combined, and we compare our method to other graph-based disinformation detection frameworks.
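
    For readers unfamiliar with GNN-based cascade classification, the following sketch labels meta-graph nodes (one node per cascade, with combined relational/semantic features) using a two-layer GCN, assuming PyTorch Geometric; the random features, edges, and labels are placeholders, not the paper's data or exact architecture.

        import torch
        import torch.nn.functional as F
        from torch_geometric.data import Data
        from torch_geometric.nn import GCNConv

        # Hypothetical meta-graph: each node is a retweet cascade described by a feature
        # vector (user, semantic, and topical features); edges connect related cascades.
        num_nodes, num_feats = 200, 32
        data = Data(x=torch.randn(num_nodes, num_feats),
                    edge_index=torch.randint(0, num_nodes, (2, 800)),
                    y=torch.randint(0, 2, (num_nodes,)))       # 0 = reliable, 1 = disinformation

        class GCN(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.conv1 = GCNConv(num_feats, 64)
                self.conv2 = GCNConv(64, 2)

            def forward(self, x, edge_index):
                x = F.relu(self.conv1(x, edge_index))
                return self.conv2(x, edge_index)

        model = GCN()
        opt = torch.optim.Adam(model.parameters(), lr=0.01)
        for epoch in range(100):
            opt.zero_grad()
            loss = F.cross_entropy(model(data.x, data.edge_index), data.y)
            loss.backward()
            opt.step()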

    A Qualitative Analysis of Illicit Arms Trafficking on Darknet Marketplaces

    No full text
    During the last decade, the dark web has become a playground for criminal and underground activities, such as marketplaces for drugs and guns, as well as illegal content sharing. The dark web is one of the top crime environments highlighted in EUROPOL's Internet Organised Crime Threat Assessment 2021. This paper provides a qualitative study of darknet marketplaces involved in illegal arms trafficking. For this purpose, we implemented a crawler based on the ACHE Python library to collect hidden web pages (onion services) on the Tor network. We gathered data from ten marketplaces recommended by dark web search engines - Ahmia, Deep Search, and Onion Land Search. We provide a first report on the overall landscape of illicit arms trafficking, discussing the range of weapons offered, such as military drones, explosives, and other related products, together with the payment and shipping methods provided by the vendors. The findings verify previous reports from reputable institutions (the United Nations and RAND Europe). Most of these illicit marketplaces are easily accessible to the average user; they are well organized, offer a large variety of firearms, and also provide extensive customer support.
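
    For context on how onion services are reached programmatically, here is a minimal sketch that fetches a single hidden-service page through a locally running Tor client using the requests library; it is not the ACHE-based crawler used in the study, and the onion address is a placeholder.

        import requests

        # Route traffic through Tor's local SOCKS proxy (default port 9050); the
        # 'socks5h' scheme lets Tor resolve the .onion hostname. Requires a running
        # Tor client and the requests[socks] extra (PySocks) to be installed.
        TOR_PROXIES = {"http": "socks5h://127.0.0.1:9050",
                       "https": "socks5h://127.0.0.1:9050"}

        def fetch_onion(url, timeout=60):
            """Fetch one hidden-service page through Tor and return its HTML."""
            resp = requests.get(url, proxies=TOR_PROXIES, timeout=timeout)
            resp.raise_for_status()
            return resp.text

        # Placeholder address; real seeds came from dark web search engines
        # (Ahmia, Deep Search, Onion Land Search).
        html = fetch_onion("http://exampleonionaddress.onion/")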

    HyperGraphDis: Leveraging Hypergraphs for Contextual and Social-Based Disinformation Detection

    No full text
    In light of the growing impact of disinformation on social, economic, and political landscapes, accurate and efficient identification methods are increasingly critical. This paper introduces HyperGraphDis, a novel approach for detecting disinformation on Twitter that employs a hypergraph-based representation to capture (i) the intricate social structures arising from retweet cascades, (ii) relational features among users, and (iii) semantic and topical nuances. Evaluated on four Twitter datasets - focusing on the 2016 U.S. presidential election and the COVID-19 pandemic - HyperGraphDis outperforms existing methods in both accuracy and computational efficiency, underscoring its effectiveness and scalability for tackling the challenges posed by disinformation dissemination. HyperGraphDis displays exceptional performance on a COVID-19-related dataset, achieving a weighted F1 score of approximately 89.5%, a notable improvement of around 4% over the other state-of-the-art methods. Significant reductions in computation time are also observed for both model training and inference: training completes 2.3 to 7.6 times faster than the second-best method across the four datasets, and inference is 1.3 to 6.8 times faster than the state of the art.
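
    To illustrate what a hypergraph representation of retweet cascades looks like in code, the sketch below builds a user-by-cascade incidence matrix from a few invented cascades, assuming NumPy; this shows only the data-structure idea, not the HyperGraphDis model itself.

        import numpy as np

        # Hypothetical hypergraph: each hyperedge is a retweet cascade containing the
        # users who took part in it; users shared across cascades tie the cascades together.
        cascades = {
            "cascade_1": ["alice", "bob", "carol"],
            "cascade_2": ["bob", "dave"],
            "cascade_3": ["carol", "dave", "erin"],
        }

        users = sorted({u for members in cascades.values() for u in members})
        user_idx = {u: i for i, u in enumerate(users)}

        # Incidence matrix H: rows = users, columns = cascades, H[i, j] = 1 when user i
        # participated in cascade j; a model would combine this structure with node features.
        H = np.zeros((len(users), len(cascades)))
        for j, members in enumerate(cascades.values()):
            for u in members:
                H[user_idx[u], j] = 1

        # Cascade-cascade overlap via shared users (a simple clique-expansion view).
        print(H.T @ H)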