93 research outputs found

    Partial Mobilization: Tracking Multilingual Information Flows Amongst Russian Media Outlets and Telegram

    Full text link
    In response to disinformation and propaganda from Russian online media following the Russian invasion of Ukraine, Russian outlets including Russia Today and Sputnik News were banned throughout Europe. Many of these Russian outlets, in order to reach their audiences, began to heavily promote their content on messaging services like Telegram. In this work, to understand this phenomenon, we study how 16 Russian media outlets have interacted with and utilized 732 Telegram channels throughout 2022. To do this, we utilize a multilingual version of the foundational model MPNet to embed articles and Telegram messages in a shared embedding space and semantically compare content. Leveraging a parallelized version of DP-Means clustering, we perform paragraph-level topic/narrative extraction and time-series analysis with Hawkes Processes. With this approach, across our websites, we find between 2.3% (ura.news) and 26.7% (ukraina.ru) of their content originated/resulted from activity on Telegram. Finally, tracking the spread of individual narratives, we measure the rate at which these websites and channels disseminate content within the Russian media ecosystem

    TATA: Stance Detection via Topic-Agnostic and Topic-Aware Embeddings

    Full text link
    Stance detection is important for understanding different attitudes and beliefs on the Internet. However, given that a passage's stance toward a given topic is often highly dependent on that topic, building a stance detection model that generalizes to unseen topics is difficult. In this work, we propose using contrastive learning as well as an unlabeled dataset of news articles that cover a variety of different topics to train topic-agnostic/TAG and topic-aware/TAW embeddings for use in downstream stance detection. Combining these embeddings in our full TATA model, we achieve state-of-the-art performance across several public stance detection datasets (0.771 F1F_1-score on the Zero-shot VAST dataset). We release our code and data at https://github.com/hanshanley/tata.Comment: Accepted to EMNLP 2023; Updated citation

    Fast Internet-Wide Scanning: A New Security Perspective

    Full text link
    Techniques like passive observation and random sampling let researchers understand many aspects of Internet day-to-day operation, yet these methodologies often focus on popular services or a small demographic of users, rather than providing a comprehensive view of the devices and services that constitute the Internet. As the diversity of devices and the role they play in critical infrastructure increases, so does understanding the dynamics of and securing these hosts. This dissertation shows how fast Internet-wide scanning provides a near-global perspective of edge hosts that enables researchers to uncover security weaknesses that only emerge at scale. First, I show that it is possible to efficiently scan the IPv4 address space. ZMap: a network scanner specifically architected for large-scale research studies can survey the entire IPv4 address space from a single machine in under an hour at 97% of the theoretical maximum speed of gigabit Ethernet with an estimated 98% coverage of publicly available hosts. Building on ZMap, I introduce Censys, a public service that maintains up-to-date and legacy snapshots of the hosts and services running across the public IPv4 address space. Censys enables researchers to efficiently ask a range of security questions. Next, I present four case studies that highlight how Internet-wide scanning can identify new classes of weaknesses that only emerge at scale, uncover unexpected attacks, shed light on previously opaque distributed systems on the Internet, and understand the impact of consequential vulnerabilities. Finally, I explore how in- creased contention over IPv4 addresses introduces new challenges for performing large-scale empirical studies. I conclude with suggested directions that the re- search community needs to consider to retain the degree of visibility that Internet-wide scanning currently provides.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/138660/1/zakir_1.pd

    Watch Your Language: Large Language Models and Content Moderation

    Full text link
    Large language models (LLMs) have exploded in popularity due to their ability to perform a wide array of natural language tasks. Text-based content moderation is one LLM use case that has received recent enthusiasm, however, there is little research investigating how LLMs perform in content moderation settings. In this work, we evaluate a suite of modern, commercial LLMs (GPT-3, GPT-3.5, GPT-4) on two common content moderation tasks: rule-based community moderation and toxic content detection. For rule-based community moderation, we construct 95 LLM moderation-engines prompted with rules from 95 Reddit subcommunities and find that LLMs can be effective at rule-based moderation for many communities, achieving a median accuracy of 64% and a median precision of 83%. For toxicity detection, we find that LLMs significantly outperform existing commercially available toxicity classifiers. However, we also find that recent increases in model size add only marginal benefit to toxicity detection, suggesting a potential performance plateau for LLMs on toxicity detection tasks. We conclude by outlining avenues for future work in studying LLMs and content moderation

    Twits, Toxic Tweets, and Tribal Tendencies: Trends in Politically Polarized Posts on Twitter

    Full text link
    Social media platforms are often blamed for exacerbating political polarization and worsening public dialogue. Many claim hyperpartisan users post pernicious content, slanted to their political views, inciting contentious and toxic conversations. However, what factors, actually contribute to increased online toxicity and negative interactions? In this work, we explore the role that political ideology plays in contributing to toxicity both on an individual user level and a topic level on Twitter. To do this, we train and open-source a DeBERTa-based toxicity detector with a contrastive objective that outperforms the Google Jigsaw Persective Toxicity detector on the Civil Comments test dataset. Then, after collecting 187 million tweets from 55,415 Twitter users, we determine how several account-level characteristics, including political ideology and account age, predict how often each user posts toxic content. Running a linear regression, we find that the diversity of views and the toxicity of the other accounts with which that user engages has a more marked effect on their own toxicity. Namely, toxic comments are correlated with users who engage with a wider array of political views. Performing topic analysis on the toxic content posted by these accounts using the large language model MPNet and a version of the DP-Means clustering algorithm, we find similar behavior across 6,592 individual topics, with conversations on each topic becoming more toxic as a wider diversity of users become involved

    The Code the World Depends On: A First Look at Technology Makers' Open Source Software Dependencies

    Full text link
    Open-source software (OSS) supply chain security has become a topic of concern for organizations. Patching an OSS vulnerability can require updating other dependent software products in addition to the original package. However, the landscape of OSS dependencies is not well explored: we do not know what packages are most critical to patch, hindering efforts to improve OSS security where it is most needed. There is thus a need to understand OSS usage in major software and device makers' products. Our work takes a first step toward closing this knowledge gap. We investigate published OSS dependency information for 108 major software and device makers, cataloging how available and how detailed this information is and identifying the OSS packages that appear the most frequently in our data

    Data-driven curation, learning and analysis for inferring evolving IoT botnets in the wild

    Get PDF
    © 2019 Association for Computing Machinery. The insecurity of the Internet-of-Things (IoT) paradigm continues to wreak havoc in consumer and critical infrastructure realms. Several challenges impede addressing IoT security at large, including, the lack of IoT-centric data that can be collected, analyzed and correlated, due to the highly heterogeneous nature of such devices and their widespread deployments in Internet-wide environments. To this end, this paper explores macroscopic, passive empirical data to shed light on this evolving threat phenomena. This not only aims at classifying and inferring Internet-scale compromised IoT devices by solely observing such one-way network traffic, but also endeavors to uncover, track and report on orchestrated “in the wild” IoT botnets. Initially, to prepare the effective utilization of such data, a novel probabilistic model is designed and developed to cleanse such traffic from noise samples (i.e., misconfiguration traffic). Subsequently, several shallow and deep learning models are evaluated to ultimately design and develop a multi-window convolution neural network trained on active and passive measurements to accurately identify compromised IoT devices. Consequently, to infer orchestrated and unsolicited activities that have been generated by well-coordinated IoT botnets, hierarchical agglomerative clustering is deployed by scrutinizing a set of innovative and efficient network feature sets. By analyzing 3.6 TB of recent darknet traffic, the proposed approach uncovers a momentous 440,000 compromised IoT devices and generates evidence-based artifacts related to 350 IoT botnets. While some of these detected botnets refer to previously documented campaigns such as the Hide and Seek, Hajime and Fbot, other events illustrate evolving threats such as those with cryptojacking capabilities and those that are targeting industrial control system communication and control services

    Stratosphere: Finding Vulnerable Cloud Storage Buckets

    Full text link
    Misconfigured cloud storage buckets have leaked hundreds of millions of medical, voter, and customer records. These breaches are due to a combination of easily-guessable bucket names and error-prone security configurations, which, together, allow attackers to easily guess and access sensitive data. In this work, we investigate the security of buckets, finding that prior studies have largely underestimated cloud insecurity by focusing on simple, easy-to-guess names. By leveraging prior work in the password analysis space, we introduce Stratosphere, a system that learns how buckets are named in practice in order to efficiently guess the names of vulnerable buckets. Using Stratosphere, we find wide-spread exploitation of buckets and vulnerable configurations continuing to increase over the years. We conclude with recommendations for operators, researchers, and cloud providers.Comment: Proceedings of the 24th International Symposium on Research in Attacks, Intrusions and Defenses. 202

    A Golden Age: Conspiracy Theories' Relationship with Misinformation Outlets, News Media, and the Wider Internet

    Full text link
    Do we live in a "Golden Age of Conspiracy Theories?" In the last few decades, conspiracy theories have proliferated on the Internet with some having dangerous real-world consequences. A large contingent of those who participated in the January 6th attack on the US Capitol believed fervently in the QAnon conspiracy theory. In this work, we study the relationships amongst five prominent conspiracy theories (QAnon, COVID, UFO/Aliens, 9-11, and Flat-Earth) and each of their respective relationships to the news media, both mainstream and fringe. Identifying and publishing a set of 755 different conspiracy theory websites dedicated to our five conspiracy theories, we find that each set often hyperlinks to the same external domains, with COVID and QAnon conspiracy theory websites largest amount of shared connections. Examining the role of news media, we further find that not only do outlets known for spreading misinformation hyperlink to our set of conspiracy theory websites more often than mainstream websites but this hyperlinking has increased dramatically between 2018 and 2021, with the advent of QAnon and the start of COVID-19 pandemic. Using partial Granger-causality, we uncover several positive correlative relationships between the hyperlinks from misinformation websites and the popularity of conspiracy theory websites, suggesting the prominent role that misinformation news outlets play in popularizing many conspiracy theories

    Happenstance: Utilizing Semantic Search to Track Russian State Media Narratives about the Russo-Ukrainian War On Reddit

    Full text link
    In the buildup to and in the weeks following the Russian Federation's invasion of Ukraine, Russian state media outlets output torrents of misleading and outright false information. In this work, we study this coordinated information campaign in order to understand the most prominent state media narratives touted by the Russian government to English-speaking audiences. To do this, we first perform sentence-level topic analysis using the large-language model MPNet on articles published by ten different pro-Russian propaganda websites including the new Russian "fact-checking" website waronfakes.com. Within this ecosystem, we show that smaller websites like katehon.com were highly effective at publishing topics that were later echoed by other Russian sites. After analyzing this set of Russian information narratives, we then analyze their correspondence with narratives and topics of discussion on the r/Russia and 10 other political subreddits. Using MPNet and a semantic search algorithm, we map these subreddits' comments to the set of topics extracted from our set of Russian websites, finding that 39.6% of r/Russia comments corresponded to narratives from pro-Russian propaganda websites compared to 8.86% on r/politics.Comment: Accepted to ICWSM 202
    corecore