93 research outputs found
Partial Mobilization: Tracking Multilingual Information Flows Amongst Russian Media Outlets and Telegram
In response to disinformation and propaganda from Russian online media
following the Russian invasion of Ukraine, Russian outlets including Russia
Today and Sputnik News were banned throughout Europe. Many of these Russian
outlets, in order to reach their audiences, began to heavily promote their
content on messaging services like Telegram. In this work, to understand this
phenomenon, we study how 16 Russian media outlets have interacted with and
utilized 732 Telegram channels throughout 2022. To do this, we utilize a
multilingual version of the foundational model MPNet to embed articles and
Telegram messages in a shared embedding space and semantically compare content.
Leveraging a parallelized version of DP-Means clustering, we perform
paragraph-level topic/narrative extraction and time-series analysis with Hawkes
Processes. With this approach, across our websites, we find between 2.3%
(ura.news) and 26.7% (ukraina.ru) of their content originated/resulted from
activity on Telegram. Finally, tracking the spread of individual narratives, we
measure the rate at which these websites and channels disseminate content
within the Russian media ecosystem.
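The DP-Means clustering step named in the abstract above can be illustrated with a minimal serial sketch. This is not the authors' parallelized implementation: it assumes squared Euclidean distance on small toy vectors rather than MPNet embeddings, and the function name `dp_means` and the penalty value `lam` are hypothetical. The idea is that a point whose squared distance to every existing centroid exceeds `lam` opens a new cluster, so the number of clusters does not have to be chosen in advance.

```python
def dp_means(points, lam, max_iter=100):
    """Serial DP-Means sketch: k-means-style updates, but a point farther
    than `lam` (squared distance) from every centroid starts a new cluster."""
    dim = len(points[0])
    # Start with a single cluster located at the global mean.
    centroids = [[sum(p[d] for p in points) / len(points) for d in range(dim)]]
    assignments = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            k = min(range(len(dists)), key=dists.__getitem__)
            if dists[k] > lam:
                # Too far from everything: open a new cluster at this point.
                centroids.append(list(p))
                k = len(centroids) - 1
            if assignments[i] != k:
                assignments[i] = k
                changed = True
        # Recompute each non-empty centroid as the mean of its members.
        for k in range(len(centroids)):
            members = [points[i] for i in range(len(points)) if assignments[i] == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
        if not changed:
            break
    # Drop clusters that ended up empty and relabel the rest compactly.
    used = sorted(set(assignments))
    relabel = {old: new for new, old in enumerate(used)}
    return [relabel[a] for a in assignments], [centroids[k] for k in used]


toy_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
labels, centers = dp_means(toy_points, lam=1.0)
```

On the toy data, the two nearby pairs end up in separate clusters; a larger `lam` would merge them, which is the trade-off the penalty controls.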
TATA: Stance Detection via Topic-Agnostic and Topic-Aware Embeddings
Stance detection is important for understanding different attitudes and
beliefs on the Internet. However, given that a passage's stance toward a given
topic is often highly dependent on that topic, building a stance detection
model that generalizes to unseen topics is difficult. In this work, we propose
using contrastive learning as well as an unlabeled dataset of news articles
that cover a variety of different topics to train topic-agnostic/TAG and
topic-aware/TAW embeddings for use in downstream stance detection. Combining
these embeddings in our full TATA model, we achieve state-of-the-art
performance across several public stance detection datasets (0.771 F1-score
on the Zero-shot VAST dataset). We release our code and data at
https://github.com/hanshanley/tata.
Comment: Accepted to EMNLP 2023; Updated citation
Fast Internet-Wide Scanning: A New Security Perspective
Techniques like passive observation and random sampling let researchers understand many aspects of the Internet's day-to-day operation, yet these methodologies often focus on popular services or a small demographic of users rather than providing a comprehensive view of the devices and services that constitute the Internet. As the diversity of devices and the role they play in critical infrastructure increase, so does the importance of understanding the dynamics of and securing these hosts. This dissertation shows how fast Internet-wide scanning provides a near-global perspective of edge hosts that enables researchers to uncover security weaknesses that only emerge at scale.
First, I show that it is possible to efficiently scan the IPv4 address space. ZMap, a network scanner specifically architected for large-scale research studies, can survey the entire IPv4 address space from a single machine in under an hour at 97% of the theoretical maximum speed of gigabit Ethernet with an estimated 98% coverage of publicly available hosts. Building on ZMap, I introduce Censys, a public service that maintains up-to-date and legacy snapshots of the hosts and services running across the public IPv4 address space. Censys enables researchers to efficiently ask a range of security questions.
Next, I present four case studies that highlight how Internet-wide scanning can identify new classes of weaknesses that only emerge at scale, uncover unexpected attacks, shed light on previously opaque distributed systems on the Internet, and understand the impact of consequential vulnerabilities. Finally, I explore how increased contention over IPv4 addresses introduces new challenges for performing large-scale empirical studies. I conclude with suggested directions that the research community needs to consider to retain the degree of visibility that Internet-wide scanning currently provides.
PHD; Computer Science & Engineering; University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/138660/1/zakir_1.pd
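The scan-time figure quoted in the dissertation abstract above (the full IPv4 space in under an hour at 97% of gigabit line rate) can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an illustration under stated assumptions, not a description of how ZMap works: it assumes one minimum-size Ethernet frame per address (64 bytes plus 8 bytes of preamble and a 12-byte inter-frame gap) and probes all 2^32 addresses, whereas a real scan would skip reserved ranges. The function name `scan_minutes` is hypothetical.

```python
# Assumption: one minimum-size Ethernet frame per probed address.
FRAME_BYTES_ON_WIRE = 64 + 8 + 12   # minimum frame + preamble + inter-frame gap
LINK_BITS_PER_SECOND = 1_000_000_000  # gigabit Ethernet
ADDRESSES = 2 ** 32                  # entire IPv4 space, reserved ranges included


def scan_minutes(efficiency):
    """Minutes needed to send one probe to every address at the given
    fraction of the theoretical maximum packet rate."""
    packets_per_second = LINK_BITS_PER_SECOND / (FRAME_BYTES_ON_WIRE * 8) * efficiency
    return ADDRESSES / packets_per_second / 60


at_line_rate = scan_minutes(1.0)
at_97_percent = scan_minutes(0.97)
```

Under these assumptions both values come out a little under 50 minutes, which is consistent with the "under an hour" claim.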
Watch Your Language: Large Language Models and Content Moderation
Large language models (LLMs) have exploded in popularity due to their ability
to perform a wide array of natural language tasks. Text-based content
moderation is one LLM use case that has received recent enthusiasm; however,
there is little research investigating how LLMs perform in content moderation
settings. In this work, we evaluate a suite of modern, commercial LLMs (GPT-3,
GPT-3.5, GPT-4) on two common content moderation tasks: rule-based community
moderation and toxic content detection. For rule-based community moderation, we
construct 95 LLM moderation-engines prompted with rules from 95 Reddit
subcommunities and find that LLMs can be effective at rule-based moderation for
many communities, achieving a median accuracy of 64% and a median precision of
83%. For toxicity detection, we find that LLMs significantly outperform
existing commercially available toxicity classifiers. However, we also find
that recent increases in model size add only marginal benefit to toxicity
detection, suggesting a potential performance plateau for LLMs on toxicity
detection tasks. We conclude by outlining avenues for future work in studying
LLMs and content moderation.
Twits, Toxic Tweets, and Tribal Tendencies: Trends in Politically Polarized Posts on Twitter
Social media platforms are often blamed for exacerbating political
polarization and worsening public dialogue. Many claim hyperpartisan users post
pernicious content, slanted to their political views, inciting contentious and
toxic conversations. However, what factors actually contribute to increased
online toxicity and negative interactions? In this work, we explore the role
that political ideology plays in contributing to toxicity both on an individual
user level and a topic level on Twitter. To do this, we train and open-source a
DeBERTa-based toxicity detector with a contrastive objective that outperforms
the Google Jigsaw Perspective Toxicity detector on the Civil Comments test
dataset. Then, after collecting 187 million tweets from 55,415 Twitter users,
we determine how several account-level characteristics, including political
ideology and account age, predict how often each user posts toxic content.
Running a linear regression, we find that the diversity of views and the
toxicity of the other accounts with which that user engages have a more marked
effect on their own toxicity. Namely, toxic comments are correlated with users
who engage with a wider array of political views. Performing topic analysis on
the toxic content posted by these accounts using the large language model MPNet
and a version of the DP-Means clustering algorithm, we find similar behavior
across 6,592 individual topics, with conversations on each topic becoming more
toxic as a wider diversity of users become involved.
The Code the World Depends On: A First Look at Technology Makers' Open Source Software Dependencies
Open-source software (OSS) supply chain security has become a topic of
concern for organizations. Patching an OSS vulnerability can require updating
other dependent software products in addition to the original package. However,
the landscape of OSS dependencies is not well explored: we do not know what
packages are most critical to patch, hindering efforts to improve OSS security
where it is most needed. There is thus a need to understand OSS usage in major
software and device makers' products. Our work takes a first step toward
closing this knowledge gap. We investigate published OSS dependency information
for 108 major software and device makers, cataloging how available and how
detailed this information is and identifying the OSS packages that appear the
most frequently in our data.
Data-driven curation, learning and analysis for inferring evolving IoT botnets in the wild
© 2019 Association for Computing Machinery. The insecurity of the Internet-of-Things (IoT) paradigm continues to wreak havoc in consumer and critical infrastructure realms. Several challenges impede addressing IoT security at large, including the lack of IoT-centric data that can be collected, analyzed and correlated, due to the highly heterogeneous nature of such devices and their widespread deployments in Internet-wide environments. To this end, this paper explores macroscopic, passive empirical data to shed light on this evolving threat phenomenon. This not only aims at classifying and inferring Internet-scale compromised IoT devices by solely observing such one-way network traffic, but also endeavors to uncover, track and report on orchestrated “in the wild” IoT botnets. Initially, to prepare the effective utilization of such data, a novel probabilistic model is designed and developed to cleanse such traffic from noise samples (i.e., misconfiguration traffic). Subsequently, several shallow and deep learning models are evaluated to ultimately design and develop a multi-window convolutional neural network trained on active and passive measurements to accurately identify compromised IoT devices. Consequently, to infer orchestrated and unsolicited activities that have been generated by well-coordinated IoT botnets, hierarchical agglomerative clustering is deployed by scrutinizing a set of innovative and efficient network feature sets. By analyzing 3.6 TB of recent darknet traffic, the proposed approach uncovers a momentous 440,000 compromised IoT devices and generates evidence-based artifacts related to 350 IoT botnets. While some of these detected botnets refer to previously documented campaigns such as the Hide and Seek, Hajime and Fbot, other events illustrate evolving threats such as those with cryptojacking capabilities and those that are targeting industrial control system communication and control services.
Stratosphere: Finding Vulnerable Cloud Storage Buckets
Misconfigured cloud storage buckets have leaked hundreds of millions of
medical, voter, and customer records. These breaches are due to a combination
of easily-guessable bucket names and error-prone security configurations,
which, together, allow attackers to easily guess and access sensitive data. In
this work, we investigate the security of buckets, finding that prior studies
have largely underestimated cloud insecurity by focusing on simple,
easy-to-guess names. By leveraging prior work in the password analysis space,
we introduce Stratosphere, a system that learns how buckets are named in
practice in order to efficiently guess the names of vulnerable buckets. Using
Stratosphere, we find widespread exploitation of buckets and vulnerable
configurations continuing to increase over the years. We conclude with
recommendations for operators, researchers, and cloud providers.
Comment: Proceedings of the 24th International Symposium on Research in
Attacks, Intrusions and Defenses. 202
A Golden Age: Conspiracy Theories' Relationship with Misinformation Outlets, News Media, and the Wider Internet
Do we live in a "Golden Age of Conspiracy Theories?" In the last few decades,
conspiracy theories have proliferated on the Internet with some having
dangerous real-world consequences. A large contingent of those who participated
in the January 6th attack on the US Capitol believed fervently in the QAnon
conspiracy theory. In this work, we study the relationships amongst five
prominent conspiracy theories (QAnon, COVID, UFO/Aliens, 9-11, and Flat-Earth)
and each of their respective relationships to the news media, both mainstream
and fringe. Identifying and publishing a set of 755 different conspiracy theory
websites dedicated to our five conspiracy theories, we find that each set often
hyperlinks to the same external domains, with COVID and QAnon conspiracy theory
websites having the largest number of shared connections. Examining the role of news
media, we further find that not only do outlets known for spreading
misinformation hyperlink to our set of conspiracy theory websites more often
than mainstream websites but this hyperlinking has increased dramatically
between 2018 and 2021, with the advent of QAnon and the start of the COVID-19
pandemic. Using partial Granger-causality, we uncover several positive
correlative relationships between the hyperlinks from misinformation websites
and the popularity of conspiracy theory websites, suggesting the prominent role
that misinformation news outlets play in popularizing many conspiracy theories.
Happenstance: Utilizing Semantic Search to Track Russian State Media Narratives about the Russo-Ukrainian War On Reddit
In the buildup to and in the weeks following the Russian Federation's
invasion of Ukraine, Russian state media outlets output torrents of misleading
and outright false information. In this work, we study this coordinated
information campaign in order to understand the most prominent state media
narratives touted by the Russian government to English-speaking audiences. To
do this, we first perform sentence-level topic analysis using the
large language model MPNet on articles published by ten different pro-Russian
propaganda websites including the new Russian "fact-checking" website
waronfakes.com. Within this ecosystem, we show that smaller websites like
katehon.com were highly effective at publishing topics that were later echoed
by other Russian sites. After analyzing this set of Russian information
narratives, we then analyze their correspondence with narratives and topics of
discussion on r/Russia and 10 other political subreddits. Using MPNet and a
semantic search algorithm, we map these subreddits' comments to the set of
topics extracted from our set of Russian websites, finding that 39.6% of
r/Russia comments corresponded to narratives from pro-Russian propaganda
websites compared to 8.86% on r/politics.
Comment: Accepted to ICWSM 202
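The comment-to-topic mapping described in the abstract above can be pictured as a nearest-neighbor lookup under cosine similarity. The sketch below is only an illustration of that general idea, not the paper's semantic search algorithm: the short toy vectors stand in for MPNet embeddings, and the names `cosine` and `match_topic` and the 0.8 threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_topic(comment_vec, topic_vecs, threshold=0.8):
    """Return the name of the most similar topic, or None if no topic
    reaches the (hypothetical) similarity threshold."""
    best_name, best_sim = None, threshold
    for name, vec in topic_vecs.items():
        sim = cosine(comment_vec, vec)
        if sim >= best_sim:
            best_name, best_sim = name, sim
    return best_name


# Toy stand-ins for topic embeddings extracted from the websites.
toy_topics = {"topic_a": [1.0, 0.0], "topic_b": [0.0, 1.0]}
close_match = match_topic([0.9, 0.1], toy_topics)
no_match = match_topic([1.0, 1.0], toy_topics)
```

A comment that falls below the threshold for every topic is left unmatched, which is what lets a share such as "39.6% of comments corresponded to narratives" be computed over a whole subreddit.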
