559 research outputs found
EveTAR: Building a Large-Scale Multi-Task Test Collection over Arabic Tweets
This article introduces a new language-independent approach for creating a
large-scale high-quality test collection of tweets that supports multiple
information retrieval (IR) tasks without running a shared-task campaign. The
adopted approach (demonstrated over Arabic tweets) designs the collection
around significant (i.e., popular) events, which enables the development of
topics that represent frequent information needs of Twitter users for which
rich content exists. That inherently facilitates the support of multiple tasks
that generally revolve around events, namely event detection, ad-hoc search,
timeline generation, and real-time summarization. The key highlights of the
approach include diversifying the judgment pool via interactive search and
multiple manually-crafted queries per topic, collecting high-quality
annotations via crowd-workers for relevancy and in-house annotators for
novelty, filtering out low-agreement topics and inaccessible tweets, and
providing multiple subsets of the collection for better availability. Applying
our methodology on Arabic tweets resulted in EveTAR , the first
freely-available tweet test collection for multiple IR tasks. EveTAR includes a
crawl of 355M Arabic tweets and covers 50 significant events for which about
62K tweets were judged with substantial average inter-annotator agreement
(Kappa value of 0.71). We demonstrate the usability of EveTAR by evaluating
existing algorithms in the respective tasks. Results indicate that the new
collection can support reliable ranking of IR systems that is comparable to
similar TREC collections, while providing strong baseline results for future
studies over Arabic tweets
Understanding Bots on Social Media - An Application in Disaster Response
abstract: Social media has become a primary platform for real-time information sharing among users. News on social media spreads faster than traditional outlets and millions of users turn to this platform to receive the latest updates on major events especially disasters. Social media bridges the gap between the people who are affected by disasters, volunteers who offer contributions, and first responders. On the other hand, social media is a fertile ground for malicious users who purposefully disturb the relief processes facilitated on social media. These malicious users take advantage of social bots to overrun social media posts with fake images, rumors, and false information. This process causes distress and prevents actionable information from reaching the affected people. Social bots are automated accounts that are controlled by a malicious user and these bots have become prevalent on social media in recent years.
In spite of existing efforts towards understanding and removing bots on social media, there are at least two drawbacks associated with the current bot detection algorithms: general-purpose bot detection methods are designed to be conservative and not label a user as a bot unless the algorithm is highly confident and they overlook the effect of users who are manipulated by bots and (unintentionally) spread their content. This study is trifold. First, I design a Machine Learning model that uses content and context of social media posts to detect actionable ones among them; it specifically focuses on tweets in which people ask for help after major disasters. Second, I focus on bots who can be a facilitator of malicious content spreading during disasters. I propose two methods for detecting bots on social media with a focus on the recall of the detection. Third, I study the characteristics of users who spread the content of malicious actors. These features have the potential to improve methods that detect malicious content such as fake news.Dissertation/ThesisDoctoral Dissertation Computer Science 201
Automatic summarization of real world events using Twitter
Microblogging sites, such as Twitter, have become increasingly popular in recent years for reporting details of real world events via the Web. Smartphone apps enable people to communicate with a global audience to express their opinion and commentate on ongoing situations - often while geographically proximal to the event. Due to the heterogeneity and scale of the data and the fact that some messages are more salient than others for the purposes of understanding any risk to human safety and managing any disruption caused by events, automatic summarization of event-related microblogs is a non-trivial and important problem. In this paper we tackle the task of automatic summarization of Twitter posts, and present three methods that produce summaries by selecting the most representative posts from real-world tweet-event clusters. To evaluate our approaches, we compare them to the state-of-the-art summarization systems and human generated summaries. Our results show that our proposed methods outperform all the other summarization systems for English and non-English corpora
Developing natural language processing instruments to study sociotechnical systems
Identifying temporal linguistic patterns and tracing social amplification across communities has always been vital to understanding modern sociotechnical systems. Now, well into the age of information technology, the growing digitization of text archives powered by machine learning systems has enabled an enormous number of interdisciplinary studies to examine the coevolution of language and culture. However, most research in that domain investigates formal textual records, such as books and newspapers. In this work, I argue that the study of conversational text derived from social media is just as important. I present four case studies to identify and investigate societal developments in longitudinal social media streams with high temporal resolution spanning over 100 languages. These case studies show how everyday conversations on social media encode a unique perspective that is often complementary to observations derived from more formal texts. This unique perspective improves our understanding of modern sociotechnical systems and enables future research in computational linguistics, social science, and behavioral science
Overview of CheckThat! 2020: Automatic Identification and Verification of Claims in Social Media
We present an overview of the third edition of the CheckThat! Lab at CLEF
2020. The lab featured five tasks in two different languages: English and
Arabic. The first four tasks compose the full pipeline of claim verification in
social media: Task 1 on check-worthiness estimation, Task 2 on retrieving
previously fact-checked claims, Task 3 on evidence retrieval, and Task 4 on
claim verification. The lab is completed with Task 5 on check-worthiness
estimation in political debates and speeches. A total of 67 teams registered to
participate in the lab (up from 47 at CLEF 2019), and 23 of them actually
submitted runs (compared to 14 at CLEF 2019). Most teams used deep neural
networks based on BERT, LSTMs, or CNNs, and achieved sizable improvements over
the baselines on all tasks. Here we describe the tasks setup, the evaluation
results, and a summary of the approaches used by the participants, and we
discuss some lessons learned. Last but not least, we release to the research
community all datasets from the lab as well as the evaluation scripts, which
should enable further research in the important tasks of check-worthiness
estimation and automatic claim verification.Comment: Check-Worthiness Estimation, Fact-Checking, Veracity, Evidence-based
Verification, Detecting Previously Fact-Checked Claims, Social Media
Verification, Computational Journalism, COVID-1
- …