Chain-of-Thought Embeddings for Stance Detection on Social Media
Stance detection on social media is challenging for Large Language Models
(LLMs), as emerging slang and colloquial language in online conversations often
contain deeply implicit stance labels. Chain-of-Thought (COT) prompting has
recently been shown to improve performance on stance detection tasks --
alleviating some of these issues. However, COT prompting still struggles with
implicit stance identification. This challenge arises because many samples are
initially challenging to comprehend before a model becomes familiar with the
slang and evolving knowledge related to different topics, all of which need to
be acquired through the training data. In this study, we address this problem
by introducing COT Embeddings which improve COT performance on stance detection
tasks by embedding COT reasonings and integrating them into a traditional
RoBERTa-based stance detection pipeline. Our analysis demonstrates that (1)
text encoders can leverage COT reasonings despite minor errors or
hallucinations that would otherwise distort the COT output label, and (2) text
encoders can overlook misleading COT reasoning when a sample's prediction
heavily depends on domain-specific patterns. Our model achieves SOTA
performance on multiple stance detection datasets collected from social media.
Comment: Accepted at EMNLP 2023, 8 pages
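The fusion step described in this abstract can be sketched in miniature. Here `encode` is an invented stand-in for a RoBERTa sentence encoder (the real pipeline would embed text with the fine-tuned model), and the concatenated vector would feed a standard classifier head:

```python
import hashlib
import numpy as np

def encode(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in for a RoBERTa sentence encoder: a deterministic
    pseudo-embedding derived from a hash of the text (illustration only)."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).standard_normal(dim)

def cot_embedding_features(post: str, cot_reasoning: str) -> np.ndarray:
    """Fuse the post embedding with the embedded CoT reasoning by
    concatenation; a downstream classifier head predicts the stance, so
    minor errors in the reasoning text need not flip the output label."""
    return np.concatenate([encode(post), encode(cot_reasoning)])

features = cot_embedding_features(
    "that ref absolutely robbed us lol",
    "The post uses slang to express frustration, implying a negative stance.",
)
print(features.shape)
```

Because the label comes from a trained head over the fused vector rather than from the CoT text directly, a slightly wrong reasoning chain contributes a noisy feature instead of a hard-wrong prediction.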
Learning to Identify Ambiguous and Misleading News Headlines
Accuracy is one of the basic principles of journalism. However, it is
increasingly hard to manage due to the diversity of news media. Some editors of
online news tend to use catchy headlines which trick readers into clicking.
These headlines are either ambiguous or misleading, degrading the reading
experience of the audience. Thus, identifying inaccurate news headlines is a
task worth studying. Previous work names these headlines "clickbaits" and
mainly focuses on the features extracted from the headlines, which limits the
performance since the consistency between headlines and news bodies is
underappreciated. In this paper, we clearly redefine the problem and identify
ambiguous and misleading headlines separately. We utilize class sequential
rules to exploit structure information when detecting ambiguous headlines. For
the identification of misleading headlines, we extract features based on the
congruence between headlines and bodies. To make use of the large unlabeled
data set, we apply a co-training method and gain an increase in performance.
The experimental results show the effectiveness of our methods. We then use our
classifiers to detect inaccurate headlines crawled from different sources and
conduct a data analysis.
Comment: Accepted by IJCAI 201
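The co-training step mentioned in this abstract can be sketched as follows. The nearest-centroid models and the two synthetic feature views are illustrative stand-ins for the paper's headline-based and headline-body-congruence classifiers:

```python
import numpy as np

def centroid_clf(X, y):
    """Tiny nearest-centroid model, a stand-in for the two view-specific
    classifiers in co-training (illustration only)."""
    c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    def predict(Q):
        d0 = np.linalg.norm(Q - c0, axis=1)
        d1 = np.linalg.norm(Q - c1, axis=1)
        return (d1 < d0).astype(int), np.abs(d0 - d1)  # labels, confidence
    return predict

def co_training_round(Xa, Xb, y, labeled, k=1):
    """One co-training step: train one model per feature view on the labeled
    pool, then let each view pseudo-label its k most confident unlabeled
    samples for the shared pool."""
    y, labeled = y.copy(), labeled.copy()
    for X in (Xa, Xb):
        unlabeled = np.flatnonzero(~labeled)
        if unlabeled.size == 0:
            break
        labels, conf = centroid_clf(X[labeled], y[labeled])(X[unlabeled])
        order = np.argsort(-conf)[:k]
        y[unlabeled[order]] = labels[order]
        labeled[unlabeled[order]] = True
    return y, labeled

rng = np.random.default_rng(0)
# Two synthetic "views" of 20 samples, two well-separated classes each.
Xa = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])
Xb = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])
y = np.where(np.arange(20) < 10, 0, 1)
labeled = np.isin(np.arange(20), [0, 10])  # only two seed samples labeled
y = np.where(labeled, y, -1)
y2, labeled2 = co_training_round(Xa, Xb, y, labeled, k=2)
print(labeled2.sum())  # 2 seeds + 2 pseudo-labels per view = 6
```

Iterating this round grows the labeled pool, which is how the large unlabeled headline set is exploited.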
Is This a Joke? Detecting Humor in Spanish Tweets
While humor has been historically studied from a psychological, cognitive and
linguistic standpoint, its study from a computational perspective is an area
yet to be explored in Computational Linguistics. There is some previous work,
but a characterization of humor that allows its automatic recognition and
generation is far from being specified. In this work we build a
crowdsourced corpus of labeled tweets, annotated according to their humor value,
letting the annotators subjectively decide which are humorous. A humor
classifier for Spanish tweets is assembled based on supervised learning,
reaching a precision of 84% and a recall of 69%.
Comment: Preprint version, without referra
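As a quick sanity check on the reported figures, a precision of 84% and a recall of 69% imply an F1 score of roughly 0.76 (F1 is not stated in the abstract; this is just the harmonic mean):

```python
def f1(precision: float, recall: float) -> float:
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.84, 0.69), 3))  # 0.758
```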
Diagnose network failures via data-plane analysis
Diagnosing problems in networks is a time-consuming and error-prone process. Previous tools to assist operators primarily focus on analyzing control
plane configuration. Configuration analysis is limited in that it cannot find
bugs in router software, and is harder to generalize across protocols since it
must model complex configuration languages and dynamic protocol behavior.
This paper studies an alternate approach: diagnosing problems through
static analysis of the data plane. This approach can catch bugs that are
invisible at the level of configuration files, and simplifies unified analysis of a
network across many protocols and implementations. We present Anteater, a
tool for checking invariants in the data plane. Anteater translates high-level
network invariants into boolean satisfiability problems, checks them against
network state using a SAT solver, and reports counterexamples if violations
have been found. Applied to a large campus network, Anteater revealed 23
bugs, including forwarding loops and stale ACL rules, with only five false
positives. Nine of these faults are being fixed by campus network operators.
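The invariant-checking idea can be illustrated in miniature. Here exhaustive enumeration of the header bits stands in for the SAT solver, and the two-router FIB is invented for the example (Anteater's real encoding and rule format differ):

```python
from itertools import product

def next_hop(fib, node, header):
    """First-match lookup in a toy FIB: rules are (prefix_bits, next_node);
    a next_node of None means the packet is dropped."""
    for prefix, nxt in fib[node]:
        if header.startswith(prefix):
            return nxt
    return None

def find_forwarding_loop(fib, start, n_bits):
    """Check the loop-freedom invariant in the spirit of Anteater, with
    exhaustive header enumeration standing in for the SAT solver: return a
    counterexample header that makes forwarding revisit a node, or None."""
    for bits in product("01", repeat=n_bits):
        header = "".join(bits)
        seen, node = {start}, start
        while (node := next_hop(fib, node, header)) is not None:
            if node in seen:
                return header  # counterexample: this header loops
            seen.add(node)
    return None

# Invented two-router data plane: A forwards 1* to B, but B forwards 10*
# back to A, so any header starting with "10" cycles between A and B.
fib = {
    "A": [("1", "B"), ("0", None)],
    "B": [("10", "A"), ("1", None)],
}
print(find_forwarding_loop(fib, "A", 2))  # '10'
```

A real SAT encoding replaces the enumeration with boolean constraints over header bits, so counterexamples are found without walking all 2^n headers.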
Race, Religion and the City: Twitter Word Frequency Patterns Reveal Dominant Demographic Dimensions in the United States
Recently, numerous approaches have emerged in the social sciences to exploit
the opportunities made possible by the vast amounts of data generated by online
social networks (OSNs). Having access to information about users on such a
scale opens up a range of possibilities, all without the limitations associated
with often slow and expensive paper-based polls. A question that remains to be
satisfactorily addressed, however, is how demography is represented in OSN
content. Here, we study language use in the US using a corpus of text compiled
from over half a billion geo-tagged messages from the online microblogging
platform Twitter. Our intention is to reveal the most important spatial
patterns in language use in an unsupervised manner and relate them to
demographics. Our approach is based on Latent Semantic Analysis (LSA) augmented
with the Robust Principal Component Analysis (RPCA) methodology. We find
spatially correlated patterns that can be interpreted based on the words
associated with them. The main language features can be related to slang use,
urbanization, travel, religion and ethnicity, the patterns of which are shown
to correlate plausibly with traditional census data. Our findings thus validate
the concept of demography being represented in OSN language use and show that
the traits observed are inherently present in the word frequencies without any
previous assumptions about the dataset. Thus, they could form the basis of
further research focusing on the evaluation of demographic data estimation from
other big data sources, or on the dynamical processes that result in the
patterns found here.
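The unsupervised pipeline described above can be sketched with plain LSA on a toy region-by-word frequency matrix (the RPCA step, which separates sparse outliers before the factorization, is omitted here, and the data is invented for illustration):

```python
import numpy as np

# Toy region-by-word frequency matrix (rows: regions, cols: words);
# the real input would be counts from half a billion geo-tagged tweets.
counts = np.array([
    [9, 8, 1, 0],   # regions sharing one vocabulary regime
    [8, 9, 0, 1],
    [1, 0, 9, 8],   # regions sharing a different regime
    [0, 1, 8, 9],
], dtype=float)

def lsa_components(X, k=2):
    """Plain LSA: center the frequency matrix and take the top-k singular
    directions; rows of the score matrix are spatial patterns that can be
    interpreted via their word loadings."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k]  # region scores, word loadings

scores, loadings = lsa_components(counts)
print(scores.shape)  # (4, 2)
```

On this toy input the first component separates the two vocabulary regimes; in the paper, such components are then related to census variables like urbanization, religion, and ethnicity.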