SNaC: Coherence Error Detection for Narrative Summarization
Progress in summarizing long texts is inhibited by the lack of appropriate
evaluation frameworks. When a long summary must be produced to appropriately
cover the facets of that text, that summary needs to present a coherent
narrative to be understandable by a reader, but current automatic and human
evaluation methods fail to identify gaps in coherence. In this work, we
introduce SNaC, a narrative coherence evaluation framework rooted in
fine-grained annotations for long summaries. We develop a taxonomy of coherence
errors in generated narrative summaries and collect span-level annotations for
6.6k sentences across 150 book and movie screenplay summaries. Our work
provides the first characterization of coherence errors generated by
state-of-the-art summarization models and a protocol for eliciting coherence
judgments from crowd annotators. Furthermore, we show that the collected
annotations allow us to train a strong classifier for automatically localizing
coherence errors in generated summaries as well as benchmarking past work in
coherence modeling. Finally, our SNaC framework can support future work in long
document summarization and coherence evaluation, including improved
summarization modeling and post-hoc summary correction.
Comment: EMNLP 202
Sarcasm Detection in a Disaster Context
During natural disasters, people often use social media platforms such as
Twitter to ask for help, to provide information about the disaster situation,
or to express contempt about the unfolding event or public policies and
guidelines. This contempt is in some cases expressed as sarcasm or irony.
Understanding this form of speech in a disaster-centric context is essential to
improving natural language understanding of disaster-related tweets. In this
paper, we introduce HurricaneSARC, a dataset of 15,000 tweets annotated for
intended sarcasm, and provide a comprehensive investigation of sarcasm
detection using pre-trained language models. Our best model is able to obtain
as much as 0.70 F1 on our dataset. We also demonstrate that the performance on
HurricaneSARC can be improved by leveraging intermediate task transfer
learning. We release our data and code at
https://github.com/tsosea2/HurricaneSarc
Evaluating Subjective Cognitive Appraisals of Emotions from Large Language Models
The emotions we experience involve complex processes; besides physiological
aspects, research in psychology has studied cognitive appraisals where people
assess their situations subjectively, according to their own values (Scherer,
2005). Thus, the same situation can often result in different emotional
experiences. While the detection of emotion is a well-established task, there
is very limited work so far on the automatic prediction of cognitive
appraisals. This work fills the gap by presenting CovidET-Appraisals, the most
comprehensive dataset to date that assesses 24 appraisal dimensions, each with
a natural language rationale, across 241 Reddit posts. CovidET-Appraisals
presents an ideal testbed to evaluate the ability of large language models --
excelling at a wide range of NLP tasks -- to automatically assess and explain
cognitive appraisals. We found that while the best models are performant,
open-sourced LLMs fall short at this task, presenting a new challenge in the
future development of emotionally intelligent models. We release our dataset at
https://github.com/honglizhan/CovidET-Appraisals-Public.
Comment: EMNLP 2023 (Findings) Camera-Ready Version