168 research outputs found
"i have a feeling trump will win..................": Forecasting Winners and Losers from User Predictions on Twitter
Social media users often make explicit predictions about upcoming events.
Such statements vary in the degree of certainty the author expresses toward the
outcome: "Leonardo DiCaprio will win Best Actor" vs. "Leonardo DiCaprio may win"
or "No way Leonardo wins!". Can popular beliefs on social media predict who
will win? To answer this question, we build a corpus of tweets annotated for
veridicality on which we train a log-linear classifier that detects positive
veridicality with high precision. We then forecast uncertain outcomes using the
wisdom of crowds, by aggregating users' explicit predictions. Our method for
forecasting winners is fully automated, relying only on a set of contenders as
input. It requires no training data of past outcomes and outperforms sentiment
and tweet volume baselines on a broad range of contest prediction tasks. We
further demonstrate how our approach can be used to measure the reliability of
individual accounts' predictions and retrospectively identify surprise
outcomes. Comment: Accepted at EMNLP 2017 (long paper)
Investigating Reasons for Disagreement in Natural Language Inference
We investigate how disagreement in natural language inference (NLI)
annotation arises. We developed a taxonomy of disagreement sources with 10
categories spanning 3 high-level classes. We found that some disagreements are
due to uncertainty in the sentence meaning, others to annotator biases and task
artifacts, leading to different interpretations of the label distribution. We
explore two modeling approaches for detecting items with potential
disagreement: a 4-way classification with a "Complicated" label in addition to
the three standard NLI labels, and a multilabel classification approach. We
found that the multilabel classification is more expressive and gives better
recall of the possible interpretations in the data. Comment: accepted at TACL, pre-MIT Press publication version
The prosody of presupposition projection in naturally-occurring utterances
In experimental studies, prosodically-marked pragmatic focus has been found to influence the projection of factive presuppositions of utterances like "these parents didn’t know the kid was gone" (Cummins and Rohde, 2015; Tonhauser, 2016; Djärv and Bacovcin, 2017), supporting question-based analyses of projection (i.a., Abrusán, 2011; Abrusán, 2016; Simons et al., 2017; Beaver et al., 2017). However, no prior work has explored whether this effect extends to naturally-occurring utterances. In a large set of naturally-occurring utterances, we find that prosodically-marked focus influences projection in utterances with factive embedding predicates, but not those with non-factive predicates. We argue that our findings support an account where lexical semantics of the predicate contributes to projection to the extent that they admit QUD alternatives that can be assumed to entail the content of the complement.
Ecologically Valid Explanations for Label Variation in NLI
Human label variation, or annotation disagreement, exists in many natural
language processing (NLP) tasks, including natural language inference (NLI). To
gain direct evidence of how NLI label variation arises, we build LiveNLI, an
English dataset of 1,415 ecologically valid explanations (annotators explain
the NLI labels they chose) for 122 MNLI items (at least 10 explanations per
item). The LiveNLI explanations confirm that people can systematically vary on
their interpretation and highlight within-label variation: annotators sometimes
choose the same label for different reasons. This suggests that explanations
are crucial for navigating label interpretations in general. We few-shot prompt
large language models to generate explanations, but the results are
inconsistent: they sometimes produce valid and informative explanations, but
they also generate implausible ones that do not support the label, highlighting
directions for improvement. Comment: Findings at EMNLP 2023. Overlap with previous version
arXiv:2304.1244
The Overall Markedness of Discourse Relations
Discourse relations can be categorized as continuous or discontinuous in the hypothesis of continuity
Challenges and solutions for Latin named entity recognition
Although spanning thousands of years and genres as diverse as liturgy, historiography, lyric and other forms of prose and poetry, the body of Latin texts is still relatively sparse compared to English. Data sparsity in Latin presents a number of challenges for traditional Named Entity
Recognition techniques. Solving such challenges and enabling reliable Named Entity Recognition in Latin texts can facilitate many down-stream applications, from machine translation to digital historiography, enabling Classicists, historians, and archaeologists for instance, to track
the relationships of historical persons, places, and groups on a large scale. This paper presents the first annotated corpus for evaluating Named Entity Recognition in Latin, as well as a fully supervised model that achieves over 90% F-score on a held-out test set, significantly outperforming a competitive baseline. We also present a novel active learning strategy that predicts how many and which sentences need to be annotated for named entities in order to attain a specified degree
of accuracy when recognizing named entities automatically in a given text. This maximizes the productivity of annotators while simultaneously controlling quality.
Conversion et améliorations de corpus du français annotés en Universal Dependencies
This paper describes an effort to improve the consistency of two French corpora annotated with the Universal Dependencies (UD) scheme. The Universal Dependencies project aims at building a syntactic dependency scheme which allows similar analyses for several different languages. We improved the annotations of the two French corpora to bring them closer to the UD scheme, and evaluated the changes made to the corpora in terms of closeness to the UD scheme as well as internal corpus consistency.
NeuralStory: an Interactive Multimedia System for Video Indexing and Re-use
In recent years, video has been swamping the Internet: websites, social networks, and business multimedia systems are adopting video as the most important form of communication and information. Videos are normally accessed as a whole and are not indexed by their visual content. Thus, they are often uploaded as short, manually cut clips with user-provided annotations, keywords and tags for retrieval.
In this paper, we propose a prototype multimedia system which addresses these two limitations: it overcomes the need for human intervention in the video setting, thanks to fully deep learning-based solutions, and decomposes the storytelling structure of the video into coherent parts. These parts can be shots, key-frames, scenes and semantically related stories, and are exploited to provide an automatic annotation of the visual content, so that parts of a video can be easily retrieved. This also allows a principled re-use of the video itself: users of the platform can produce new storytelling by means of multi-modal presentations, add text and other media, and propose a different visual organization of the content. We present the overall solution, and some experiments on the re-use capability of our platform in edutainment by conducting an extensive user evaluation.