50 research outputs found
A Crowdsourced Corpus of Multiple Judgments and Disagreement on Anaphoric Interpretation
We present a corpus of anaphoric information (coreference) crowdsourced through a game-with-a-purpose. The corpus, containing annotations for about 108,000 markables, is one of the largest corpora for coreference for English, and one of the largest crowdsourced NLP corpora, but its main feature is the large number of judgments per markable: 20 on average, and over 2.2M in total. This characteristic makes the corpus a unique resource for the study of disagreements on anaphoric interpretation. A second distinctive feature is its rich annotation scheme, covering singletons, expletives, and split-antecedent plurals. Finally, the corpus also comes with labels inferred using a recently proposed probabilistic model of annotation for coreference. The labels are of high quality and make it possible to successfully train a state-of-the-art coreference resolver, including training on singletons and non-referring expressions. The annotation model can also result in more than one label, or no label, being proposed for a markable, thus serving as a baseline method for automatically identifying ambiguous markables. A preliminary analysis of the results is presented.
Learning from disagreement: a survey
Many tasks in Natural Language Processing (nlp) and Computer Vision (cv) offer evidence that humans disagree, from objective tasks such as part-of-speech tagging to more subjective tasks such as classifying an image or deciding whether a proposition follows from certain premises. While most learning in artificial intelligence (ai) still relies on the assumption that a single (gold) interpretation exists for each item, a growing body of research aims to develop learning methods that do not rely on this assumption. In this survey, we review the evidence for disagreements on nlp and cv tasks, focusing on tasks for which substantial datasets containing this information have been created. We discuss the most popular approaches to training models from datasets containing multiple judgments potentially in disagreement. We systematically compare these different approaches by training them with each of the available datasets, considering several ways to evaluate the resulting models. Finally, we discuss the results in depth, focusing on four key research questions, and assess how the type of evaluation and the characteristics of a dataset determine the answers to these questions. Our results suggest, first of all, that even if we abandon the assumption of a gold standard, it is still essential to reach a consensus on how to evaluate models. This is because the relative performance of the various training methods is critically affected by the chosen form of evaluation. Secondly, we observed a strong dataset effect. With substantial datasets, providing many judgments by high-quality coders for each item, training directly with soft labels achieved better results than training from aggregated or even gold labels. This result holds for both hard and soft evaluation. But when the above conditions do not hold, leveraging both gold and soft labels generally achieved the best results in the hard evaluation. 
All datasets and models employed in this paper are freely available as supplementary materials.
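The contrast the abstract draws between training from aggregated labels and training directly with soft labels can be illustrated with a minimal sketch. It is not the survey's implementation: the function names (`soft_label`, `majority_label`, `soft_cross_entropy`) are hypothetical, and it assumes soft labels are the normalized counts of annotator judgments, which is one common choice rather than one confirmed by the abstract.

```python
import math
from collections import Counter


def soft_label(judgments, classes):
    """Turn raw annotator judgments into a probability distribution.

    Assumption: the soft label is the normalized vote count per class.
    """
    counts = Counter(judgments)
    total = len(judgments)
    return [counts[c] / total for c in classes]


def majority_label(judgments):
    """Aggregate judgments into a single hard label by majority vote.

    Ties are broken by first occurrence, which is arbitrary.
    """
    return Counter(judgments).most_common(1)[0][0]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def soft_cross_entropy(logits, target):
    """Cross-entropy between model predictions and a target distribution.

    With a one-hot target this reduces to the usual hard-label loss.
    """
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


# Hypothetical item judged by four annotators.
classes = ["A", "B"]
judgments = ["A", "A", "B", "A"]
target_soft = soft_label(judgments, classes)   # keeps the disagreement
target_hard = majority_label(judgments)        # discards the disagreement
loss = soft_cross_entropy([2.0, 0.5], target_soft)
```

The soft target keeps the information that one annotator in four chose `B`, while majority voting reduces the same item to `A` alone.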
Empirical Methodology for Crowdsourcing Ground Truth
The process of gathering ground truth data through human annotation is a
major bottleneck in the use of information extraction methods for populating
the Semantic Web. Crowdsourcing-based approaches are gaining popularity in the
attempt to solve the issues related to volume of data and lack of annotators.
Typically these practices use inter-annotator agreement as a measure of
quality. However, in many domains, such as event detection, there is ambiguity
in the data, as well as a multitude of perspectives of the information
examples. We present an empirically derived methodology for efficiently
gathering ground truth data in a diverse set of use cases covering a variety
of domains and annotation tasks. Central to our approach is the use of
CrowdTruth metrics that capture inter-annotator disagreement. We show that
measuring disagreement is essential for acquiring a high quality ground truth.
We achieve this by comparing the quality of the data aggregated with CrowdTruth
metrics with majority vote, over a set of diverse crowdsourcing tasks: Medical
Relation Extraction, Twitter Event Identification, News Event Extraction and
Sound Interpretation. We also show that an increased number of crowd workers
leads to growth and stabilization in the quality of annotations, going against
the usual practice of employing a small number of annotators. Comment: in publication at the Semantic Web Journal
Making the Most of Crowd Information: Learning and Evaluation in AI tasks with Disagreements.
PhD Theses
There is plenty of evidence that humans disagree on the interpretation of many
tasks in Natural Language Processing (nlp) and Computer Vision (cv), from objective
tasks rooted in linguistics such as part-of-speech tagging to more subjective (observer-dependent)
tasks such as classifying an image or deciding whether a proposition follows
from a certain premise. While most learning in Artificial Intelligence (ai) still relies
on the assumption that a single interpretation, captured by the gold label, exists for
each item, a growing research body in recent years has focused on learning methods
that do not rely on this assumption. Rather, they aim to learn ranges of truth amidst
disagreement. This PhD research makes a contribution to this field of study.
Firstly, we analytically review the evidence for disagreement on nlp and cv tasks,
focusing on tasks where substantial datasets with such information have been created.
As part of this review, we also discuss the most popular approaches to training
models from datasets containing multiple judgments and group these methods
together according to their handling of disagreement. Secondly, we make three proposals
for learning with disagreement: soft-loss, multi-task learning from gold and
crowds, and automatic temperature-scaled soft-loss. Thirdly, we address one gap in
this field of study – the prevalence of hard metrics for model evaluation even when
the gold assumption is shown to be an idealization – by proposing several previously
existing metrics and novel soft metrics that do not make this assumption and analyzing
the merits and assumptions of all the metrics, hard and soft. Finally, we carry
out a systematic investigation of the key proposals in learning with disagreement by
training them across several tasks, considering several ways to evaluate the resulting
models and assessing the conditions under which each approach is effective. This is
a key contribution of this research, as research on learning with disagreement does not
often test proposals across tasks, compare proposals with a variety of approaches, or
evaluate using both soft metrics and hard metrics.
The results obtained suggest, first of all, that it is essential to reach a consensus
on how to evaluate models. This is because the relative performance of the various
training methods is critically affected by the chosen form of evaluation. Secondly,
we observed a strong dataset effect. With substantial datasets, providing many judgments
by high-quality coders for each item, training directly with soft labels achieved
better results than training from aggregated or even gold labels. This result holds for
both hard and soft evaluation. But when the above conditions do not hold, leveraging
both gold and soft labels generally achieved the best results in the hard evaluation.
All datasets and models employed in this paper are freely available as supplementary
materials.
An Annotated Corpus of Reference Resolution for Interpreting Common Grounding
Common grounding is the process of creating, repairing and updating mutual
understandings, which is a fundamental aspect of natural language conversation.
However, interpreting the process of common grounding is a challenging task,
especially under continuous and partially-observable context where complex
ambiguity, uncertainty, partial understandings and misunderstandings are
introduced. Interpretation becomes even more challenging when we deal with
dialogue systems which still have limited capability of natural language
understanding and generation. To address this problem, we consider reference
resolution as the central subtask of common grounding and propose a new
resource to study its intermediate process. Based on a simple and general
annotation schema, we collected a total of 40,172 referring expressions in
5,191 dialogues curated from an existing corpus, along with multiple judgements
of referent interpretations. We show that our annotation is highly reliable,
captures the complexity of common grounding through a natural degree of
reasonable disagreements, and allows for more detailed and quantitative
analyses of common grounding strategies. Finally, we demonstrate the advantages
of our annotation for interpreting, analyzing and improving common grounding in
baseline dialogue systems. Comment: 9 pages, 7 figures, 6 tables, Accepted by AAAI 202
SemEval-2021 Task 12: Learning with Disagreements
Disagreement between coders is ubiquitous in virtually all datasets annotated with human judgements in both natural language processing and computer vision. However, most supervised machine learning methods assume that a single preferred interpretation exists for each item, which is at best an idealization. The aim of the SemEval-2021 shared task on learning with disagreements (Le-Wi-Di) was to provide a unified testing framework for methods for learning from data containing multiple and possibly contradictory annotations covering the best-known datasets containing information about disagreements for interpreting language and classifying images. In this paper, we describe the shared task and its results.