Empirical Methodology for Crowdsourcing Ground Truth
The process of gathering ground truth data through human annotation is a
major bottleneck in the use of information extraction methods for populating
the Semantic Web. Crowdsourcing-based approaches are gaining popularity in the
attempt to solve the issues related to volume of data and lack of annotators.
Typically these practices use inter-annotator agreement as a measure of
quality. However, in many domains, such as event detection, there is ambiguity
in the data, as well as a multitude of perspectives of the information
examples. We present an empirically derived methodology for efficiently
gathering ground truth data in a diverse set of use cases covering a variety
of domains and annotation tasks. Central to our approach is the use of
CrowdTruth metrics that capture inter-annotator disagreement. We show that
measuring disagreement is essential for acquiring a high quality ground truth.
We achieve this by comparing the quality of the data aggregated with CrowdTruth
metrics with majority vote, over a set of diverse crowdsourcing tasks: Medical
Relation Extraction, Twitter Event Identification, News Event Extraction and
Sound Interpretation. We also show that an increased number of crowd workers
leads to growth and stabilization in the quality of annotations, going against
the usual practice of employing a small number of annotators.
Comment: in publication at the Semantic Web Journal
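The contrast between majority vote and disagreement-aware aggregation described in this abstract can be illustrated with a minimal sketch. This is not the CrowdTruth metric set itself; it only assumes that each item has a list of worker labels and shows what a majority vote discards. The label names and the two example items are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def label_scores(labels):
    """Fraction of workers choosing each label: a simple per-item disagreement signal."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


# Hypothetical annotations from ten workers on two items.
clear_item = ["event"] * 9 + ["no_event"]
ambiguous_item = ["event"] * 6 + ["no_event"] * 4

for name, labels in [("clear", clear_item), ("ambiguous", ambiguous_item)]:
    # Both items get the same majority label, but their score profiles differ.
    print(name, majority_vote(labels), label_scores(labels))
```

Both items receive the same majority label, while the per-label fractions keep the information that the second item is far more contested. Collecting more workers per item makes these fractions more stable, which is in line with the abstract's point about the number of annotators.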
The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation
Human variation in labeling is often considered noise. Annotation projects
for machine learning (ML) aim at minimizing human label variation, on the
assumption that this maximizes data quality and in turn optimizes machine
learning metrics. However, this conventional practice assumes that there exists
a ground truth, and neglects that there exists genuine human variation in
labeling due to disagreement, subjectivity in annotation or multiple plausible
answers. In this position paper, we argue that this big open problem of human
label variation persists and critically needs more attention to move our field
forward. This is because human label variation impacts all stages of the ML
pipeline: data, modeling and evaluation. However, few works consider all of
these dimensions jointly; and existing research is fragmented. We reconcile
different previously proposed notions of human label variation, provide a
repository of publicly-available datasets with un-aggregated labels, depict
approaches proposed so far, identify gaps and suggest ways forward. As datasets
are becoming increasingly available, we hope that this synthesized view on the
'problem' will lead to an open discussion on possible strategies to devise
fundamentally new directions.
Comment: EMNLP 202
Making the Most of Crowd Information: Learning and Evaluation in AI tasks with Disagreements.
PhD Theses
There is plenty of evidence that humans disagree on the interpretation of many
tasks in Natural Language Processing (NLP) and Computer Vision (CV), from objective
tasks rooted in linguistics such as part-of-speech tagging to more subjective (observer-dependent)
tasks such as classifying an image or deciding whether a proposition follows
from a certain premise. While most learning in Artificial Intelligence (AI) still relies
on the assumption that a single interpretation, captured by the gold label, exists for
each item, a growing research body in recent years has focused on learning methods
that do not rely on this assumption. Rather, they aim to learn ranges of truth amidst
disagreement. This PhD research makes a contribution to this field of study.
Firstly, we analytically review the evidence for disagreement on NLP and CV tasks,
focusing on tasks where substantial datasets with such information have been created.
As part of this review, we also discuss the most popular approaches to training
models from datasets containing multiple judgments and group these methods
together according to their handling of disagreement. Secondly, we make three proposals
for learning with disagreement: soft-loss, multi-task learning from gold and
crowds, and automatic temperature-scaled soft-loss. Thirdly, we address one gap in
this field of study – the prevalence of hard metrics for model evaluation even when
the gold assumption is shown to be an idealization – by proposing several previously
existing metrics and novel soft metrics that do not make this assumption and analyzing
the merits and assumptions of all the metrics, hard and soft. Finally, we carry
out a systematic investigation of the key proposals in learning with disagreement by
training them across several tasks, considering several ways to evaluate the resulting
models and assessing the conditions under which each approach is effective. This is
a key contribution of this research, as research in learning with disagreement does not
often test proposals across tasks, compare proposals with a variety of approaches, or
evaluate using both soft metrics and hard metrics.
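As a rough illustration of what training with soft labels involves, the sketch below computes a cross-entropy loss against the distribution of annotator votes and compares it with the loss against a single gold label. It is a generic formulation assumed for illustration, not the thesis's exact soft-loss; the class names, votes, and logits are hypothetical.

```python
import math


def soft_labels(votes, classes):
    """Normalize raw annotator votes into a probability distribution over classes."""
    total = len(votes)
    return [votes.count(c) / total for c in classes]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    """H(target, predicted); equals the usual hard loss when target is one-hot."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


classes = ["entailment", "neutral", "contradiction"]
votes = ["entailment", "entailment", "neutral", "entailment", "neutral"]  # hypothetical

target_soft = soft_labels(votes, classes)  # distribution over the five votes
target_hard = [1.0, 0.0, 0.0]              # gold/majority label as one-hot
pred = softmax([2.0, 1.5, -1.0])           # hypothetical model logits

print("soft loss:", cross_entropy(target_soft, pred))
print("hard loss:", cross_entropy(target_hard, pred))
```

The same cross-entropy between the human label distribution and the model output can also serve as a soft evaluation metric, since it does not presuppose a single correct label per item.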
The results obtained suggest, first of all, that it is essential to reach a consensus
on how to evaluate models. This is because the relative performance of the various
training methods is critically affected by the chosen form of evaluation. Secondly,
we observed a strong dataset effect. With substantial datasets, providing many judgments
by high-quality coders for each item, training directly with soft labels achieved
better results than training from aggregated or even gold labels. This result holds for
both hard and soft evaluation. But when the above conditions do not hold, leveraging
both gold and soft labels generally achieved the best results in the hard evaluation.
All datasets and models employed in this paper are freely available as supplementary
materials.
A Conceptual Probabilistic Framework for Annotation Aggregation of Citizen Science Data
Over the last decade, hundreds of thousands of volunteers have contributed to science by collecting or analyzing data. This public participation in science, also known as citizen science, has contributed to significant discoveries and led to publications in major scientific journals. However, little attention has been paid to data quality issues. In this work we argue that determining the accuracy of data obtained by crowdsourcing is a fundamental question and we point out that, for many real-life scenarios, mathematical tools and processes for the evaluation of data quality are missing. We propose a probabilistic methodology for the evaluation of the accuracy of labeling data obtained by crowdsourcing in citizen science. The methodology builds on an abstract probabilistic graphical model formalism, which is shown to generalize some already existing label aggregation models. We show how to make practical use of the methodology through a comparison of data obtained from different citizen science communities analyzing the earthquake that took place in Albania in 2019.
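To give a flavor of probabilistic label aggregation, here is a minimal sketch of a very simple model of the kind such frameworks generalize: a binary label, independent annotators, and a known accuracy per annotator. This is not the paper's graphical model; the function name, accuracies, and votes are assumptions made for illustration.

```python
def posterior_positive(votes, accuracies, prior=0.5):
    """P(true label = 1 | votes), assuming independent annotators with known accuracies."""
    like_pos = prior        # running joint probability for true label 1
    like_neg = 1.0 - prior  # running joint probability for true label 0
    for vote, acc in zip(votes, accuracies):
        like_pos *= acc if vote == 1 else 1.0 - acc
        like_neg *= acc if vote == 0 else 1.0 - acc
    return like_pos / (like_pos + like_neg)


# Hypothetical case: one reliable annotator says 1, two weaker annotators say 0.
votes = [1, 0, 0]
accuracies = [0.9, 0.6, 0.6]

print(posterior_positive(votes, accuracies))
```

In this example a majority vote would pick label 0, while the posterior favors label 1 because the dissenting annotator is assumed to be much more reliable. In practice the accuracies are unknown and have to be estimated together with the labels.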
Incorporating Clicks, Attention and Satisfaction into a Search Engine Result Page Evaluation Model
Modern search engine result pages often provide immediate value to users and
organize information in such a way that it is easy to navigate. The core
ranking function contributes to this and so do result snippets, smart
organization of result blocks and extensive use of one-box answers or side
panels. While they are useful to the user and help search engines to stand out,
such features present two big challenges for evaluation. First, the presence of
such elements on a search engine result page (SERP) may lead to the absence of
clicks that is nonetheless unrelated to dissatisfaction, the so-called "good
abandonments." Second, the non-linear layout and visual difference of SERP
items may lead to non-trivial patterns of user attention, which are not captured
by existing evaluation metrics.
In this paper we propose a model of user behavior on a SERP that jointly
captures click behavior, user attention and satisfaction, the CAS model, and
demonstrate that it gives more accurate predictions of user actions and
self-reported satisfaction than existing models based on clicks alone. We use
the CAS model to build a novel evaluation metric that can be applied to
non-linear SERP layouts and that can account for the utility that users obtain
directly on a SERP. We demonstrate that this metric shows better agreement with
user-reported satisfaction than conventional evaluation metrics.
Comment: CIKM2016, Proceedings of the 25th ACM International Conference on
Information and Knowledge Management. 201