Comparing Bayesian Models of Annotation
The analysis of crowdsourced annotations in NLP is concerned with identifying 1) gold standard labels, 2) annotator accuracies and biases, and 3) item difficulties and error patterns. Traditionally, majority voting was used for 1), and coefficients of agreement for 2) and 3). Lately, model-based analyses of corpus annotations have proven better at all three tasks. But there has been relatively little work comparing them on the same datasets. This paper aims to fill this gap by analyzing six models of annotation, covering different approaches to annotator ability, item difficulty, and parameter pooling (tying) across annotators and items. We evaluate these models along four aspects: comparison to gold labels, predictive accuracy for new annotations, annotator characterization, and item difficulty, using four datasets with varying degrees of noise in the form of random (spammy) annotators. We conclude with guidelines for model selection, application, and implementation.
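The models compared here are described only at a high level; as a concrete, minimal illustration of the family, the sketch below fits a Dawid-Skene-style aggregator with per-annotator confusion matrices using EM point estimates rather than full Bayesian inference. The function name, data layout, and EM procedure are assumptions for illustration, not the paper's models or code.

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """EM aggregation of crowd labels with per-annotator confusion matrices.

    labels: list of (item_id, annotator_id, label) triples, labels in range(n_classes).
    Returns (item posteriors over classes, per-annotator confusion matrices).
    """
    items = sorted({i for i, _, _ in labels})
    annotators = sorted({a for _, a, _ in labels})
    item_idx = {i: k for k, i in enumerate(items)}
    ann_idx = {a: k for k, a in enumerate(annotators)}

    # Initialise item posteriors with soft majority voting.
    T = np.zeros((len(items), n_classes))
    for i, a, l in labels:
        T[item_idx[i], l] += 1.0
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and confusion matrices (rows: true class, cols: observed label).
        priors = T.mean(axis=0)
        conf = np.full((len(annotators), n_classes, n_classes), 1e-6)
        for i, a, l in labels:
            conf[ann_idx[a], :, l] += T[item_idx[i]]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: recompute item posteriors under the current parameters.
        log_T = np.tile(np.log(priors + 1e-12), (len(items), 1))
        for i, a, l in labels:
            log_T[item_idx[i]] += np.log(conf[ann_idx[a], :, l])
        T = np.exp(log_T - log_T.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)

    return T, conf
```

The six models in the paper build on this basic idea with priors over the parameters, pooling (tying) across annotators and items, and item-difficulty terms.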
Beyond Black & White: Leveraging Annotator Disagreement via Soft-Label Multi-Task Learning
Supervised learning assumes that a ground truth label exists. However, the reliability of this ground truth depends on human annotators, who often disagree. Prior work has shown that this disagreement can be helpful in training models. We propose a novel method to incorporate this disagreement as information: in addition to the standard error computation, we use soft labels (i.e., probability distributions over the annotator labels) as an auxiliary task in a multi-task neural network. We measure the divergence between the predictions and the target soft labels with several loss functions and evaluate the models on various NLP tasks. We find that the soft-label prediction auxiliary task reduces the penalty for errors on ambiguous entities and thereby mitigates overfitting. It significantly improves performance across tasks beyond the standard approach and prior work.
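A minimal sketch of how such a soft-label auxiliary task could be wired into a multi-task network, assuming PyTorch; the class, the loss weighting, and the choice of KL divergence are illustrative assumptions (the paper evaluates several divergence losses and tasks), not the authors' implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class SoftLabelMultiTask(nn.Module):
    """Shared encoder with a hard-label head and a soft-label auxiliary head."""

    def __init__(self, encoder, hidden_dim, n_classes):
        super().__init__()
        self.encoder = encoder                      # any module mapping inputs to hidden_dim
        self.hard_head = nn.Linear(hidden_dim, n_classes)
        self.soft_head = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):
        h = self.encoder(x)
        return self.hard_head(h), self.soft_head(h)

def multitask_loss(hard_logits, soft_logits, gold, soft_targets, alpha=0.5):
    """Standard cross-entropy plus a divergence between the predicted and the
    annotator label distributions (KL divergence here, as one possible choice)."""
    ce = F.cross_entropy(hard_logits, gold)
    kl = F.kl_div(F.log_softmax(soft_logits, dim=-1), soft_targets,
                  reduction="batchmean")
    return ce + alpha * kl
```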
CHAMP: Efficient Annotation and Consolidation of Cluster Hierarchies
Various NLP tasks require a complex hierarchical structure over nodes, where
each node is a cluster of items. Examples include generating entailment graphs,
hierarchical cross-document coreference resolution, annotating event and
subevent relations, etc. To enable efficient annotation of such hierarchical
structures, we release CHAMP, an open-source tool for incrementally
constructing both clusters and their hierarchy simultaneously over any type of text.
This incremental approach significantly reduces annotation time compared to the
common pairwise annotation approach and also guarantees maintaining
transitivity at the cluster and hierarchy levels. Furthermore, CHAMP includes a
consolidation mode, where an adjudicator can easily compare multiple cluster
hierarchy annotations and resolve disagreements.
Comment: EMNLP 202
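As a rough illustration of why an incremental cluster-and-hierarchy structure keeps transitivity by construction, the sketch below maintains clusters as item sets with parent links between them; every name here is invented for illustration, and this is not CHAMP's implementation.

```python
class ClusterHierarchy:
    """Clusters of items plus parent links between clusters, built incrementally.

    Membership and ancestry are always derived from the current structure, so
    transitivity holds at both the cluster and hierarchy levels by construction.
    """

    def __init__(self):
        self.cluster_of = {}   # item -> cluster id
        self.members = {}      # cluster id -> set of items
        self.parent = {}       # cluster id -> parent cluster id
        self._next_id = 0

    def add_item(self, item):
        if item not in self.cluster_of:
            cid, self._next_id = self._next_id, self._next_id + 1
            self.cluster_of[item] = cid
            self.members[cid] = {item}
        return self.cluster_of[item]

    def merge(self, a, b):
        """Place two items in the same cluster."""
        ca, cb = self.add_item(a), self.add_item(b)
        if ca == cb:
            return ca
        for item in self.members.pop(cb):
            self.cluster_of[item] = ca
            self.members[ca].add(item)
        # Re-point hierarchy links that referenced the absorbed cluster
        # (corner cases such as merging a cluster with its own parent are
        # not handled in this sketch).
        if cb in self.parent:
            self.parent.setdefault(ca, self.parent.pop(cb))
        for child, par in list(self.parent.items()):
            if par == cb:
                self.parent[child] = ca
        return ca

    def attach(self, child_item, parent_item):
        """Make the child's cluster a sub-cluster of the parent's cluster."""
        child, parent = self.add_item(child_item), self.add_item(parent_item)
        if child != parent and child not in self._ancestors(parent):
            self.parent[child] = parent

    def ancestors_of(self, item):
        """All clusters above the item's cluster (ancestry is transitive)."""
        return self._ancestors(self.add_item(item))

    def _ancestors(self, cid):
        chain = []
        while cid in self.parent:
            cid = self.parent[cid]
            chain.append(cid)
        return chain
```

Because membership and ancestry are always read off one shared structure, intransitive annotations cannot arise, something that independent pairwise decisions would otherwise need a separate consistency check to guarantee.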
Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks
Labelled data is the foundation of most natural language processing tasks.
However, labelling data is difficult and there often are diverse valid beliefs
about what the correct data labels should be. So far, dataset creators have
acknowledged annotator subjectivity, but rarely actively managed it in the
annotation process. This has led to partly-subjective datasets that fail to
serve a clear downstream use. To address this issue, we propose two contrasting
paradigms for data annotation. The descriptive paradigm encourages annotator
subjectivity, whereas the prescriptive paradigm discourages it. Descriptive
annotation allows for the surveying and modelling of different beliefs, whereas
prescriptive annotation enables the training of models that consistently apply
one belief. We discuss benefits and challenges in implementing both paradigms,
and argue that dataset creators should explicitly aim for one or the other to
facilitate the intended use of their dataset. Lastly, we conduct an annotation
experiment using hate speech data that illustrates the contrast between the two
paradigms.
Comment: Accepted at NAACL 2022 (Main Conference)
A Bayesian Approach for Sequence Tagging with Crowds
Current methods for sequence tagging, a core task in NLP, are data hungry,
which motivates the use of crowdsourcing as a cheap way to obtain labelled
data. However, annotators are often unreliable and current aggregation methods
cannot capture common types of span annotation errors. To address this, we
propose a Bayesian method for aggregating sequence tags that reduces errors by
modelling sequential dependencies between the annotations as well as the
ground-truth labels. By taking a Bayesian approach, we account for uncertainty
in the model due to both annotator errors and the lack of data for modelling
annotators who complete few tasks. We evaluate our model on crowdsourced data
for named entity recognition, information extraction and argument mining,
showing that our sequential model outperforms the previous state of the art. We
also find that our approach can reduce crowdsourcing costs through more
effective active learning, as it better captures uncertainty in the sequence
labels when there are few annotations.
Comment: Accepted for EMNLP 201
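The sketch below is not the paper's Bayesian model but a simplified, fixed-parameter illustration of the sequential-dependency idea: given assumed per-annotator confusion matrices and a transition matrix over true tags, it decodes a MAP tag sequence with Viterbi. In the paper these quantities instead receive priors and are inferred jointly, which is where the robustness to annotators who complete few tasks comes from.

```python
import numpy as np

def map_sequence(annotations, confusion, transition, prior):
    """Viterbi MAP decoding of the true tag sequence from several annotators.

    annotations: dict annotator_id -> list of observed tag ids (one per token)
    confusion:   dict annotator_id -> (n_tags, n_tags) array; rows are true
                 tags, columns are the tags that annotator tends to produce
    transition:  (n_tags, n_tags) transition probabilities between true tags
    prior:       (n_tags,) initial distribution over true tags
    """
    n_tokens = len(next(iter(annotations.values())))
    n_tags = len(prior)
    log_trans = np.log(transition)

    def emission(t):
        # Log-likelihood of every annotator's label at token t, per true tag.
        e = np.zeros(n_tags)
        for a, seq in annotations.items():
            e += np.log(confusion[a][:, seq[t]])
        return e

    delta = np.log(prior) + emission(0)
    back = np.zeros((n_tokens, n_tags), dtype=int)
    for t in range(1, n_tokens):
        scores = delta[:, None] + log_trans     # (previous tag, current tag)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + emission(t)

    # Trace the best path backwards from the best final tag.
    path = [int(delta.argmax())]
    for t in range(n_tokens - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

For BIO-style tagging, the transition matrix can assign near-zero probability to invalid transitions such as an I- tag directly after O, which is exactly the kind of span-level error structure that token-independent aggregation ignores.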
HuCurl: Human-induced Curriculum Discovery
We introduce the problem of curriculum discovery and describe a curriculum
learning framework capable of discovering effective curricula in a curriculum
space based on prior knowledge about sample difficulty. Using annotation
entropy and loss as measures of difficulty, we show that (i) the
top-performing discovered curricula for a given model and dataset are often
non-monotonic, unlike the monotonic curricula in the existing literature,
(ii) the prevailing easy-to-hard or hard-to-easy transition curricula often
risk underperforming, and (iii) curricula discovered for smaller datasets
and models perform well on larger datasets and models, respectively.
The proposed framework encompasses some of the existing curriculum learning
approaches and can discover curricula that outperform them across several NLP
tasks.
Comment: In Proceedings of the 61st Annual Meeting of the Association for
Computational Linguistics (ACL)
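As a small, assumption-laden illustration of the difficulty measures and of what a non-monotonic curriculum looks like operationally, the sketch below orders training items by annotation entropy according to a hand-written schedule. It is not HuCurl's discovery procedure, which searches a curriculum space rather than taking the schedule as given.

```python
import numpy as np

def annotation_entropy(label_counts):
    """Per-item difficulty: entropy of the annotator label distribution.

    label_counts: (n_items, n_classes) array of annotation counts per item.
    """
    p = label_counts / label_counts.sum(axis=1, keepdims=True)
    return -np.sum(np.where(p > 0, p * np.log(p), 0.0), axis=1)

def curriculum_order(difficulty, schedule):
    """Order training items according to a (possibly non-monotonic) curriculum.

    difficulty: (n_items,) difficulty scores, e.g. annotation entropy or loss.
    schedule:   list of (fraction, direction) stages; 'easy' takes the easiest
                remaining items, 'hard' the hardest.
    """
    remaining = list(np.argsort(difficulty))          # easiest -> hardest
    n, order = len(remaining), []
    for fraction, direction in schedule:
        k = min(int(round(fraction * n)), len(remaining))
        chunk = remaining[:k] if direction == "easy" else remaining[len(remaining) - k:]
        order.extend(chunk)
        taken = set(chunk)
        remaining = [i for i in remaining if i not in taken]
    order.extend(remaining)                           # anything the schedule left over
    return order

# A monotonic easy-to-hard curriculum vs. a non-monotonic one (scores is a
# hypothetical array of per-item difficulties):
# easy_to_hard = curriculum_order(scores, [(1.0, "easy")])
# non_monotonic = curriculum_order(scores, [(0.4, "hard"), (0.3, "easy"), (0.3, "hard")])
```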