There is plenty of evidence that humans disagree on the interpretation of many
tasks in Natural Language Processing (NLP) and Computer Vision (CV), from objective
tasks rooted in linguistics such as part-of-speech tagging to more subjective (observer-dependent)
tasks such as classifying an image or deciding whether a proposition follows
from a certain premise. While most learning in Artificial Intelligence (AI) still relies
on the assumption that a single interpretation, captured by the gold label, exists for
each item, a growing body of research in recent years has focused on learning methods
that do not rely on this assumption, aiming instead to learn ranges of truth amidst
disagreement. This PhD research contributes to that field of study.
Firstly, we analytically review the evidence for disagreement on NLP and CV tasks,
focusing on tasks where substantial datasets with such information have been created.
As part of this review, we also discuss the most popular approaches to training
models from datasets containing multiple judgments and group these methods
together according to their handling of disagreement. Secondly, we make three proposals
for learning with disagreement: soft-loss, multi-task learning from gold and
crowds, and automatic temperature-scaled soft-loss.
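To make these proposals concrete, the sketch below shows one plausible form of each: a soft-loss trains the model against the distribution of crowd judgments rather than a one-hot gold label, a temperature parameter flattens or sharpens that distribution, and the multi-task proposal combines a gold-supervised head with a crowd-supervised head. This is a minimal PyTorch sketch under assumed names (soft_labels_from_counts, TEMPERATURE, alpha); the exact formulations in the thesis may differ.

```python
import torch
import torch.nn.functional as F

TEMPERATURE = 2.0  # assumed fixed value; the thesis's automatic variant tunes this during training

def soft_labels_from_counts(counts: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Turn per-item annotation counts (n_items x n_classes) into a probability
    distribution; temperature > 1 flattens it, temperature < 1 sharpens it."""
    probs = counts / counts.sum(dim=-1, keepdim=True)
    scaled = torch.log(probs.clamp_min(1e-12)) / temperature  # rescale in log space
    return F.softmax(scaled, dim=-1)

def soft_loss(logits: torch.Tensor, counts: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the model's predicted distribution and the soft
    label distribution, replacing the usual one-hot gold target."""
    target = soft_labels_from_counts(counts, TEMPERATURE)
    return -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def multitask_loss(logits_gold: torch.Tensor, gold: torch.Tensor,
                   logits_crowd: torch.Tensor, counts: torch.Tensor,
                   alpha: float = 0.5) -> torch.Tensor:
    """Multi-task learning from gold and crowds: a shared encoder feeds two
    heads, one trained on gold class indices, one on soft crowd labels."""
    return alpha * F.cross_entropy(logits_gold, gold) + (1 - alpha) * soft_loss(logits_crowd, counts)
```

In the automatic temperature-scaled variant, TEMPERATURE would itself be adjusted during training rather than fixed by hand as above.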
Thirdly, we address one gap in this field of study – the prevalence of hard metrics
for model evaluation even when the gold assumption is shown to be an idealization –
by bringing together several previously existing metrics and proposing novel soft
metrics that do not make this assumption, and by analyzing the merits and assumptions
of all the metrics, hard and soft.
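To illustrate the distinction, the sketch below contrasts a hard metric, accuracy against the majority label, with one representative soft metric, the Jensen-Shannon divergence between the model's predicted distribution and the distribution of human judgments. This is a minimal NumPy illustration with assumed function names; the thesis analyzes a wider set of hard and soft metrics.

```python
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence: a symmetric, bounded distance between
    two probability distributions."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hard_accuracy(pred_probs: np.ndarray, human_probs: np.ndarray) -> float:
    """Hard metric: compare argmaxes, i.e., assume each item has a single
    correct answer (here, the majority judgment)."""
    return float((pred_probs.argmax(-1) == human_probs.argmax(-1)).mean())

def soft_score(pred_probs: np.ndarray, human_probs: np.ndarray) -> float:
    """Soft metric: mean JS divergence (lower is better); rewards matching
    the full distribution of human judgments, not just its mode."""
    return float(np.mean([js_divergence(p, h) for p, h in zip(pred_probs, human_probs)]))
```

The key difference is that the soft metric penalizes a model for being confidently wrong on items that humans themselves find ambiguous, whereas the hard metric checks only the mode of the distribution.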
Finally, we carry out a systematic investigation of the key proposals in learning
with disagreement by training them across several tasks, considering several ways to
evaluate the resulting models, and assessing the conditions under which each approach
is effective. This is a key contribution of this research, as work on learning with
disagreement does not often test proposals across tasks, compare proposals with a
variety of approaches, or evaluate using both soft metrics and hard metrics.
The results obtained suggest, first of all, that it is essential to reach a consensus
on how to evaluate models. This is because the relative performance of the various
training methods is critically affected by the chosen form of evaluation. Secondly,
we observed a strong dataset effect. With substantial datasets that provide many
judgments by high-quality coders for each item, training directly with soft labels
achieved better results than training from aggregated or even gold labels, under
both hard and soft evaluation. When these conditions do not hold, however, leveraging
both gold and soft labels generally achieved the best results under hard evaluation.
All datasets and models employed in this thesis are freely available as supplementary
materials.