Many questions that we ask about the world do not have a single clear answer,
yet typical human annotation setups in machine learning assume that a single
ground truth label exists for every example in every task. The divergence
between reality and practice is stark, especially for tasks with inherent
ambiguity or a wide range of plausible subjective judgments. Here,
we examine the implications of subjective human judgments in the behavioral
task of labeling images used to train machine vision models. We identify three
primary sources of ambiguity, arising from (i) how labels are depicted in the
images, (ii) raters' backgrounds, and (iii) the task definition. Based on our
empirical results, we suggest best practices for handling label ambiguity in
machine learning datasets.