Empirical Methodology for Crowdsourcing Ground Truth
The process of gathering ground truth data through human annotation is a
major bottleneck in the use of information extraction methods for populating
the Semantic Web. Crowdsourcing-based approaches are gaining popularity in the
attempt to solve the issues related to volume of data and lack of annotators.
Typically these practices use inter-annotator agreement as a measure of
quality. However, in many domains, such as event detection, there is ambiguity
in the data, as well as a multitude of perspectives on the information
examples. We present an empirically derived methodology for efficiently
gathering ground truth data in a diverse set of use cases covering a variety
of domains and annotation tasks. Central to our approach is the use of
CrowdTruth metrics that capture inter-annotator disagreement. We show that
measuring disagreement is essential for acquiring a high quality ground truth.
We achieve this by comparing the quality of the data aggregated with CrowdTruth
metrics with majority vote, over a set of diverse crowdsourcing tasks: Medical
Relation Extraction, Twitter Event Identification, News Event Extraction and
Sound Interpretation. We also show that an increased number of crowd workers
leads to growth and stabilization in the quality of annotations, going against
the usual practice of employing a small number of annotators.
Comment: in publication at the Semantic Web Journal
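To make the contrast between majority vote and disagreement-aware aggregation concrete, below is a minimal Python sketch, loosely in the spirit of the CrowdTruth-style scores described above. The binary annotation matrix, the function names, and the cosine-based score are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def majority_vote(votes):
    """Hard 0/1 decision for one candidate annotation by simple majority."""
    votes = np.asarray(votes)
    return int(votes.sum() * 2 >= len(votes))

def disagreement_aware_scores(annotation_matrix):
    """Graded support scores for all candidate annotations on one item.

    annotation_matrix: (n_workers, n_annotations) binary matrix, one row per
    worker. The item vector is the column-wise sum of worker vectors; each
    annotation's score is the cosine similarity between its basis vector and
    the item vector, so ambiguous items get graded scores rather than a hard
    accept/reject decision.
    """
    A = np.asarray(annotation_matrix, dtype=float)
    item_vec = A.sum(axis=0)
    norm = np.linalg.norm(item_vec)
    return item_vec / norm if norm > 0 else np.zeros_like(item_vec)

# Toy item: 7 workers choosing among 3 candidate annotations.
A = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0],
     [1, 0, 0], [0, 1, 1], [1, 0, 0]]
print(majority_vote([row[0] for row in A]))   # -> 1 (hard decision)
print(disagreement_aware_scores(A))           # -> graded scores per annotation
```

The graded scores retain the ambiguity signal that a majority vote discards, which is the property the abstract argues is essential for a high-quality ground truth.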
SAVOIAS: A Diverse, Multi-Category Visual Complexity Dataset
Visual complexity identifies the level of intricacy and detail in an image,
or how difficult the image is to describe. It is an important concept in
a variety of areas such as cognitive psychology, computer vision and
visualization, and advertisement. Yet, efforts to create large, downloadable
image datasets with diverse content and unbiased groundtruthing are lacking. In
this work, we introduce Savoias, a visual complexity dataset comprising more
than 1,400 images from seven image categories relevant to the above
research areas, namely Scenes, Advertisements, Visualization and infographics,
Objects, Interior design, Art, and Suprematism. The images in each category
portray diverse characteristics including various low-level and high-level
features, objects, backgrounds, textures and patterns, text, and graphics. The
ground truth for Savoias is obtained by crowdsourcing more than 37,000 pairwise
comparisons of images using the forced-choice methodology and with more than
1,600 contributors. The resulting relative scores are then converted to
absolute visual complexity scores using the Bradley-Terry method and matrix
completion. When applying five state-of-the-art algorithms to analyze the
visual complexity of the images in the Savoias dataset, we found that the
scores obtained from these baseline tools only correlate well with crowdsourced
labels for abstract patterns in the Suprematism category (Pearson correlation
r=0.84). For the other categories, in particular the Objects and Advertisements
categories, the correlation coefficients were low (r=0.3 and 0.56,
respectively). These findings suggest that (1) state-of-the-art approaches are
mostly insufficient and (2) Savoias enables category-specific method
development, which is likely to improve the impact of visual complexity
analysis on specific application areas, including computer vision.
Comment: 10 pages, 4 figures, 4 tables
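As a rough illustration of how relative pairwise judgments can be turned into absolute scores, here is a minimal Bradley-Terry sketch using the standard minorization-maximization update. The `wins` matrix and function name are illustrative, and the sketch omits the matrix-completion step the abstract mentions for unobserved pairs.

```python
import numpy as np

def bradley_terry_scores(wins, n_iters=200):
    """Estimate Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of times item i was judged more complex than item j.
    Uses the classic MM update and returns strengths normalized to sum to 1.
    """
    W = np.asarray(wins, dtype=float)
    n = W.shape[0]
    comparisons = W + W.T            # total comparisons per pair
    wins_total = W.sum(axis=1)       # total wins per item
    p = np.ones(n) / n
    for _ in range(n_iters):
        new_p = np.empty(n)
        for i in range(n):
            mask = np.arange(n) != i
            denom = np.sum(comparisons[i, mask] / (p[i] + p[mask]))
            new_p[i] = wins_total[i] / max(denom, 1e-12)
        p = new_p / new_p.sum()
    return p

# Toy example: image 0 is usually judged more complex than 1, and 1 than 2.
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]])
print(bradley_terry_scores(wins))   # strengths ordered 0 > 1 > 2
```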
Learning From Noisy Singly-labeled Data
Supervised learning depends on annotated examples, which are taken to be the
\emph{ground truth}. But these labels often come from noisy crowdsourcing
platforms, like Amazon Mechanical Turk. Practitioners typically collect
multiple labels per example and aggregate the results to mitigate noise (the
classic crowdsourcing problem). Given a fixed annotation budget and unlimited
unlabeled data, redundant annotation comes at the expense of fewer labeled
examples. This raises two fundamental questions: (1) How can we best learn from
noisy workers? (2) How should we allocate our labeling budget to maximize the
performance of a classifier? We propose a new algorithm for jointly modeling
labels and worker quality from noisy crowd-sourced data. The alternating
minimization proceeds in rounds, estimating worker quality from disagreement
with the current model and then updating the model by optimizing a loss
function that accounts for the current estimate of worker quality. Unlike
previous approaches, even with only one annotation per example, our algorithm
can estimate worker quality. We establish a generalization error bound for
models learned with our algorithm and show theoretically that it is better to
label many examples once (rather than fewer examples multiple times) when
worker quality is above a
threshold. Experiments conducted on both ImageNet (with simulated noisy
workers) and MS-COCO (using the real crowdsourced labels) confirm our
algorithm's benefits.
Comment: 18 pages, 3 figures
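A minimal sketch of the alternating scheme described above, assuming a single binary label per example, integer worker ids, and a scikit-learn logistic-regression classifier; the agreement-based quality estimate and the simple sample weighting are simplifications, not the authors' exact loss.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def alternating_minimization(X, labels, workers, n_rounds=5):
    """Jointly estimate a classifier and per-worker quality weights.

    X: (n, d) features; labels: (n,) binary crowd labels, one per example;
    workers: (n,) integer id of the worker who provided each label.
    Each round: (1) score each worker by how often their labels agree with
    the current model's predictions, (2) refit the model with per-example
    weights equal to the estimated quality of the labeling worker.
    """
    X, labels, workers = map(np.asarray, (X, labels, workers))
    n_workers = workers.max() + 1
    model = LogisticRegression().fit(X, labels)
    quality = np.ones(n_workers)
    for _ in range(n_rounds):
        agree = (model.predict(X) == labels).astype(float)
        # Smoothed per-worker agreement rate with the current model.
        quality = np.array([
            (agree[workers == w].sum() + 1.0) / ((workers == w).sum() + 2.0)
            for w in range(n_workers)
        ])
        model = LogisticRegression().fit(X, labels,
                                         sample_weight=quality[workers])
    return model, quality
```

Because worker quality is read off from agreement with the current model rather than from label redundancy, a single annotation per example suffices for the estimate, which is the point the abstract emphasizes.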