Crowd disagreement about medical images is informative
Classifiers for medical image analysis are often trained with a single
consensus label, based on combining labels given by experts or crowds. However,
disagreement between annotators may be informative, and thus removing it may
not be the best strategy. As a proof of concept, we predict whether a skin
lesion from the ISIC 2017 dataset is a melanoma or not, based on crowd
annotations of visual characteristics of that lesion. We compare using the mean
of the annotations, which reflects consensus, with the standard deviation and
other distribution moments, which reflect disagreement. We show that the mean
annotations perform best, but that the disagreement measures are still
informative. We also make the crowd annotations used in this paper available at
\url{https://figshare.com/s/5cbbce14647b66286544}.
Comment: Accepted for publication at MICCAI LABELS 2018.
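As a rough sketch of the approach described above, the snippet below summarizes crowd ratings per lesion by their mean (consensus) and higher distribution moments (disagreement) and feeds them to a simple classifier. File names, column names and the choice of classifier are illustrative assumptions, not the authors' setup.

```python
# Hypothetical sketch: melanoma prediction from moments of crowd annotations.
# File and column names below are placeholders, not the released dataset's schema.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

ratings = pd.read_csv("crowd_annotations.csv")   # one row per (lesion_id, annotator), numeric ratings
labels = pd.read_csv("melanoma_labels.csv")      # one row per lesion_id with a binary 'melanoma' column

# Consensus (mean) and disagreement (std, skew) features per lesion and visual characteristic.
feature_cols = ["asymmetry", "border", "colour"]
moments = ratings.groupby("lesion_id")[feature_cols].agg(["mean", "std", "skew"])
moments.columns = ["_".join(c) for c in moments.columns]

data = moments.join(labels.set_index("lesion_id"), how="inner")
X, y = data.drop(columns="melanoma"), data["melanoma"]

auc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="roc_auc")
print("mean cross-validated AUC:", auc.mean())
```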
Empirical Methodology for Crowdsourcing Ground Truth
The process of gathering ground truth data through human annotation is a
major bottleneck in the use of information extraction methods for populating
the Semantic Web. Crowdsourcing-based approaches are gaining popularity in the
attempt to solve the issues related to volume of data and lack of annotators.
Typically these practices use inter-annotator agreement as a measure of
quality. However, in many domains, such as event detection, there is ambiguity
in the data, as well as a multitude of perspectives on the information
examples. We present an empirically derived methodology for efficiently
gathering ground truth data in a diverse set of use cases covering a variety
of domains and annotation tasks. Central to our approach is the use of
CrowdTruth metrics that capture inter-annotator disagreement. We show that
measuring disagreement is essential for acquiring a high quality ground truth.
We achieve this by comparing the quality of the data aggregated with CrowdTruth
metrics with majority vote, over a set of diverse crowdsourcing tasks: Medical
Relation Extraction, Twitter Event Identification, News Event Extraction and
Sound Interpretation. We also show that increasing the number of crowd workers
improves and stabilizes the quality of the annotations, contrary to the usual
practice of employing a small number of annotators.
Comment: In publication at the Semantic Web Journal.
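To make the contrast concrete, here is a small, hedged illustration of the difference between majority-vote aggregation and a disagreement-aware score in the spirit of the CrowdTruth unit-annotation metric; the real CrowdTruth metrics additionally weight workers and annotations by quality, which this toy example omits, and the data shapes are assumptions.

```python
# Simplified illustration (not the full CrowdTruth metrics): score each candidate
# annotation by the fraction of workers who chose it, instead of collapsing
# each unit to a single majority-vote label.
from collections import Counter

# worker_annotations: unit_id -> list of label sets, one set per worker (toy data)
worker_annotations = {
    "sent-1": [{"cause"}, {"cause", "treat"}, {"cause"}, {"treat"}],
    "sent-2": [{"treat"}, set(), {"treat"}, {"treat"}],
}

def majority_vote(answer_sets):
    counts = Counter(label for s in answer_sets for label in s)
    label, _ = counts.most_common(1)[0]
    return label

def unit_annotation_scores(answer_sets):
    """Fraction of workers selecting each label: keeps the disagreement signal."""
    counts = Counter(label for s in answer_sets for label in s)
    n = len(answer_sets)
    return {label: c / n for label, c in counts.items()}

for unit, answers in worker_annotations.items():
    print(unit, "majority:", majority_vote(answers), "scores:", unit_annotation_scores(answers))
```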
Precise Proximal Femur Fracture Classification for Interactive Training and Surgical Planning
We demonstrate the feasibility of a fully automatic computer-aided diagnosis
(CAD) tool, based on deep learning, that localizes and classifies proximal
femur fractures on X-ray images according to the AO classification. The
proposed framework aims to improve patient treatment planning and provide
support for the training of trauma surgeon residents. A database of 1347
clinical radiographic studies was collected. Radiologists and trauma surgeons
annotated all fractures with bounding boxes, and provided a classification
according to the AO standard. The proposed CAD tool classifies radiographs into
types "A", "B" and "not-fractured", reaching an F1-score of 87% and an AUC of
0.95; when classifying fractured versus not-fractured cases, these figures
improve to 94% and 0.98, respectively. Localizing the fracture first improves
performance over full-image classification, and 100% of the predicted
region-of-interest centers are contained in the manually provided bounding
boxes. The system retrieves on average 9 relevant images (from the same class)
out of 10 cases. Our CAD scheme localizes, detects and further classifies
proximal femur fractures, achieving results comparable to expert-level and
state-of-the-art performance. Our auxiliary localization model was highly
accurate at predicting the region of interest in the radiograph. We further
investigated several verification strategies for its adoption into the daily
clinical routine. A sensitivity analysis of the ROI size and image retrieval as
a clinical use case are presented.
Comment: Accepted at IPCAI 2020 and IJCARS.
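A minimal sketch of the two-stage idea the abstract describes, localizing the proximal-femur region and then classifying the crop, might look as follows; the specific model architectures, class list and crop size are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical two-stage pipeline: ROI detection followed by classification of the crop.
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None, num_classes=2)  # ROI vs background
classifier = torchvision.models.resnet50(weights=None)
classifier.fc = torch.nn.Linear(classifier.fc.in_features, 3)  # "A", "B", "not-fractured"

def classify_radiograph(image):
    """image: float tensor [3, H, W] in [0, 1]; returns a fracture-type string."""
    detector.eval()
    classifier.eval()
    with torch.no_grad():
        boxes = detector([image])[0]["boxes"]          # boxes are returned sorted by score
        if len(boxes) == 0:
            return "not-fractured"
        x1, y1, x2, y2 = boxes[0].int().tolist()       # crop the highest-scoring ROI
        crop = image[:, y1:y2, x1:x2].unsqueeze(0)
        crop = torch.nn.functional.interpolate(crop, size=(224, 224))
        logits = classifier(crop)
        return ["A", "B", "not-fractured"][logits.argmax(dim=1).item()]
```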
Towards Accountable AI: Hybrid Human-Machine Analyses for Characterizing System Failure
As machine learning systems move from computer-science laboratories into the
open world, their accountability becomes a high priority problem.
Accountability requires deep understanding of system behavior and its failures.
Current evaluation methods such as single-score error metrics and confusion
matrices provide aggregate views of system performance that hide important
shortcomings. Understanding details about failures is important for identifying
pathways for refinement, communicating the reliability of systems in different
settings, and specifying appropriate human oversight and engagement.
Characterization of failures and shortcomings is particularly complex for
systems composed of multiple machine learned components. For such systems,
existing evaluation methods have limited expressiveness in describing and
explaining the relationship among input content, the internal states of system
components, and final output quality. We present Pandora, a set of hybrid
human-machine methods and tools for describing and explaining system failures.
Pandora leverages both human and system-generated observations to summarize
conditions of system malfunction with respect to the input content and system
architecture. We share results of a case study with a machine learning pipeline
for image captioning that show how detailed performance views can be beneficial
for analysis and debugging.
Learning from disagreement: a survey
Many tasks in Natural Language Processing (NLP) and Computer Vision (CV) offer evidence that humans disagree, from objective tasks such as part-of-speech tagging to more subjective tasks such as classifying an image or deciding whether a proposition follows from certain premises. While most learning in artificial intelligence (AI) still relies on the assumption that a single (gold) interpretation exists for each item, a growing body of research aims to develop learning methods that do not rely on this assumption. In this survey, we review the evidence for disagreements on NLP and CV tasks, focusing on tasks for which substantial datasets containing this information have been created. We discuss the most popular approaches to training models from datasets containing multiple judgments potentially in disagreement. We systematically compare these different approaches by training them with each of the available datasets, considering several ways to evaluate the resulting models. Finally, we discuss the results in depth, focusing on four key research questions, and assess how the type of evaluation and the characteristics of a dataset determine the answers to these questions. Our results suggest, first of all, that even if we abandon the assumption of a gold standard, it is still essential to reach a consensus on how to evaluate models. This is because the relative performance of the various training methods is critically affected by the chosen form of evaluation. Secondly, we observed a strong dataset effect. With substantial datasets, providing many judgments by high-quality coders for each item, training directly with soft labels achieved better results than training from aggregated or even gold labels. This result holds for both hard and soft evaluation. But when the above conditions do not hold, leveraging both gold and soft labels generally achieved the best results in the hard evaluation. All datasets and models employed in this paper are freely available as supplementary materials.
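As a concrete example of one family of approaches surveyed here, the sketch below trains against soft labels, i.e. the per-item distribution of annotator judgments, rather than an aggregated hard label; the toy model, tensor shapes and loss formulation are illustrative assumptions, not any specific surveyed system.

```python
# Hypothetical sketch of soft-label training: cross-entropy against the
# distribution of annotator judgments instead of a single aggregated label.
import torch
import torch.nn.functional as F

num_classes = 3
model = torch.nn.Linear(128, num_classes)          # stand-in for any encoder plus classification head
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

features = torch.randn(32, 128)                    # toy batch of item representations
counts = torch.randint(0, 5, (32, num_classes)).float()          # per-item judgment counts
soft_labels = counts / counts.sum(dim=1, keepdim=True).clamp(min=1)  # normalized judgment distribution

logits = model(features)
# Cross-entropy against the soft distribution (equivalent to KL divergence up to a constant).
loss = -(soft_labels * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
loss.backward()
optimizer.step()
```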
ENHANCE (ENriching Health data by ANnotations of Crowd and Experts): A case study for skin lesion classification
We present ENHANCE, an open dataset with multiple annotations to complement
the existing ISIC and PH2 skin lesion classification datasets. This dataset
contains annotations of visual ABC (asymmetry, border, colour) features from
non-expert annotation sources: undergraduate students, crowd workers from
Amazon MTurk and classic image processing algorithms. In this paper we first
analyse the correlations between the annotations and the diagnostic label of
the lesion, as well as study the agreement between different annotation
sources. Overall we find weak correlations of non-expert annotations with the
diagnostic label, and low agreement between different annotation sources. We
then study multi-task learning (MTL) with the annotations as additional labels,
and show that non-expert annotations can improve (ensembles of)
state-of-the-art convolutional neural networks via MTL. We hope that our
dataset can be used in further research into multiple annotations and/or MTL.
All data and models are available on GitHub:
https://github.com/raumannsr/ENHANCE
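A minimal sketch of the multi-task learning setup mentioned above, assuming a shared backbone with a diagnosis head and an auxiliary head for the non-expert ABC annotations; the architecture, target encodings and loss weight are placeholders, not the released models.

```python
# Hypothetical multi-task model: shared CNN backbone, melanoma head, auxiliary ABC head.
import torch
import torchvision

class MultiTaskSkinModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = torch.nn.Identity()                 # expose 512-d features
        self.backbone = backbone
        self.diagnosis_head = torch.nn.Linear(512, 2)     # melanoma vs non-melanoma
        self.abc_head = torch.nn.Linear(512, 3)           # asymmetry, border, colour scores

    def forward(self, x):
        features = self.backbone(x)
        return self.diagnosis_head(features), self.abc_head(features)

model = MultiTaskSkinModel()
images = torch.randn(4, 3, 224, 224)                      # toy batch
diagnosis = torch.randint(0, 2, (4,))                     # diagnostic labels
abc_annotations = torch.rand(4, 3)                        # non-expert ABC ratings as regression targets

logits, abc_pred = model(images)
loss = torch.nn.functional.cross_entropy(logits, diagnosis) \
     + 0.5 * torch.nn.functional.mse_loss(abc_pred, abc_annotations)  # weighted auxiliary loss
loss.backward()
```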
Human Computation and Convergence
Humans are the most effective integrators and producers of information,
directly and through the use of information-processing inventions. As these
inventions become increasingly sophisticated, the substantive role of humans in
processing information will tend toward capabilities that derive from our most
complex cognitive processes, e.g., abstraction, creativity, and applied world
knowledge. Through the advancement of human computation - methods that leverage
the respective strengths of humans and machines in distributed
information-processing systems - formerly discrete processes will combine
synergistically into increasingly integrated and complex information processing
systems. These new, collective systems will exhibit an unprecedented degree of
predictive accuracy in modeling physical and techno-social processes, and may
ultimately coalesce into a single unified predictive organism, with the
capacity to address society's most wicked problems and achieve planetary
homeostasis.
Comment: Pre-publication draft of chapter. 24 pages, 3 figures; added
references to pages 1 and 3, and corrected typos.