Influence of study design on digital pathology image quality evaluation: the need to define a clinical task
Despite the current rapid advance in technologies for whole slide imaging, there is still no scientific consensus on the recommended methodology for image quality assessment of digital pathology slides. For medical images in general, it has been recommended to assess image quality in terms of doctors’ success rates in performing a specific clinical task while using the images (clinical image quality, cIQ). However, digital pathology is a new modality, and even identifying the appropriate task is difficult. In an alternative common approach, humans are asked to do a simpler task such as rating overall image quality (perceived image quality, pIQ), but this carries the risk of clinically irrelevant findings because the relationship between pIQ and cIQ is unknown. In this study, we explored three experimental protocols: (1) performing a clinical task (detecting inclusion bodies), (2) rating image similarity and preference, and (3) rating overall image quality. Additionally, within protocol 1, overall quality ratings were also collected (task-aware pIQ). The experiments were carried out by diagnostic veterinary pathologists evaluating the quality of hematoxylin and eosin-stained digital pathology slides of animal tissue samples under several common image alterations: additive noise, blurring, change in gamma, change in color saturation, and JPEG compression. While our experiments were small and prevent drawing strong conclusions, the results suggest the need to define a clinical task. Importantly, the pIQ data collected under protocols 2 and 3 did not always rank the image alterations the same as their cIQ from protocol 1, warning against using conventional pIQ to predict cIQ. At the same time, there was a correlation between the cIQ and task-aware pIQ ratings from protocol 1, suggesting that the clinical experiment context (set by specifying the clinical task) may affect human visual attention and focus their criteria of image quality. Further research is needed to assess whether, and for which purposes (e.g., preclinical testing), task-aware pIQ ratings could substitute for cIQ for a given clinical task.
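As a hedged illustration of the stimulus preparation described above (not the study's actual pipeline), the sketch below applies the five named alterations to an H&E tile with Pillow and NumPy; the input file name and all parameter values are hypothetical.

```python
# Hypothetical sketch of the five alterations named in the abstract
# (additive noise, blurring, gamma change, saturation change, JPEG compression),
# applied to an H&E tile with Pillow/NumPy. File name and parameter values are
# illustrative assumptions, not those used in the study.
import io
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter

def add_noise(img, sigma=10.0):
    """Additive Gaussian noise."""
    arr = np.asarray(img, dtype=np.float32)
    arr += np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

def blur(img, radius=2.0):
    """Gaussian blurring."""
    return img.filter(ImageFilter.GaussianBlur(radius))

def change_gamma(img, gamma=1.5):
    """Gamma change via a per-pixel power law."""
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return Image.fromarray((255.0 * arr ** gamma).astype(np.uint8))

def change_saturation(img, factor=0.5):
    """Color saturation change (factor < 1 desaturates)."""
    return ImageEnhance.Color(img).enhance(factor)

def jpeg_compress(img, quality=30):
    """JPEG compression at a fixed quality setting."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf)

tile = Image.open("he_tile.png").convert("RGB")  # hypothetical input tile
altered = {
    "noise": add_noise(tile),
    "blur": blur(tile),
    "gamma": change_gamma(tile),
    "saturation": change_saturation(tile),
    "jpeg": jpeg_compress(tile),
}
```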
Exploring Prediction Uncertainty in Machine Translation Quality Estimation
Machine Translation Quality Estimation is a notoriously difficult task, which
lessens its usefulness in real-world translation environments. Such scenarios
can be improved if quality predictions are accompanied by a measure of
uncertainty. However, models in this task are traditionally evaluated only in
terms of point estimate metrics, which do not take prediction uncertainty into
account. We investigate probabilistic methods for Quality Estimation that can
provide well-calibrated uncertainty estimates and evaluate them in terms of
their full posterior predictive distributions. We also show how this posterior
information can be useful in an asymmetric risk scenario, which aims to capture
typical situations in translation workflows.
Comment: Proceedings of CoNLL 201
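As a minimal sketch of the general approach (not the models evaluated in the paper), the code below fits a Gaussian process regressor on placeholder QE features to obtain a Gaussian posterior predictive, then converts it into a point estimate under an asymmetric linear loss; the features, scores, and loss weights are all assumed for illustration.

```python
# Minimal sketch, not the paper's models: a Gaussian process regressor over
# placeholder QE feature vectors yields a Gaussian posterior predictive per
# sentence; an asymmetric linear (lin-lin) loss then turns that distribution
# into a risk-aware point estimate. Feature dimensionality, scores, and the
# loss weights are illustrative assumptions.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 17))    # stand-in for sentence-level QE features
y_train = rng.uniform(0, 1, size=200)   # stand-in for quality scores (e.g. HTER-like)
X_test = rng.normal(size=(50, 17))

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X_train, y_train)
mean, std = gp.predict(X_test, return_std=True)  # posterior predictive mean and spread

def asymmetric_point_estimate(mean, std, w_over=3.0, w_under=1.0):
    """Minimiser of expected lin-lin loss under a Gaussian predictive:
    the tau-quantile with tau = w_under / (w_over + w_under), so heavier
    penalties for over-prediction pull the estimate downwards."""
    tau = w_under / (w_over + w_under)
    return mean + std * norm.ppf(tau)

y_hat = asymmetric_point_estimate(mean, std)
```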
A Glimpse Far into the Future: Understanding Long-term Crowd Worker Quality
Microtask crowdsourcing is increasingly critical to the creation of extremely
large datasets. As a result, crowd workers spend weeks or months repeating the
exact same tasks, making it necessary to understand their behavior over these
long periods of time. We utilize three large, longitudinal datasets of nine
million annotations collected from Amazon Mechanical Turk to examine claims
that workers fatigue or satisfice over these long periods, producing lower
quality work. We find that, contrary to these claims, workers are extremely
stable in their quality over the entire period. To understand whether workers
set their quality based on the task's requirements for acceptance, we then
perform an experiment where we vary the required quality for a large
crowdsourcing task. Workers did not adjust their quality based on the
acceptance threshold: workers who were above the threshold continued working at
their usual quality level, and workers below the threshold self-selected
themselves out of the task. Capitalizing on this consistency, we demonstrate
that it is possible to predict workers' long-term quality using just a glimpse
of their quality on the first five tasks.
Comment: 10 pages, 11 figures, accepted CSCW 201
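As a toy illustration of the "glimpse" idea (not the paper's exact analysis), the sketch below compares each worker's accuracy on their first five tasks with their accuracy over the rest of their history; the annotation log is a hypothetical example.

```python
# Illustrative sketch, not the paper's exact analysis: score each worker's
# "glimpse" (mean correctness on their first five tasks) and compare it with
# their quality over the rest of their history. The annotation log below is a
# hypothetical toy example of gold-checked correctness flags.
import numpy as np
from scipy.stats import spearmanr

# worker_id -> chronological 0/1 correctness flags against gold labels
annotation_log = {
    "w001": [1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1],
    "w002": [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
    "w003": [1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0],
    "w004": [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1],
}

glimpse = [np.mean(h[:5]) for h in annotation_log.values()]    # first five tasks
long_term = [np.mean(h[5:]) for h in annotation_log.values()]  # remaining tasks

rho, p = spearmanr(glimpse, long_term)
print(f"Spearman correlation between glimpse and long-term quality: {rho:.2f}")
```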
Findings of the 2019 Conference on Machine Translation (WMT19)
This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019.
Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task was also opened up to additional test suites to probe specific aspects of translation.
