1,903 research outputs found
Time-Sensitive Bayesian Information Aggregation for Crowdsourcing Systems
Crowdsourcing systems commonly face the problem of aggregating multiple
judgments provided by potentially unreliable workers. In addition, several
aspects of the design of efficient crowdsourcing processes, such as defining
worker's bonuses, fair prices and time limits of the tasks, involve knowledge
of the likely duration of the task at hand. Bringing this together, in this
work we introduce a new time--sensitive Bayesian aggregation method that
simultaneously estimates a task's duration and obtains reliable aggregations of
crowdsourced judgments. Our method, called BCCTime, builds on the key insight
that the time taken by a worker to perform a task is an important indicator of
the likely quality of the produced judgment. To capture this, BCCTime uses
latent variables to represent the uncertainty about the workers' completion
time, the tasks' duration and the workers' accuracy. To relate the quality of a
judgment to the time a worker spends on a task, our model assumes that each
task is completed within a latent time window within which all workers with a
propensity to genuinely attempt the labelling task (i.e., no spammers) are
expected to submit their judgments. In contrast, workers with a lower
propensity to valid labeling, such as spammers, bots or lazy labelers, are
assumed to perform tasks considerably faster or slower than the time required
by normal workers. Specifically, we use efficient message-passing Bayesian
inference to learn approximate posterior probabilities of (i) the confusion
matrix of each worker, (ii) the propensity to valid labeling of each worker,
(iii) the unbiased duration of each task and (iv) the true label of each task.
Using two real-world public datasets for entity linking tasks, we show that
BCCTime produces up to 11% more accurate classifications and up to 100% more
informative estimates of a task's duration compared to state-of-the-art
methods
Leveraging Crowdsourcing Data For Deep Active Learning - An Application: Learning Intents in Alexa
This paper presents a generic Bayesian framework that enables any deep
learning model to actively learn from targeted crowds. Our framework inherits
from recent advances in Bayesian deep learning, and extends existing work by
considering the targeted crowdsourcing approach, where multiple annotators with
unknown expertise contribute an uncontrolled amount (often limited) of
annotations. Our framework leverages the low-rank structure in annotations to
learn individual annotator expertise, which then helps to infer the true labels
from noisy and sparse annotations. It provides a unified Bayesian model to
simultaneously infer the true labels and train the deep learning model in order
to reach an optimal learning efficacy. Finally, our framework exploits the
uncertainty of the deep learning model during prediction as well as the
annotators' estimated expertise to minimize the number of required annotations
and annotators for optimally training the deep learning model.
We evaluate the effectiveness of our framework for intent classification in
Alexa (Amazon's personal assistant), using both synthetic and real-world
datasets. Experiments show that our framework can accurately learn annotator
expertise, infer true labels, and effectively reduce the amount of annotations
in model training as compared to state-of-the-art approaches. We further
discuss the potential of our proposed framework in bridging machine learning
and crowdsourcing towards improved human-in-the-loop systems
Analysis of Crowdsourced Sampling Strategies for HodgeRank with Sparse Random Graphs
Crowdsourcing platforms are now extensively used for conducting subjective
pairwise comparison studies. In this setting, a pairwise comparison dataset is
typically gathered via random sampling, either \emph{with} or \emph{without}
replacement. In this paper, we use tools from random graph theory to analyze
these two random sampling methods for the HodgeRank estimator. Using the
Fiedler value of the graph as a measurement for estimator stability
(informativeness), we provide a new estimate of the Fiedler value for these two
random graph models. In the asymptotic limit as the number of vertices tends to
infinity, we prove the validity of the estimate. Based on our findings, for a
small number of items to be compared, we recommend a two-stage sampling
strategy where a greedy sampling method is used initially and random sampling
\emph{without} replacement is used in the second stage. When a large number of
items is to be compared, we recommend random sampling with replacement as this
is computationally inexpensive and trivially parallelizable. Experiments on
synthetic and real-world datasets support our analysis
Modelling Instance-Level Annotator Reliability for Natural Language Labelling Tasks
When constructing models that learn from noisy labels produced by multiple
annotators, it is important to accurately estimate the reliability of
annotators. Annotators may provide labels of inconsistent quality due to their
varying expertise and reliability in a domain. Previous studies have mostly
focused on estimating each annotator's overall reliability on the entire
annotation task. However, in practice, the reliability of an annotator may
depend on each specific instance. Only a limited number of studies have
investigated modelling per-instance reliability and these only considered
binary labels. In this paper, we propose an unsupervised model which can handle
both binary and multi-class labels. It can automatically estimate the
per-instance reliability of each annotator and the correct label for each
instance. We specify our model as a probabilistic model which incorporates
neural networks to model the dependency between latent variables and instances.
For evaluation, the proposed method is applied to both synthetic and real data,
including two labelling tasks: text classification and textual entailment.
Experimental results demonstrate our novel method can not only accurately
estimate the reliability of annotators across different instances, but also
achieve superior performance in predicting the correct labels and detecting the
least reliable annotators compared to state-of-the-art baselines.Comment: 9 pages, 1 figures, 10 tables, 2019 Annual Conference of the North
American Chapter of the Association for Computational Linguistics (NAACL2019
Fusing Continuous-valued Medical Labels using a Bayesian Model
With the rapid increase in volume of time series medical data available
through wearable devices, there is a need to employ automated algorithms to
label data. Examples of labels include interventions, changes in activity (e.g.
sleep) and changes in physiology (e.g. arrhythmias). However, automated
algorithms tend to be unreliable resulting in lower quality care. Expert
annotations are scarce, expensive, and prone to significant inter- and
intra-observer variance. To address these problems, a Bayesian
Continuous-valued Label Aggregator(BCLA) is proposed to provide a reliable
estimation of label aggregation while accurately infer the precision and bias
of each algorithm. The BCLA was applied to QT interval (pro-arrhythmic
indicator) estimation from the electrocardiogram using labels from the 2006
PhysioNet/Computing in Cardiology Challenge database. It was compared to the
mean, median, and a previously proposed Expectation Maximization (EM) label
aggregation approaches. While accurately predicting each labelling algorithm's
bias and precision, the root-mean-square error of the BCLA was
11.780.63ms, significantly outperforming the best Challenge entry
(15.372.13ms) as well as the EM, mean, and median voting strategies
(14.760.52ms, 17.610.55ms, and 14.430.57ms respectively with
)
- …