T-Crowd: Effective Crowdsourcing for Tabular Data
Crowdsourcing employs human workers to solve computer-hard problems, such as
data cleaning, entity resolution, and sentiment analysis. When crowdsourcing
tabular data, e.g., the attribute values of an entity set, a worker's answers
on the different attributes (e.g., the nationality and age of a celebrity)
are often treated independently. This assumption is not always true and can
lead to suboptimal crowdsourcing performance. In this paper, we present the
T-Crowd system, which takes into consideration the intricate relationships
among tasks, in order to converge faster to their true values. In particular,
T-Crowd integrates each worker's answers on different attributes to effectively
learn his/her trustworthiness and the true data values. The attribute
relationship information is also used to guide task allocation to workers.
Finally, T-Crowd seamlessly supports categorical and continuous attributes,
which are the two main datatypes found in typical databases. Our extensive
experiments on real and synthetic datasets show that T-Crowd outperforms
state-of-the-art methods in terms of truth inference and reducing the cost of
crowdsourcing.
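The idea of learning one trustworthiness value per worker from answers on both categorical and continuous attributes can be illustrated with a small sketch. This is not the T-Crowd model itself, which the abstract does not specify; it swaps in a generic iterative weighted-vote scheme. The function name `infer_truths`, the input layout, the error normalisation, and the weight floor of 0.05 are all assumptions made for illustration.

```python
from collections import defaultdict

def infer_truths(answers, continuous, n_iters=10):
    """answers: list of (worker, cell, value), where cell = (entity, attribute).
    continuous: set of attribute names whose values are numeric.
    Hypothetical helper, not the T-Crowd algorithm."""
    workers = {w for w, _, _ in answers}
    weight = {w: 1.0 for w in workers}          # one trust weight per worker
    by_cell = defaultdict(list)
    for w, cell, v in answers:
        by_cell[cell].append((w, v))
    truths = {}
    for _ in range(n_iters):
        # Step 1: estimate each cell's value from the weighted answers.
        for cell, votes in by_cell.items():
            if cell[1] in continuous:
                total = sum(weight[w] for w, _ in votes)
                truths[cell] = sum(weight[w] * v for w, v in votes) / total
            else:
                score = defaultdict(float)
                for w, v in votes:
                    score[v] += weight[w]
                truths[cell] = max(score, key=score.get)
        # Step 2: re-estimate each worker's weight from ALL of their answers,
        # pooling categorical agreement and normalised continuous error.
        quality = defaultdict(list)
        for cell, votes in by_cell.items():
            if cell[1] in continuous:
                spread = max(abs(v - truths[cell]) for _, v in votes) or 1.0
                for w, v in votes:
                    quality[w].append(1.0 - abs(v - truths[cell]) / spread)
            else:
                for w, v in votes:
                    quality[w].append(1.0 if v == truths[cell] else 0.0)
        for w in workers:
            # Floor (assumed) keeps every weight positive.
            weight[w] = max(sum(quality[w]) / len(quality[w]), 0.05)
    return truths, weight
```

Because the weight is shared across attributes, a worker who is wrong about nationality also loses influence on the age estimate, which is the kind of cross-attribute coupling the abstract describes.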
Enabling Quality Control for Entity Resolution: A Human and Machine Cooperation Framework
Even though many machine algorithms have been proposed for entity resolution,
it remains very challenging to find a solution with quality guarantees. In this
paper, we propose a novel HUman and Machine cOoperation (HUMO) framework for
entity resolution (ER), which divides an ER workload between the machine and
the human. HUMO enables a mechanism for quality control that can flexibly
enforce both precision and recall levels. We introduce the optimization problem
of HUMO, minimizing human cost given a quality requirement, and then present
three optimization approaches: a conservative baseline one purely based on the
monotonicity assumption of precision, a more aggressive one based on sampling
and a hybrid one that can take advantage of the strengths of both previous
approaches. Finally, we demonstrate by extensive experiments on real and
synthetic datasets that HUMO can achieve high-quality results with reasonable
return on investment (ROI) in terms of human cost, and it performs considerably
better than the state-of-the-art alternatives in quality control.
Comment: 12 pages, 11 figures. Camera-ready version of the paper submitted to
ICDE 2018, In Proceedings of the 34th IEEE International Conference on Data
Engineering (ICDE 2018)
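The optimization problem stated in the abstract, minimizing human cost subject to precision and recall requirements, can be sketched as follows. The abstract does not say how the workload is split, so this sketch assumes pairs are ranked by a similarity score, the machine labels the low and high ends, and humans (assumed to be always correct) label the middle zone. The function name `choose_human_zone` and the exhaustive search over a labeled sample are illustrative assumptions, not the HUMO approaches themselves.

```python
def choose_human_zone(sample, min_precision, min_recall):
    """sample: list of (similarity, is_match) pairs with known labels.
    Returns (lo, hi): pairs with lo <= similarity < hi go to humans,
    pairs below lo are auto-labeled non-match, pairs at or above hi
    are auto-labeled match. Hypothetical helper, not the HUMO method."""
    cuts = sorted({s for s, _ in sample}) + [float("inf")]
    total_matches = sum(1 for _, m in sample if m)
    best = None
    for i, lo in enumerate(cuts):
        for hi in cuts[i:]:
            auto_pos = [m for s, m in sample if s >= hi]
            human = [m for s, m in sample if lo <= s < hi]
            # Humans are assumed correct: they mark only true matches.
            tp = sum(auto_pos) + sum(human)
            predicted = len(auto_pos) + sum(human)
            precision = tp / predicted if predicted else 1.0
            recall = tp / total_matches if total_matches else 1.0
            feasible = precision >= min_precision and recall >= min_recall
            # Human cost is the number of pairs sent to humans.
            if feasible and (best is None or len(human) < best[0]):
                best = (len(human), lo, hi)
    if best is None:
        # Fall back to sending everything to humans.
        return cuts[0], cuts[-1]
    return best[1], best[2]
```

Tightening the quality requirement widens the human zone in this sketch, while relaxing it lets the machine label more pairs on its own.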