Fault-Tolerant Entity Resolution with the Crowd
In recent years, crowdsourcing has increasingly been applied as a means to enhance
data quality. Although the crowd generates insightful information, especially
for complex problems such as entity resolution (ER), the output quality of
crowd workers is often noisy. That is, workers may unintentionally generate
false or contradictory data even for simple tasks. The challenge that we
address in this paper is how to minimize the cost for task requesters while
maximizing ER result quality under the assumption of unreliable input from the
crowd. For that purpose, we first establish how to deduce a consistent ER
solution from noisy worker answers as part of the data interpretation problem.
We then focus on the next-crowdsource problem which is to find the next task
that maximizes the information gain of the ER result for the minimal additional
cost. We compare our robust data interpretation strategies to alternative
state-of-the-art approaches that do not incorporate the notion of
fault-tolerance, i.e., the robustness to noise. In our experimental evaluation
we show that our approaches yield a quality improvement of at least 20% for two
real-world datasets. Furthermore, we examine task-to-worker assignment
strategies as well as task parallelization techniques in terms of their cost
and quality trade-offs in this paper. Based on both synthetic and crowdsourced
datasets, we then draw conclusions on how to minimize cost while maintaining
high-quality ER results.
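The next-crowdsource idea above can be sketched as a greedy, uncertainty-driven selection. This is a generic illustration, not the paper's exact information-gain measure; the names (`p_match`, `next_crowdsource`) and the entropy-per-cost criterion are assumptions:

```python
import math

def entropy(p):
    """Binary entropy of a match probability (0 log 0 treated as 0)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def next_crowdsource(candidates, cost=1.0):
    """Pick the candidate pair whose crowd answer is expected to be most
    informative per unit cost: here, the pair whose current match
    probability is most uncertain (highest entropy)."""
    return max(candidates, key=lambda c: entropy(c["p_match"]) / cost)

pairs = [
    {"pair": ("a1", "a2"), "p_match": 0.95},
    {"pair": ("a1", "a3"), "p_match": 0.50},  # most uncertain pair
    {"pair": ("a2", "a3"), "p_match": 0.10},
]
best = next_crowdsource(pairs)
```

Under this criterion the pair with probability closest to 0.5 is asked next; a real system would also fold in the expected effect of the answer on the overall clustering.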
Crowd-Powered Data Mining
Many data mining tasks, such as sentiment analysis and image classification,
cannot be completely addressed by automated processes. Crowdsourcing
is an effective way to harness the human cognitive ability to process these
machine-hard tasks. Thanks to public crowdsourcing platforms, e.g., Amazon
Mechanical Turk and CrowdFlower, we can easily involve hundreds of thousands
of ordinary workers (i.e., the crowd) to address these machine-hard tasks. In
this tutorial, we will survey and synthesize a wide spectrum of existing
studies on crowd-powered data mining. We first give an overview of
crowdsourcing, and then summarize the fundamental techniques, including quality
control, cost control, and latency control, which must be considered in
crowdsourced data mining. Next we review crowd-powered data mining operations,
including classification, clustering, pattern mining, machine learning using
the crowd (including deep learning, transfer learning and semi-supervised
learning) and knowledge discovery. Finally, we discuss emerging challenges
in crowdsourced data mining.
CrowdER: Crowdsourcing Entity Resolution
Entity resolution is central to data integration and data cleaning.
Algorithmic approaches have been improving in quality, but remain far from
perfect. Crowdsourcing platforms offer a more accurate but expensive (and slow)
way to bring human insight into the process. Previous work has proposed
batching verification tasks for presentation to human workers but even with
batching, a human-only approach is infeasible for data sets of even moderate
size, due to the large numbers of matches to be tested. Instead, we propose a
hybrid human-machine approach in which machines are used to do an initial,
coarse pass over all the data, and people are used to verify only the most
likely matching pairs. We show that for such a hybrid system, generating the
minimum number of verification tasks of a given size is NP-Hard, but we develop
a novel two-tiered heuristic approach for creating batched tasks. We describe
this method, and present the results of extensive experiments on real data sets
using a popular crowdsourcing platform. The experiments show that our hybrid
approach achieves both good efficiency and high accuracy compared to
machine-only or human-only alternatives.
Comment: VLDB201
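The hybrid pipeline described above can be sketched in a few lines: a cheap machine pass filters the quadratic space of pairs, and only the surviving pairs become crowd verification tasks. The similarity measure (difflib's `SequenceMatcher`) and the 0.6 threshold are placeholder assumptions, not CrowdER's actual matcher or batching scheme:

```python
from difflib import SequenceMatcher

def coarse_pass(records, threshold=0.6):
    """Machine pass: keep only record pairs whose string similarity
    exceeds the threshold; only these pairs are sent to the crowd
    for verification."""
    likely = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            sim = SequenceMatcher(None, records[i], records[j]).ratio()
            if sim >= threshold:
                likely.append((records[i], records[j]))
    return likely

records = ["iPhone 4 16GB", "iPhone 4th gen 16GB", "Galaxy S II"]
tasks = coarse_pass(records)  # only the plausible duplicate pair survives
```

The machine pass here is quadratic for clarity; in practice blocking or indexing keeps the candidate generation tractable, and the crowd sees batched tasks rather than single pairs.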
CrowdGather: Entity Extraction over Structured Domains
Crowdsourced entity extraction is often used to acquire data for many
applications, including recommendation systems, construction of aggregated
listings and directories, and knowledge base construction. Current solutions
focus on entity extraction using a single query, e.g., only using "give me
another restaurant", when assembling a list of all restaurants. Due to the cost
of human labor, solutions that focus on a single query can be highly
impractical.
In this paper, we leverage the fact that entity extraction often focuses on
{\em structured domains}, i.e., domains that are described by a collection of
attributes, each potentially exhibiting hierarchical structure. Given such a
domain, we enable a richer space of queries, e.g., "give me another Moroccan
restaurant in Manhattan that does takeout". Naturally, enabling a richer space
of queries comes with a host of issues, especially since many queries return
empty answers. We develop new statistical tools that enable us to reason about
the gain of issuing {\em additional queries} given little to no information,
and show how we can exploit the overlaps across the results of queries for
different points of the data domain to obtain accurate estimates of the gain.
We cast the problem of {\em budgeted entity extraction} over large domains as
an adaptive optimization problem that seeks to maximize the number of extracted
entities, while minimizing the overall extraction costs. We evaluate our
techniques with experiments on both synthetic and real-world datasets,
demonstrating a yield of up to 4X over competing approaches for the same
budget.
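One simple way to reason about the marginal gain of issuing another query, in the spirit of the estimators above, is a Good-Turing style rule: the chance that the next crowd answer is a new entity is roughly the fraction of entities seen exactly once so far. This is a generic sketch, not the paper's actual estimator:

```python
from collections import Counter

def expected_new_entities(observed):
    """Good-Turing style estimate of the probability that the next
    crowd answer names an unseen entity: the fraction of answers so
    far that mention an entity seen exactly once."""
    if not observed:
        return 1.0  # nothing seen yet: the next answer is surely new
    counts = Counter(observed)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(observed)

obs = ["cafe_a", "cafe_b", "cafe_a", "cafe_c"]
gain = expected_new_entities(obs)
```

When this estimate drops below the per-query cost-benefit threshold for one point of the domain, a budgeted extractor would redirect its remaining queries to a more promising attribute combination.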
Clustering with Noisy Queries
In this paper, we initiate a rigorous theoretical study of clustering with
noisy queries (or a faulty oracle). Given a set of elements, our goal is to
recover the true clustering by asking a minimum number of pairwise queries to an
oracle. The oracle can answer queries of the form: "do elements u and v belong
to the same cluster?" -- the queries can be asked interactively (adaptive
queries) or non-adaptively up-front, but its answers can be erroneous with
probability p. In this paper, we provide the first information-theoretic
lower bound on the number of queries for clustering with a noisy oracle in both
situations. We design novel algorithms that closely match this query complexity
lower bound, even when the number of clusters is unknown. Moreover, we design
computationally efficient algorithms both for the adaptive and non-adaptive
settings. The problem captures/generalizes multiple application scenarios. It
is directly motivated by the growing body of work that use crowdsourcing for
{\em entity resolution}, a fundamental and challenging data mining task aimed
to identify all records in a database referring to the same entity. Here crowd
represents the noisy oracle, and the number of queries directly relates to the
cost of crowdsourcing. Another application comes from the problem of {\em sign
edge prediction} in social networks, where social interactions can be both
positive and negative, and one must identify the sign of all pairwise
interactions by querying a few pairs. Furthermore, clustering with a noisy oracle
is intimately connected to correlation clustering, leading to improvements
therein. Finally, it introduces a new direction of study in the popular {\em
stochastic block model} where one has an incomplete stochastic block model
matrix to recover the clusters.
Comment: Prior versions of some of the results have appeared before in
arXiv:1604.01839. In this version we rewrote several proofs for clarity and
included many new results.
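A minimal way to see how repetition combats a faulty oracle (this is not the paper's query-optimal algorithms, just the basic repetition idea) is majority voting over repeated same-cluster queries; the `truth` table and record names are illustrative:

```python
import random

def noisy_oracle(u, v, truth, p, rng):
    """Answers whether u and v are in the same cluster, flipping the
    true answer with error probability p."""
    same = truth[u] == truth[v]
    return same if rng.random() >= p else not same

def robust_query(u, v, truth, p, rng, repeats=15):
    """Reduce noise by repeating the pairwise query and taking a
    majority vote over the (independent) answers."""
    votes = sum(noisy_oracle(u, v, truth, p, rng) for _ in range(repeats))
    return votes > repeats // 2

rng = random.Random(0)
truth = {"r1": 0, "r2": 0, "r3": 1}
same = robust_query("r1", "r2", truth, p=0.2, rng=rng)
```

Naive repetition multiplies the query count by a log factor; the point of the paper is that smarter adaptive schemes can approach the information-theoretic lower bound instead.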
Subjective Knowledge Acquisition and Enrichment Powered By Crowdsourcing
Knowledge bases (KBs) have attracted increasing attention due to their great
success in various areas, such as Web and mobile search. Existing KBs are
restricted to objective factual knowledge, such as city population or fruit
shape, whereas subjective knowledge, such as big city, which is commonly
mentioned in Web and mobile queries, has been neglected. Subjective knowledge
differs from objective knowledge in that it has no documented or observed
ground truth. Instead, the truth relies on people's dominant opinion. Thus, we
can use crowdsourcing to collect opinions from the crowd. In our work,
we propose a system, called Crowdsourced Subjective Knowledge Acquisition
(CoSKA), for subjective knowledge acquisition powered by crowdsourcing and
existing KBs. The acquired knowledge can be used to enrich existing KBs in the
subjective dimension, bridging the gap between existing objective knowledge
and subjective queries. The main challenge for CoSKA is the conflict between
the large scale of knowledge facts and limited crowdsourcing resources. To
address this challenge, we define knowledge inference rules and then select
the seed knowledge judiciously for crowdsourcing to maximize the inference
power under the resource constraint. Our experimental results on a real
knowledge base and crowdsourcing platform verify the effectiveness of the
CoSKA system.
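A toy illustration of how inference rules amplify a small crowdsourced seed: derived facts come for free instead of costing crowd tasks. The `bigger_than` relation and the transitivity rule are hypothetical examples, not CoSKA's actual rule set:

```python
def infer_closure(facts):
    """Apply a transitivity rule until fixpoint: from crowdsourced
    comparisons like ('A', 'bigger_than', 'B') and ('B', 'bigger_than',
    'C'), derive ('A', 'bigger_than', 'C') without asking the crowd."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (a, r1, b) in list(derived):
            for (c, r2, d) in list(derived):
                if b == c and r1 == r2 and (a, r1, d) not in derived:
                    derived.add((a, r1, d))
                    changed = True
    return derived

seed = {("Tokyo", "bigger_than", "Osaka"), ("Osaka", "bigger_than", "Nara")}
all_facts = infer_closure(seed)  # two crowd answers yield a third fact
```

Seed selection then becomes an optimization problem: under a fixed crowd budget, choose the facts whose closure covers the most knowledge.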
Active Learning for Crowd-Sourced Databases
Crowd-sourcing has become a popular means of acquiring labeled data for a
wide variety of tasks where humans are more accurate than computers, e.g.,
labeling images, matching objects, or analyzing sentiment. However, relying
solely on the crowd is often impractical even for data sets with thousands of
items, due to time and cost constraints of acquiring human input (which cost
pennies and minutes per label). In this paper, we propose algorithms for
integrating machine learning into crowd-sourced databases, with the goal of
allowing crowd-sourcing applications to scale, i.e., to handle larger datasets
at lower costs. The key observation is that, in many of the above tasks, humans
and machine learning algorithms can be complementary, as humans are often more
accurate but slow and expensive, while algorithms are usually less accurate,
but faster and cheaper.
Based on this observation, we present two new active learning algorithms to
combine humans and algorithms together in a crowd-sourced database. Our
algorithms are based on the theory of non-parametric bootstrap, which makes our
results applicable to a broad class of machine learning models. Our results, on
three real-life datasets collected with Amazon's Mechanical Turk, and on 15
well-known UCI data sets, show that our methods on average ask humans to label
one to two orders of magnitude fewer items to achieve the same accuracy as a
baseline that labels random images, and two to eight times fewer questions than
previous active learning schemes.
Comment: A shorter version of this manuscript has been published in
Proceedings of Very Large Data Bases 2015, entitled "Scaling Up
Crowd-Sourcing to Very Large Datasets: A Case for Active Learning".
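The "ask humans only where the model is unsure" idea can be sketched with generic uncertainty sampling; the paper's actual algorithms use the non-parametric bootstrap rather than this simple margin rule, and the item names and budget are illustrative:

```python
def pick_items_to_label(model_probs, budget):
    """Uncertainty sampling: send to the crowd the items whose predicted
    match probability is closest to 0.5, i.e., where the model is least
    sure, up to the labeling budget."""
    ranked = sorted(model_probs, key=lambda kv: abs(kv[1] - 0.5))
    return [item for item, _ in ranked[:budget]]

probs = [("img1", 0.98), ("img2", 0.52), ("img3", 0.07), ("img4", 0.45)]
to_label = pick_items_to_label(probs, budget=2)
```

Confident predictions (img1, img3) are accepted from the model for free; human effort is spent only on the ambiguous items, which is how such systems label orders of magnitude fewer items than random selection.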
Semisupervised Clustering by Queries and Locally Encodable Source Coding
Source coding is the canonical problem of data compression in information
theory. In a locally encodable source coding, each compressed bit depends on
only few bits of the input. In this paper, we show that a recently popular
model of semi-supervised clustering is equivalent to locally encodable source
coding. In this model, the task is to perform multiclass labeling of unlabeled
elements. At the beginning, we can ask in parallel a set of simple queries to
an oracle who provides (possibly erroneous) binary answers to the queries. The
queries cannot involve more than two (or a fixed constant number of) elements.
Now the labeling of all the elements (or clustering) must be performed based on
the noisy query answers. The goal is to recover all the correct labelings while
minimizing the number of such queries. The equivalence to locally encodable
source codes leads us to find lower bounds on the number of queries required in
a variety of scenarios. We provide querying schemes based on pairwise `same
cluster' queries and pairwise AND queries, and show provable performance
guarantees for each of the schemes.
Comment: 16 pages, 11 figures. Some of the results of this paper have appeared
in the proceedings of the 2017 Conference on Neural Information Processing
Systems (NeurIPS 2017).
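The difference between the two query types can be seen on a tiny binary-labeled example: same-cluster answers are invariant under flipping every label, while AND queries break that symmetry and can pin down the labels themselves. Element names and labels here are illustrative:

```python
def same_cluster(u, v, labels):
    """Pairwise same-cluster query: 1 iff u and v share a label."""
    return int(labels[u] == labels[v])

def and_query(u, v, labels):
    """Pairwise AND query on binary labels: 1 iff both labels are 1."""
    return labels[u] & labels[v]

labels = {"x": 1, "y": 0, "z": 1}
flipped = {k: 1 - v for k, v in labels.items()}  # complement labeling

# Same-cluster answers cannot distinguish a labeling from its complement;
# an AND query on two label-1 elements can.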
End-to-End Entity Resolution for Big Data: A Survey
One of the most important tasks for improving data quality and the
reliability of data analytics results is Entity Resolution (ER). ER aims to
identify different descriptions that refer to the same real-world entity, and
remains a challenging problem. While previous works have studied specific
aspects of ER (and mostly in traditional settings), in this survey, we provide
for the first time an end-to-end view of modern ER workflows, and of the novel
aspects of entity indexing and matching methods in order to cope with more than
one of the Big Data characteristics simultaneously. We present the basic
concepts, processing steps and execution strategies that have been proposed by
different communities, i.e., database, semantic Web and machine learning, in
order to cope with the loose structuredness, extreme diversity, high speed and
large scale of entity descriptions used by real-world applications. Finally, we
provide a synthetic discussion of the existing approaches, and conclude with a
detailed presentation of open research directions.
A Survey on Data Collection for Machine Learning: a Big Data -- AI Integration Perspective
Data collection is a major bottleneck in machine learning and an active
research topic in multiple communities. There are largely two reasons data
collection has recently become a critical issue. First, as machine learning is
becoming more widely-used, we are seeing new applications that do not
necessarily have enough labeled data. Second, unlike traditional machine
learning, deep learning techniques automatically generate features, which saves
feature engineering costs, but in return may require larger amounts of labeled
data. Interestingly, recent research in data collection comes not only from the
machine learning, natural language, and computer vision communities, but also
from the data management community due to the importance of handling large
amounts of data. In this survey, we perform a comprehensive study of data
collection from a data management point of view. Data collection largely
consists of data acquisition, data labeling, and improvement of existing data
or models. We provide a research landscape of these operations, provide
guidelines on which technique to use when, and identify interesting research
challenges. The integration of machine learning and data management for data
collection is part of a larger trend of Big data and Artificial Intelligence
(AI) integration and opens many opportunities for new research.
Comment: 20 pages.