Active Learning for Crowd-Sourced Databases
Crowd-sourcing has become a popular means of acquiring labeled data for a
wide variety of tasks where humans are more accurate than computers, e.g.,
labeling images, matching objects, or analyzing sentiment. However, relying
solely on the crowd is often impractical even for data sets with thousands of
items, due to the time and cost of acquiring human input (which costs pennies
and takes minutes per label). In this paper, we propose algorithms for
integrating machine learning into crowd-sourced databases, with the goal of
allowing crowd-sourcing applications to scale, i.e., to handle larger datasets
at lower costs. The key observation is that, in many of the above tasks, humans
and machine learning algorithms can be complementary, as humans are often more
accurate but slow and expensive, while algorithms are usually less accurate,
but faster and cheaper.
Based on this observation, we present two new active learning algorithms to
combine humans and algorithms together in a crowd-sourced database. Our
algorithms are based on the theory of non-parametric bootstrap, which makes our
results applicable to a broad class of machine learning models. Our results, on
three real-life datasets collected with Amazon's Mechanical Turk, and on 15
well-known UCI data sets, show that our methods on average ask humans to label
one to two orders of magnitude fewer items to achieve the same accuracy as a
baseline that labels randomly chosen items, and two to eight times fewer questions than
previous active learning schemes.
Comment: A shorter version of this manuscript has been published in
Proceedings of Very Large Data Bases 2015, entitled "Scaling Up
Crowd-Sourcing to Very Large Datasets: A Case for Active Learning".
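The core idea above, using bootstrap disagreement to decide which items still need human labels, can be sketched in a few lines. This is an illustrative sketch, not the paper's algorithm: the `fit`/`predict` toy threshold classifier and all function names are stand-ins.

```python
import random

def bootstrap_uncertainty(train, unlabeled, fit, predict, k=30, seed=0):
    """Estimate per-item uncertainty with the non-parametric bootstrap:
    train k models on resamples of the labeled data and measure how much
    their predictions disagree on each unlabeled item."""
    rng = random.Random(seed)
    votes = [[] for _ in unlabeled]
    for _ in range(k):
        resample = [train[rng.randrange(len(train))] for _ in train]
        model = fit(resample)
        for i, x in enumerate(unlabeled):
            votes[i].append(predict(model, x))
    def disagreement(vs):
        majority = max(set(vs), key=vs.count)
        return 1.0 - vs.count(majority) / len(vs)
    return [disagreement(vs) for vs in votes]

def pick_items_for_crowd(train, unlabeled, fit, predict, budget):
    """Route the `budget` most uncertain items to human labelers."""
    unc = bootstrap_uncertainty(train, unlabeled, fit, predict)
    return sorted(range(len(unlabeled)), key=lambda i: -unc[i])[:budget]

# Toy 1-D threshold classifier to exercise the sketch:
def fit(sample):
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:                       # degenerate resample
        return sum(x for x, _ in sample) / len(sample)
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def predict(threshold, x):
    return int(x > threshold)
```

Because the bootstrap only requires the ability to retrain on resamples, the same wrapper works for any model family, which is why the abstract can claim applicability to a broad class of learners.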
Getting It All from the Crowd
Hybrid human/computer systems promise to greatly expand the usefulness of
query processing by incorporating the crowd for data gathering and other tasks.
Such systems raise many database system implementation questions. Perhaps most
fundamental is that the closed world assumption underlying relational query
semantics does not hold in such systems. As a consequence the meaning of even
simple queries can be called into question. Furthermore query progress
monitoring becomes difficult due to non-uniformities in the arrival of
crowdsourced data and peculiarities of how people work in crowdsourcing
systems. To address these issues, we develop statistical tools that enable
users and systems developers to reason about tradeoffs between time/cost and
completeness. These tools can also help drive query execution and crowdsourcing
strategies. We evaluate our techniques using experiments on a popular
crowdsourcing platform.
Comment: 12 pages, 8 figures
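The statistical tools are not spelled out in the abstract; one classical way to reason about completeness when the closed world assumption fails is a species-estimation style estimator, which predicts how many distinct answers remain unseen from the repetition pattern of answers received so far. A minimal sketch, assuming the Chao1 estimator as a stand-in for the paper's actual machinery:

```python
from collections import Counter

def chao1_estimate(answers):
    """Estimate the total number of distinct answers (seen plus unseen)
    from a stream of crowd answers via the Chao1 species estimator:
        S_est = S_obs + f1^2 / (2 * f2)
    where f1 and f2 count answers seen exactly once and exactly twice."""
    freq = Counter(answers)
    s_obs = len(freq)
    f1 = sum(1 for c in freq.values() if c == 1)
    f2 = sum(1 for c in freq.values() if c == 2)
    if f2 == 0:
        return s_obs + f1 * (f1 - 1) / 2.0   # bias-corrected variant
    return s_obs + f1 * f1 / (2.0 * f2)

def completeness(answers):
    """Estimated fraction of the (open) answer set gathered so far."""
    return len(set(answers)) / chao1_estimate(answers)
```

A system can stop issuing crowd tasks once `completeness` crosses a target, trading cost against coverage in exactly the way the abstract describes.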
CrowdER: Crowdsourcing Entity Resolution
Entity resolution is central to data integration and data cleaning.
Algorithmic approaches have been improving in quality, but remain far from
perfect. Crowdsourcing platforms offer a more accurate but expensive (and slow)
way to bring human insight into the process. Previous work has proposed
batching verification tasks for presentation to human workers but even with
batching, a human-only approach is infeasible for data sets of even moderate
size, due to the large numbers of matches to be tested. Instead, we propose a
hybrid human-machine approach in which machines are used to do an initial,
coarse pass over all the data, and people are used to verify only the most
likely matching pairs. We show that for such a hybrid system, generating the
minimum number of verification tasks of a given size is NP-Hard, but we develop
a novel two-tiered heuristic approach for creating batched tasks. We describe
this method, and present the results of extensive experiments on real data sets
using a popular crowdsourcing platform. The experiments show that our hybrid
approach achieves both good efficiency and high accuracy compared to
machine-only or human-only alternatives.
Comment: VLDB201
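The hybrid pipeline, a coarse machine pass followed by batched human verification, can be sketched as below. The greedy packer is a deliberately simple stand-in for the paper's two-tiered heuristic (which the paper shows is needed because optimal batching is NP-Hard), and the Jaccard similarity is just one possible machine-pass measure.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity, a typical cheap machine-pass measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def machine_prune(records, sim, threshold):
    """Coarse machine pass: keep only pairs whose similarity clears the
    threshold, so humans never see the vast majority of non-matches."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if sim(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs

def greedy_batch(pairs, task_size):
    """Pack surviving candidate pairs into fixed-size verification tasks."""
    return [pairs[i:i + task_size] for i in range(0, len(pairs), task_size)]
```

Each element of `greedy_batch`'s output would become one HIT shown to a crowd worker, who confirms or rejects every pair in the batch.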
CLAMShell: Speeding up Crowds for Low-latency Data Labeling
Data labeling is a necessary but often slow process that impedes the
development of interactive systems for modern data analysis. Despite rising
demand for manual data labeling, there is a surprising lack of work addressing
its high and unpredictable latency. In this paper, we introduce CLAMShell, a
system that speeds up crowds in order to achieve consistently low-latency data
labeling. We offer a taxonomy of the sources of labeling latency and study
several large crowd-sourced labeling deployments to understand their empirical
latency profiles. Driven by these insights, we comprehensively tackle each
source of latency, both by developing novel techniques such as straggler
mitigation and pool maintenance and by optimizing existing methods such as
crowd retainer pools and active learning. We evaluate CLAMShell in simulation
and on live workers on Amazon's Mechanical Turk, demonstrating that our
techniques can provide an order of magnitude speedup and variance reduction
over existing crowdsourced labeling strategies
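Straggler mitigation, one of the techniques named above, amounts to re-issuing a task redundantly once its primary assignment has run too long and taking the earliest answer. A toy simulation of that policy (the cutoff value and the assumption that backups launch simultaneously at the cutoff are illustrative, not from the paper):

```python
def first_response_latency(latencies, straggler_cutoff):
    """Simulate straggler mitigation for one task: worker 0 gets the task
    first; if no answer arrives by `straggler_cutoff`, duplicate
    assignments go to the remaining workers and the earliest completion
    wins. Returns the observed task latency."""
    primary = latencies[0]
    if primary <= straggler_cutoff:
        return primary
    # Backups start at the cutoff, so their finish time is offset by it.
    backups = [straggler_cutoff + lat for lat in latencies[1:]]
    return min([primary] + backups)
```

The extra assignments cost money, but they cap the long tail of the latency distribution, which is what enables the variance reduction the abstract reports.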
BoostClean: Automated Error Detection and Repair for Machine Learning
Predictive models based on machine learning can be highly sensitive to data
errors. Training data are often combined from a variety of sources, each
susceptible to different types of inconsistencies, and as new data stream in
at prediction time, the model may encounter previously unseen
inconsistencies. An important class of such inconsistencies is domain value
violations, which occur when an attribute value falls outside its allowed
domain.
We explore automatically detecting and repairing such violations by leveraging
the often available clean test labels to determine whether a given detection
and repair combination will improve model accuracy. We present BoostClean which
automatically selects an ensemble of error detection and repair combinations
using statistical boosting. BoostClean selects this ensemble from an extensible
library that is pre-populated with general detection functions, including a
novel detector based on the Word2Vec deep learning model, which detects errors
across a diverse set of domains. Our evaluation on a collection of 12 datasets
from Kaggle, the UCI repository, real-world data analyses, and production
datasets shows that BoostClean can increase absolute prediction accuracy by up
to 9% over the best non-ensembled alternatives. Our optimizations, including
parallelism, materialization, and indexing techniques, show a 22.2x end-to-end
speedup on a 16-core machine.
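The boosting-flavored selection can be pictured as a greedy loop: at each round, add whichever detect-and-repair combination most improves accuracy on the held-out clean labels, on top of the combinations already chosen. A hedged sketch (the `evaluate` callback, combo names, and stopping rule are illustrative, not BoostClean's actual interface):

```python
def boost_select(combos, evaluate, rounds=3):
    """Greedily build an ensemble of detect-and-repair combos.
    `evaluate(chosen)` must return model accuracy on the clean test
    labels after applying the chosen combos in order; we stop when no
    candidate improves it."""
    chosen = []
    best_acc = evaluate(chosen)
    for _ in range(rounds):
        candidate = max(combos, key=lambda c: evaluate(chosen + [c]))
        acc = evaluate(chosen + [candidate])
        if acc <= best_acc:          # no combo helps any further
            break
        chosen.append(candidate)
        best_acc = acc
    return chosen, best_acc
```

Using held-out clean labels as the selection signal is what lets the system choose repairs by their effect on downstream accuracy rather than by data-quality heuristics alone.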
Power propagation time and lower bounds for power domination number
We present a counterexample to a lower bound for the power domination number
given in Liao, Power domination with bounded time constraints, J. Comb. Optim.
31 (2016) 725-742. We also define the power propagation time, using the power
domination propagation ideas in Liao and the (zero forcing) propagation time in
Hogben et al., Propagation time for zero forcing on a graph, Discrete Appl.
Math. 160 (2012) 1994-2005.
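Power propagation proceeds in two stages: an initial set S observes its closed neighborhood, and then, round by round, any observed vertex with exactly one unobserved neighbor forces that neighbor (the zero forcing rule). The propagation time is the number of forcing rounds. A small sketch of that definition on adjacency-list graphs (this is a direct reading of the standard definitions, not code from either cited paper):

```python
def power_propagation_time(adj, S):
    """Power propagation time of S in the graph given by adjacency dict
    `adj`: S first observes its closed neighborhood (domination step),
    then each round applies the zero forcing rule. Returns the number of
    rounds, or None if S does not power dominate the graph."""
    observed = set(S)
    for v in S:
        observed |= set(adj[v])
    t = 0
    while len(observed) < len(adj):
        forced = set()
        for v in observed:
            unobs = [u for u in adj[v] if u not in observed]
            if len(unobs) == 1:        # v forces its lone dark neighbor
                forced.add(unobs[0])
        if not forced:                 # propagation is stuck
            return None
        observed |= forced
        t += 1
    return t
```

For example, on the path 0-1-2-3 with S = {0}, the domination step observes {0, 1} and two forcing rounds finish the graph, so the propagation time is 2.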
PIQL: Success-Tolerant Query Processing in the Cloud
Newly-released web applications often succumb to a "Success Disaster," where
overloaded database machines and resulting high response times destroy a
previously good user experience. Unfortunately, the data independence provided
by a traditional relational database system, while useful for agile
development, only exacerbates the problem by hiding potentially expensive
queries under simple declarative expressions. As a result, developers of these
applications are increasingly abandoning relational databases in favor of
imperative code written against distributed key/value stores, losing the many
benefits of data independence in the process. Instead, we propose PIQL, a
declarative language that also provides scale independence by calculating an
upper bound on the number of key/value store operations that will be performed
for any query. Coupled with a service level objective (SLO) compliance
prediction model and PIQL's scalable database architecture, these bounds make
it easy for developers to write success-tolerant applications that support an
arbitrarily large number of users while still providing acceptable performance.
In this paper, we present the PIQL query processing system and evaluate its
scale independence on hundreds of machines using two benchmarks, TPC-W and
SCADr.
Comment: VLDB201
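Scale independence hinges on being able to compute, at compile time, an upper bound on key/value operations for any query. A toy bound calculator over a tiny plan language illustrates the idea; the plan encoding and the rule of reusing an operation bound as a cardinality bound are simplifications of ours, not PIQL's actual analysis:

```python
def ops_upper_bound(plan):
    """Upper-bound the key/value store operations for a simple plan:
    a lookup costs 1; an index scan costs its LIMIT (None = no LIMIT,
    hence unbounded); a nested join costs the outer's ops plus one inner
    probe per outer result (conservatively bounded by the outer's ops).
    Returns None for scale-dependent (unbounded) plans."""
    kind = plan[0]
    if kind == "lookup":
        return 1
    if kind == "scan":
        return plan[1]                       # LIMIT, or None if absent
    if kind == "join":
        outer = ops_upper_bound(plan[1])
        inner = ops_upper_bound(plan[2])
        if outer is None or inner is None:
            return None
        return outer + outer * inner
    raise ValueError("unknown operator: %r" % kind)
```

A query whose bound comes back None would be rejected (or flagged) at development time, which is how the language keeps developers from shipping queries that degrade as the user base grows.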
ActiveClean: Interactive Data Cleaning While Learning Convex Loss Models
Data cleaning is often an important step to ensure that predictive models,
such as regression and classification, are not affected by systematic errors
such as inconsistent, out-of-date, or outlier data. Identifying dirty data is
often a manual and iterative process, and can be challenging on large datasets.
However, many data cleaning workflows can introduce subtle biases into the
training processes due to violation of independence assumptions. We propose
ActiveClean, a progressive cleaning approach where the model is updated
incrementally instead of re-training and can guarantee accuracy on partially
cleaned data. ActiveClean supports a popular class of models called convex loss
models (e.g., linear regression and SVMs). ActiveClean also leverages the
structure of a user's model to prioritize cleaning those records likely to
affect the results. We evaluate ActiveClean on five real-world datasets (UCI
Adult, UCI EEG, MNIST, Dollars For Docs, and WorldBank) with both real and
synthetic errors. Our results suggest that our proposed optimizations can
improve model accuracy by up to 2.5x for the same amount of data cleaned.
Furthermore, for a fixed cleaning budget and on all real dirty datasets,
ActiveClean returns more accurate models than uniform sampling and active
learning.
Comment: Pre-print
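The progressive-cleaning loop can be pictured as: sample some still-dirty records, have them cleaned, and take a gradient step on the freshly cleaned data rather than retraining from scratch. A heavily simplified sketch for a convex loss (uniform sampling, no prioritization, and all names are ours, not ActiveClean's API):

```python
import random

def activeclean_step(theta, dirty, clean_fn, grad, lr=0.05, batch=2, rng=None):
    """One simplified ActiveClean-style iteration: sample a batch of
    dirty records, apply the cleaner, and update the convex model with
    an averaged gradient step on the cleaned records."""
    rng = rng or random.Random(0)
    idx = rng.sample(range(len(dirty)), min(batch, len(dirty)))
    cleaned = [clean_fn(dirty[i]) for i in idx]
    g = [0.0] * len(theta)
    for x, y in cleaned:
        for j, gj in enumerate(grad(theta, x, y)):
            g[j] += gj / len(cleaned)
    return [t - lr * gj for t, gj in zip(theta, g)]
```

For a 1-D squared loss with a cleaner that restores corrupted labels to y = 2x, repeated steps drive the weight toward 2 without ever cleaning the whole dataset up front; the full system additionally reweights samples to keep the updates unbiased and prioritizes the records most likely to change the model.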
Stale View Cleaning: Getting Fresh Answers from Stale Materialized Views
Materialized views (MVs), stored pre-computed results, are widely used to
facilitate fast queries on large datasets. When new records arrive at a high
rate, it is infeasible to continuously update (maintain) MVs and a common
solution is to defer maintenance by batching updates together. Between batches
the MVs become increasingly stale with incorrect, missing, and superfluous rows
leading to increasingly inaccurate query results. We propose Stale View
Cleaning (SVC) which addresses this problem from a data cleaning perspective.
In SVC, we efficiently clean a sample of rows from a stale MV, and use the
clean sample to estimate aggregate query results. While approximate, the
estimated query results reflect the most recent data. As sampling can be
sensitive to long-tailed distributions, we further explore an outlier indexing
technique to give increased accuracy when the data distributions are skewed.
SVC complements existing deferred maintenance approaches by giving accurate and
bounded query answers between maintenance cycles. We evaluate our method on a
generated dataset from the TPC-D benchmark and a real video distribution
application. Experiments confirm our theoretical results: (1) cleaning an MV
sample is more efficient than full view maintenance, (2) the estimated results
are more accurate than using the stale MV, and (3) SVC is applicable for a wide
variety of MVs
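The sample-based correction can be sketched with a difference estimator: keep the stale aggregate, clean a uniform sample of rows, and scale up the observed stale-vs-clean differences. This toy version covers only values updated in place (SVC also handles missing and superfluous rows) and all names are illustrative:

```python
def svc_estimate(stale_rows, sample_idx, clean_fn, agg_col):
    """Estimate a fresh SUM over `agg_col` from a stale MV via the
    difference estimator:
        fresh_sum ~= stale_sum + (N / n) * sum(clean_i - stale_i)
    where the correction is computed only on the cleaned sample."""
    N, n = len(stale_rows), len(sample_idx)
    stale_sum = sum(row[agg_col] for row in stale_rows)
    correction = sum(clean_fn(i)[agg_col] - stale_rows[i][agg_col]
                     for i in sample_idx)
    return stale_sum + (N / n) * correction
```

Because only `n` rows are cleaned instead of the whole view, this is far cheaper than full maintenance, and standard sampling theory gives confidence bounds on the estimate; the outlier index mentioned above exists to keep those bounds tight under skew.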
The Expected Optimal Labeling Order Problem for Crowdsourced Joins and Entity Resolution
In the SIGMOD 2013 conference, we published a paper extending our earlier
work on crowdsourced entity resolution to improve crowdsourced join processing
by exploiting transitive relationships [Wang et al. 2013]. A paper at the VLDB
2014 conference follows up on our previous work [Vesdapunt et al., 2014],
pointing out and correcting a mistake we made in our SIGMOD paper.
Specifically, in Section 4.2 of our SIGMOD paper, we defined the "Expected
Optimal Labeling Order" (EOLO) problem, and proposed an algorithm for solving
it. We incorrectly claimed that our algorithm is optimal. In their paper,
Vesdapunt et al. show that the problem is actually NP-Hard, and based on that
observation, propose a new algorithm to solve it. In this note, we would like
to put the Vesdapunt et al. results in context, something we believe that their
paper does not adequately do.
Comment: This is a note explaining an incorrect claim in our SIGMOD 2013 paper
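The transitivity being exploited is simple to state: once A=B and B=C are labeled, A=C comes for free, and A=B with B≠C implies A≠C, so the order in which pairs are asked (the EOLO problem) determines how many questions the crowd must answer. A small union-find sketch of the inference layer, assuming error-free crowd answers (the data structure and names are ours, not from either paper):

```python
class TransitiveLabels:
    """Track crowd match/non-match labels and deduce free answers."""
    def __init__(self, n):
        self.parent = list(range(n))   # union-find over record clusters
        self.neq = set()               # non-match edges between cluster roots

    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]  # path halving
            a = self.parent[a]
        return a

    def label(self, a, b, match):
        """Record a crowd answer for pair (a, b)."""
        ra, rb = self.find(a), self.find(b)
        if match:
            self.parent[ra] = rb
            # Rewrite non-match edges that referenced the absorbed root.
            self.neq = {frozenset(rb if r == ra else r for r in pair)
                        for pair in self.neq}
        else:
            self.neq.add(frozenset((ra, rb)))

    def infer(self, a, b):
        """True/False if deducible by transitivity, else None (must ask)."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True
        if frozenset((ra, rb)) in self.neq:
            return False
        return None
```

A labeling strategy would call `infer` before issuing each question and skip any pair whose answer is already forced; choosing the question order that minimizes the expected number of non-None calls is exactly the (NP-Hard) EOLO problem.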