Interactive Data Exploration with Smart Drill-Down
We present smart drill-down, an operator for interactively exploring a
relational table to discover and summarize "interesting" groups of tuples. Each
group of tuples is described by a rule. For instance, the rule tells us that there are a thousand tuples with value in the
first column and in the second column (and any value in the third column).
Smart drill-down presents an analyst with a list of rules that together
describe interesting aspects of the table. The analyst can tailor the
definition of interesting, and can interactively apply smart drill-down on an
existing rule to explore that part of the table. We demonstrate that the
underlying optimization problems are NP-Hard, and describe an algorithm
for finding the approximately optimal list of rules to display when the user
uses a smart drill-down, and a dynamic sampling scheme for efficiently
interacting with large tables. Finally, we perform experiments on real datasets
on our experimental prototype to demonstrate the usefulness of smart drill-down
and study the performance of our algorithms.
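The notion of a rule with wildcard columns can be illustrated with a minimal Python sketch. The table contents, the use of `None` as the "any value" wildcard, and the helper names `matches`, `rule_count`, and `drill_down` are assumptions made for illustration; this is not the paper's ranking algorithm or sampling scheme, only the counting step that a rule summarizes.

```python
from typing import List, Optional, Sequence, Tuple

Row = Tuple[str, ...]
Rule = Tuple[Optional[str], ...]  # None stands for "any value" in that column


def matches(rule: Rule, row: Row) -> bool:
    """A row is covered when every non-wildcard column agrees with the rule."""
    return all(r is None or r == v for r, v in zip(rule, row))


def rule_count(rule: Rule, table: Sequence[Row]) -> int:
    """Number of tuples in the table that the rule describes."""
    return sum(1 for row in table if matches(rule, row))


def drill_down(rule: Rule, table: Sequence[Row]) -> List[Row]:
    """Restrict the table to the tuples covered by a rule, for further exploration."""
    return [row for row in table if matches(rule, row)]


# Hypothetical three-column table.
table = [
    ("a", "b", "x"),
    ("a", "b", "y"),
    ("a", "c", "x"),
    ("d", "b", "x"),
]
rule = ("a", "b", None)  # fixed first and second column, any third column
count = rule_count(rule, table)
subset = drill_down(rule, table)
```

Applying `drill_down` to a rule and then counting more specific rules on the result mirrors, in a very reduced form, the interactive exploration the abstract describes.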
Comprehensive and Reliable Crowd Assessment Algorithms
Evaluating workers is a critical aspect of any crowdsourcing system. In this
paper, we devise techniques for evaluating workers by finding confidence
intervals on their error rates. Unlike prior work, we focus on
"conciseness", that is, giving as tight a confidence interval as possible.
Conciseness is of utmost importance because it allows us to be sure that we
have the best guarantee possible on worker error rate. Also unlike prior work,
we provide techniques that work under very general scenarios, such as when not
all workers have attempted every task (a fairly common scenario in practice),
when tasks have non-boolean responses, and when workers have different biases
for positive and negative tasks. We demonstrate conciseness as well as accuracy
of our confidence intervals by testing them on a variety of conditions and
multiple real-world datasets. Comment: ICDE 201
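As a point of reference for what a confidence interval on a worker's error rate looks like, here is a short Python sketch using the standard Wilson score interval. This is a generic textbook technique swapped in for illustration, not the method of the paper; it assumes each worker's errors on tasks with known answers have already been counted, and the function name `wilson_interval` is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(errors: int, attempts: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for an error rate (z = 1.96 gives roughly 95%)."""
    if attempts <= 0 or not 0 <= errors <= attempts:
        raise ValueError("need attempts > 0 and 0 <= errors <= attempts")
    p = errors / attempts
    z2 = z * z
    denom = 1 + z2 / attempts
    center = (p + z2 / (2 * attempts)) / denom
    half = z * math.sqrt(p * (1 - p) / attempts + z2 / (4 * attempts ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Two hypothetical workers with the same observed error rate
# but a different number of attempted tasks.
few = wilson_interval(10, 100)
many = wilson_interval(100, 1000)
```

The worker with more attempted tasks receives the tighter interval, which is why the number of tasks each worker has attempted matters when not all workers attempt every task.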
Consistency in a Partitioned Network: A Survey
Recently, several strategies for transaction processing in partitioned distributed database systems with replicated data have been proposed. We survey these strategies in light of the competing goals of maintaining correctness and achieving high availability. Extensions and combinations are then discussed, and guidelines for the selection of a strategy for a particular application are presented.
Joint Entity Resolution
Entity resolution (ER) is the process of matching records that represent the same real-world entity and then merging them. We consider the ER problem for two related datasets. In the datasets, a record in one can refer to a record in the other and an ER process running on one set can affect an ER process on the other. We formalize the joint ER model for datasets which reference each other by treating the match and merge functions as black boxes. We identify important properties for match and merge functions that, if satisfied, allow much more efficient ER. We provide four algorithms that run Entity Resolution for a pair of datasets. We show that our parallel algorithms require shorter runtime than naive alternate algorithms. We also introduce improvements for our parallel algorithms which result in fewer feature comparisons.
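Treating match and merge as black boxes can be sketched for a single dataset as follows. This is a generic match-then-merge loop, not one of the paper's four joint algorithms; the record representation (a frozen set of attribute values) and the sample `match` and `merge` functions are assumptions chosen so the example is self-contained.

```python
from typing import Callable, FrozenSet, Iterable, List

Record = FrozenSet[str]


def resolve(
    records: Iterable[Record],
    match: Callable[[Record, Record], bool],
    merge: Callable[[Record, Record], Record],
) -> List[Record]:
    """Repeatedly merge matching records until no two resolved records match."""
    resolved: List[Record] = []
    pending = list(records)
    while pending:
        current = pending.pop()
        partner = next((r for r in resolved if match(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            # The merged record may match others, so it is re-examined.
            resolved.remove(partner)
            pending.append(merge(current, partner))
    return resolved


# Hypothetical black boxes: records match if they share any value,
# and merging takes the union of their values.
def share_value(a: Record, b: Record) -> bool:
    return bool(a & b)


def union(a: Record, b: Record) -> Record:
    return a | b


people = [
    frozenset({"alice", "a@x.org"}),
    frozenset({"a@x.org", "555"}),
    frozenset({"bob"}),
]
result = resolve(people, share_value, union)
```

With these particular functions the loop terminates because every merge reduces the total number of records; for arbitrary black boxes, termination and order-independence depend on properties of `match` and `merge`, which is the kind of condition the abstract refers to.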
Indexing boolean expressions.
We consider the problem of efficiently indexing Disjunctive Normal Form (DNF) and Conjunctive Normal Form (CNF) Boolean expressions over a high-dimensional multi-valued attribute space. The goal is to rapidly find the set of Boolean expressions that evaluate to true for a given assignment of values to attributes. A solution to this problem has applications in online advertising (where a Boolean expression represents an advertiser's user targeting requirements, and an assignment of values to attributes represents the characteristics of a user visiting an online page) and in general any publish/subscribe system (where a Boolean expression represents a subscription, and an assignment of values to attributes represents an event). All existing solutions that we are aware of can only index a specialized subset of conjunctive and/or disjunctive expressions, and cannot efficiently handle general DNF and CNF expressions (including NOTs) over multi-valued attributes. In this paper, we present a novel solution based on the inverted list data structure that enables us to index arbitrarily complex DNF and CNF Boolean expressions over multi-valued attributes. An interesting aspect of our solution is that, by virtue of leveraging inverted lists traditionally used for ranked information retrieval, we can efficiently return the top-N matching Boolean expressions. This capability enables emerging applications such as ranked publish/subscribe systems.
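The basic idea of matching an assignment against inverted lists can be sketched in Python for the simplest case only: conjunctions of positive "attribute IN values" predicates. This is a deliberately reduced illustration and not the paper's algorithm; it does not handle NOTs, general DNF or CNF, or top-N ranking, and the class name `ConjunctionIndex` and the sample expressions are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


class ConjunctionIndex:
    """Inverted lists from (attribute, value) to the conjunctions that accept it."""

    def __init__(self) -> None:
        self._postings: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
        self._sizes: Dict[str, int] = {}

    def add(self, expr_id: str, predicates: Dict[str, Set[str]]) -> None:
        """Index a conjunction given as attribute -> set of accepted values."""
        if not predicates:
            raise ValueError("a conjunction needs at least one predicate")
        self._sizes[expr_id] = len(predicates)
        for attr, values in predicates.items():
            for value in values:
                self._postings[(attr, value)].add(expr_id)

    def match(self, assignment: Dict[str, str]) -> Set[str]:
        """Return the conjunctions whose predicates are all satisfied."""
        hits: Dict[str, int] = defaultdict(int)
        for attr, value in assignment.items():
            # An assignment has one value per attribute, so each predicate
            # of an expression contributes at most one hit.
            for expr_id in self._postings.get((attr, value), ()):
                hits[expr_id] += 1
        return {e for e, n in hits.items() if n == self._sizes[e]}


# Hypothetical targeting expressions.
index = ConjunctionIndex()
index.add("ad1", {"age": {"3"}, "state": {"CA"}})
index.add("ad2", {"age": {"3", "4"}})
index.add("ad3", {"state": {"NY"}})
matched = index.match({"age": "3", "state": "CA"})
```

Only the posting lists for the attribute-value pairs present in the assignment are scanned, so expressions that share no pair with the assignment are never touched.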