1,250 research outputs found

Interactive Data Exploration with Smart Drill-Down

Authors: Hector Garcia-Molina, Manas Joglekar, Aditya Parameswaran
Publication date: 01/05/2016

We present smart drill-down, an operator for interactively exploring a relational table to discover and summarize "interesting" groups of tuples. Each group of tuples is described by a rule. For instance, the rule (a, b, ⋆, 1000) tells us that there are a thousand tuples with value a in the first column and b in the second column (and any value in the third column). Smart drill-down presents an analyst with a list of rules that together describe interesting aspects of the table. The analyst can tailor the definition of interesting, and can interactively apply smart drill-down on an existing rule to explore that part of the table. We demonstrate that the underlying optimization problems are NP-hard, and describe an algorithm for finding the approximately optimal list of rules to display when the user applies a smart drill-down, and a dynamic sampling scheme for efficiently interacting with large tables. Finally, we perform experiments on real datasets with our experimental prototype to demonstrate the usefulness of smart drill-down and study the performance of our algorithms.
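To make the rule notation concrete, here is a minimal sketch of how a rule's count could be computed over a small in-memory table. The names `STAR`, `matches`, `rule_count`, and the sample rows are illustrative assumptions, not part of the paper; the sketch only counts matching tuples and does not implement the paper's rule-selection or sampling algorithms.

```python
STAR = "*"  # hypothetical wildcard marker standing in for the star in a rule


def matches(rule, row):
    """Return True if every non-wildcard position of the rule equals the row's value."""
    return all(r == STAR or r == v for r, v in zip(rule, row))


def rule_count(rule, table):
    """Count the tuples in the table covered by the rule."""
    return sum(1 for row in table if matches(rule, row))


# Made-up three-column table used only for illustration.
table = [
    ("a", "b", "x"),
    ("a", "b", "y"),
    ("a", "c", "x"),
    ("d", "b", "x"),
]

# The rule (a, b, *) covers the first two rows of the sample table.
count_ab = rule_count(("a", "b", STAR), table)
```

In the paper's notation the count is carried as the last component of the rule, so `("a", "b", STAR)` with `count_ab` corresponds to a rule of the form (a, b, ⋆, count).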

arXiv.org e-Print Archive

Crossref

PubMed Central

eScholarship - University of California

Comprehensive and Reliable Crowd Assessment Algorithms

Authors: Hector Garcia-Molina, Manas Joglekar, Aditya Parameswaran
Publication date: 12/11/2014

Evaluating workers is a critical aspect of any crowdsourcing system. In this paper, we devise techniques for evaluating workers by finding confidence intervals on their error rates. Unlike prior work, we focus on "conciseness", that is, giving as tight a confidence interval as possible. Conciseness is of utmost importance because it allows us to be sure that we have the best guarantee possible on worker error rate. Also unlike prior work, we provide techniques that work under very general scenarios, such as when not all workers have attempted every task (a fairly common scenario in practice), when tasks have non-boolean responses, and when workers have different biases for positive and negative tasks. We demonstrate conciseness as well as accuracy of our confidence intervals by testing them on a variety of conditions and multiple real-world datasets. Comment: ICDE 201
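As a rough illustration of what a confidence interval on a worker's error rate looks like, the sketch below uses the standard Wilson score interval. This is a swapped-in textbook technique, not the method of the paper: it assumes that gold-standard answers are available so that errors can be counted directly, and the function name `wilson_interval` and its parameters are hypothetical.

```python
import math


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for an error rate, given `errors` mistakes in `n` tasks.

    Assumes each task is an independent trial with a known correct answer.
    z=1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Same observed error rate (10%), different amounts of evidence.
few_lo, few_hi = wilson_interval(1, 10)
many_lo, many_hi = wilson_interval(10, 100)
```

The interval for the worker with 100 tasks is narrower than the one for the worker with 10 tasks, which is the sense in which more evidence yields a tighter ("more concise") guarantee.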

arXiv.org e-Print Archive

Crossref

Consistency in a Partitioned Network: A Survey

Authors: Susan B. Davidson, Hector Garcia-Molina, Dale Skeen
Publication venue: ScholarlyCommons
Publication date: 01/01/1984

Recently, several strategies for transaction processing in partitioned distributed database systems with replicated data have been proposed. We survey these strategies in light of the competing goals of maintaining correctness and achieving high availability. Extensions and combinations are then discussed, and guidelines for the selection of a strategy for a particular application are presented.

Crossref

ScholarlyCommons@Penn

eCommons (Cornell Univ.)

Time as essence for photo browsing through personal digital libraries

Authors: Adrian Graham, Andreas Paepcke, Hector Garcia-Molina, Terry Winograd
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2004

Crossref

Joint Entity Resolution

Authors: Hector Garcia-Molina, Makoto Tachinaba
Publication date: 11/04/2020

Entity resolution (ER) is the process of matching records that represent the same real-world entity and then merging them. We consider the ER problem for two related datasets. In these datasets, a record in one can refer to a record in the other, and an ER process running on one set can affect an ER process on the other. We formalize the joint ER model for datasets that reference each other by treating the match and merge functions as black boxes. We identify important properties for match and merge functions that, if satisfied, allow much more efficient ER. We provide four algorithms that run entity resolution for a pair of datasets. We show that our parallel algorithms require shorter runtime than naive alternative algorithms. We also introduce improvements for our parallel algorithms which result in fewer feature comparisons.
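The idea of treating match and merge as black boxes can be sketched for a single dataset as follows. This is a simplified, generic match-and-merge loop, not one of the four joint algorithms from the paper; the function `resolve`, the record representation (frozensets of attribute/value pairs), and the example `same_email` and `union` functions are all assumptions made for illustration.

```python
def resolve(records, match, merge):
    """Repeatedly merge matching records until no two remaining records match.

    `match(a, b)` and `merge(a, b)` are caller-supplied black boxes.
    """
    resolved = []
    pending = list(records)
    while pending:
        current = pending.pop()
        for i, other in enumerate(resolved):
            if match(current, other):
                # Replace the matched record by the merged one and re-check it.
                del resolved[i]
                pending.append(merge(current, other))
                break
        else:
            # No match found: the record is (for now) its own entity.
            resolved.append(current)
    return resolved


def same_email(a, b):
    """Hypothetical match function: two records match if they share an email value."""
    emails_a = {v for k, v in a if k == "email"}
    emails_b = {v for k, v in b if k == "email"}
    return bool(emails_a & emails_b)


def union(a, b):
    """Hypothetical merge function: keep every attribute/value pair from both records."""
    return a | b


# Made-up records used only for illustration.
r1 = frozenset({("name", "Ann"), ("email", "a@x.org")})
r2 = frozenset({("name", "A. Smith"), ("email", "a@x.org")})
r3 = frozenset({("name", "Bob"), ("email", "b@x.org")})

entities = resolve([r1, r2, r3], same_email, union)
```

Because `resolve` never inspects the records itself, any pair of match and merge functions can be plugged in; the properties the abstract refers to would constrain those functions, not this loop.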

CiteSeerX

Indexing boolean expressions.

Authors: Chad Brower, Erik Vee, Hector Garcia-Molina, Jayavel Shanmugasundaram, Ramana Yerneni, Sergei Vassilvitskii, Steven Euijong Whang
Publication date: 01/01/2009

We consider the problem of efficiently indexing Disjunctive Normal Form (DNF) and Conjunctive Normal Form (CNF) Boolean expressions over a high-dimensional multi-valued attribute space. The goal is to rapidly find the set of Boolean expressions that evaluate to true for a given assignment of values to attributes. A solution to this problem has applications in online advertising (where a Boolean expression represents an advertiser's user targeting requirements, and an assignment of values to attributes represents the characteristics of a user visiting an online page) and in general any publish/subscribe system (where a Boolean expression represents a subscription, and an assignment of values to attributes represents an event). All existing solutions that we are aware of can only index a specialized subset of conjunctive and/or disjunctive expressions, and cannot efficiently handle general DNF and CNF expressions (including NOTs) over multi-valued attributes. In this paper, we present a novel solution based on the inverted list data structure that enables us to index arbitrarily complex DNF and CNF Boolean expressions over multi-valued attributes. An interesting aspect of our solution is that, by virtue of leveraging inverted lists traditionally used for ranked information retrieval, we can efficiently return the top-N matching Boolean expressions. This capability enables emerging applications such as ranked publish/subscribe systems.
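The inverted-list idea can be illustrated with a deliberately simplified sketch that indexes only non-empty conjunctions of equality predicates (no NOTs, no disjunctions, no ranking). It uses a basic counting scheme and is not the algorithm from the paper; `build_index`, `matching`, and the sample ads are hypothetical names and data.

```python
from collections import defaultdict


def build_index(expressions):
    """Build an inverted list from each (attribute, value) predicate to expression ids.

    `expressions` maps an id to a non-empty set of (attribute, value) pairs,
    interpreted as a conjunction of equality predicates.
    """
    index = defaultdict(list)
    sizes = {}
    for expr_id, conjunction in expressions.items():
        sizes[expr_id] = len(conjunction)
        for predicate in conjunction:
            index[predicate].append(expr_id)
    return index, sizes


def matching(index, sizes, assignment):
    """Return the sorted ids of conjunctions fully satisfied by the assignment."""
    hits = defaultdict(int)
    for predicate in assignment.items():
        for expr_id in index.get(predicate, ()):
            hits[expr_id] += 1
    # A conjunction matches only if every one of its predicates was hit.
    return sorted(e for e, n in hits.items() if n == sizes[e])


# Made-up targeting expressions and user attributes, for illustration only.
ads = {
    "ad1": {("age", "30"), ("state", "CA")},
    "ad2": {("state", "CA")},
    "ad3": {("age", "30"), ("state", "NY")},
}
index, sizes = build_index(ads)
user = {"age": "30", "state": "CA"}
matched = matching(index, sizes, user)
```

Only the posting lists for predicates present in the assignment are scanned, so expressions that share no predicate with the user are never touched; this is the basic efficiency argument for inverted lists in this setting.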

CiteSeerX