
    Generic Entity Resolution with Data Confidences

    We consider the Entity Resolution (ER) problem (also known as deduplication, or merge-purge), in which records determined to represent the same real-world entity are successively located and merged. Our approach to the ER problem is generic, in the sense that the functions for comparing and merging records are viewed as black-boxes. In this context, managing numerical confidences along with the data makes the ER problem more challenging to define (e.g., how should confidences of merged records be combined?), and more expensive to compute. In this paper, we propose a sound and flexible model for the ER problem with confidences, and propose efficient algorithms to solve it. We validate our algorithms through experiments that show significant performance improvements over naive schemes.
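As a concrete illustration of the black-box setting, the sketch below runs a naive fixpoint ER loop over toy records. The `match` and `merge` functions here (equality on a `name` field; multiplying confidences) are hypothetical placeholders, and multiplication is only one of the confidence-combination policies the abstract leaves open:

```python
from itertools import combinations

def match(r1, r2):
    # Hypothetical black-box matcher: records match on equal names.
    return r1["name"] == r2["name"]

def merge(r1, r2):
    # Hypothetical black-box merger: union the attributes and combine
    # confidences multiplicatively (one possible policy among many).
    merged = {**r1, **r2}
    merged["name"] = r1["name"]
    merged["conf"] = r1["conf"] * r2["conf"]
    return merged

def resolve(records):
    """Naive generic ER: merge any matching pair until a fixpoint."""
    records = list(records)
    changed = True
    while changed:
        changed = False
        for i, j in combinations(range(len(records)), 2):
            if match(records[i], records[j]):
                merged = merge(records[i], records[j])
                records = [r for k, r in enumerate(records) if k not in (i, j)]
                records.append(merged)
                changed = True
                break  # restart the scan over the updated record set
    return records
```

The quadratic rescan is what makes the naive scheme expensive; the paper's algorithms are aimed precisely at avoiding this kind of repeated work.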

    Entity resolution with iterative blocking

    Entity Resolution (ER) is the problem of identifying which records in a database refer to the same real-world entity. An exhaustive ER process involves computing the similarities between pairs of records, which can be very expensive for large datasets. Various blocking techniques can be used to enhance the performance of ER by dividing the records into blocks in multiple ways and only comparing records within the same block. However, most blocking techniques process blocks separately and do not exploit the results of other blocks. In this paper, we propose an iterative blocking framework where the ER results of blocks are reflected to subsequently processed blocks. Blocks are now iteratively processed until no block contains any more matching records. Compared to simple blocking, iterative blocking may achieve higher accuracy because reflecting the ER results of blocks to other blocks may generate additional record matches. Iterative blocking may also be more efficient because processing a block now saves the processing time for other blocks. We implement a scalable iterative blocking system and demonstrate that iterative blocking can be more accurate and efficient than blocking for large datasets.
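The propagation effect can be sketched in a few lines. The modeling choices below are illustrative assumptions: a record is a set of identifiers, each identifier doubles as a blocking key, and any two records sharing a key are taken to match. Merging {1,2} with {2,3} in one block is what later enables the merge with {3,4} in another block, which is why the outer loop must iterate:

```python
from collections import defaultdict

def iterative_blocking(records):
    """Re-block and re-process until no block yields a new merge.
    Records are frozensets of identifiers; sharing an identifier
    counts as a match, and merging takes the union (both assumptions)."""
    records = [frozenset(r) for r in records]
    changed = True
    while changed:
        changed = False
        # Blocking: each identifier is a key, so records land in
        # several blocks.
        blocks = defaultdict(set)
        for r in records:
            for k in r:
                blocks[k].add(r)
        # Process every block; merges are visible to later blocks
        # because we check membership in the live record list.
        for block in blocks.values():
            live = list(block)
            for i in range(len(live)):
                for j in range(i + 1, len(live)):
                    r1, r2 = live[i], live[j]
                    if r1 in records and r2 in records:
                        # Two records in the same block share its key,
                        # so the (assumed) matcher accepts them.
                        records.remove(r1)
                        records.remove(r2)
                        records.append(r1 | r2)
                        changed = True
    return records
```

On input {1,2}, {3,4}, {2,3}, {5,6}, a single pass cannot connect {3,4} to {1,2}; the second iteration, seeded with the merged {1,2,3}, can.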

    Fuzzy Joins Using MapReduce

    Fuzzy/similarity joins have been widely studied in the research community and extensively used in real-world applications. This paper proposes and evaluates several algorithms for finding all pairs of elements from an input set that meet a similarity threshold. The computation model is a single MapReduce job. Because we allow only one MapReduce round, the Reduce function must be designed so a given output pair is produced by only one task; for many algorithms, satisfying this condition is one of the biggest challenges. We break the cost of an algorithm into three components: the execution cost of the mappers, the execution cost of the reducers, and the communication cost from the mappers to reducers. The algorithms are presented first in terms of Hamming distance, but extensions to edit distance and Jaccard distance are shown as well. We find that there are many different approaches to the similarity-join problem using MapReduce, and none dominates the others when both communication and reducer costs are considered. Our cost analyses enable applications to pick the optimal algorithm based on their communication, memory, and cluster requirements.
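To make the exactly-once condition concrete, here is an in-memory simulation of one splitting-style scheme for Hamming distance 1 (the map/reduce phases are simulated with a dictionary, and distinct, equal-length bit strings are assumed). Two strings within distance 1 agree on exactly one of their two halves, so exactly one reducer key brings them together and each output pair is produced once:

```python
from collections import defaultdict
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def similarity_join(strings):
    """One simulated MapReduce round: split each string into two halves;
    strings at Hamming distance 1 share exactly one half, so exactly one
    reducer emits the pair."""
    half = len(strings[0]) // 2
    # Map phase: key = (half index, half contents), value = whole string.
    groups = defaultdict(list)
    for s in strings:
        groups[(0, s[:half])].append(s)
        groups[(1, s[half:])].append(s)
    # Reduce phase: compare only within a group. A plain list (not a
    # set) shows that no pair is produced by two reducers.
    pairs = []
    for bucket in groups.values():
        for a, b in combinations(bucket, 2):
            if hamming(a, b) == 1:
                pairs.append(tuple(sorted((a, b))))
    return pairs
```

The three cost components are visible even in this toy: each mapper emits two key-value pairs (communication), and reducer cost depends on how skewed the half-value buckets are.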

    Generic Entity Resolution in the SERF Project

    The SERF project at Stanford deals with the Entity Resolution (ER) problem, in which records determined to represent the same real-life “entities” (such as people or products) are successively located and combined. The approach we pursue is “generic”, in the sense that the specific functions used to match and merge records are viewed as black boxes, which permits efficient, expressive and extensible ER solutions. This paper motivates and introduces the principles of generic ER, and gives an overview of the research directions we have been exploring in the SERF project over the past two years.
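A sketch in the spirit of the R-Swoosh algorithm from the SERF papers shows how little the core loop needs to know about the data: `match` and `merge` stay opaque, and a merge result re-enters the work list so it can absorb further records. The dict-shaped records and the specific match/merge rules below are illustrative assumptions, not the project's actual functions:

```python
def r_swoosh(records, match, merge):
    """Generic ER with black-box match/merge: compare each pending
    record against the resolved set; a merged record goes back on the
    work list so it can trigger further merges."""
    todo, done = list(records), []
    while todo:
        r = todo.pop()
        partner = next((s for s in done if match(r, s)), None)
        if partner is None:
            done.append(r)
        else:
            done.remove(partner)
            todo.append(merge(r, partner))
    return done

def match(r, s):
    # Hypothetical matcher: same person if they share a name or phone.
    return bool(r["names"] & s["names"]) or bool(r["phones"] & s["phones"])

def merge(r, s):
    # Hypothetical merger: keep the union of all observed values.
    return {"names": r["names"] | s["names"],
            "phones": r["phones"] | s["phones"]}
```

Because merged records re-enter the work list, "J. Doe / 555-1234" can be linked to "John Doe / 555-9999" even though the two raw records share no attribute value, via the intermediate "John Doe / 555-1234" record.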