Noise-tolerant approximate blocking for dynamic real-time entity resolution
Entity resolution is the process of identifying records in one or multiple data sources that represent the same real-world entity. This process needs to deal with noisy data that contain, for example, pronunciation or spelling errors. Many real-world applications require rapid responses to entity queries on dynamic datasets. This poses challenges for existing approaches, which are mainly aimed at batch matching of records in static data. Locality sensitive hashing (LSH) is an approximate blocking approach that hashes objects within a certain distance into the same block with high probability. How to make approximate blocking approaches scalable to large datasets and effective for entity resolution in real time remains an open question. Targeting this problem, we propose a noise-tolerant approximate blocking approach that indexes records based on their distance ranges, using LSH and sorting trees within large hash blocks. Experiments conducted on both synthetic and real-world datasets show the effectiveness of the proposed approach.
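As a rough illustration of the LSH blocking idea named in the abstract above, the following is a minimal sketch using MinHash signatures with banding, which is one common LSH family for set similarity. It is not the paper's method: the distance-range indexing and sorting trees are not reproduced, and all names here (`shingles`, `minhash_signature`, `lsh_blocks`, `candidate_pairs`) and parameter values are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=2):
    """Character k-grams of a lower-cased string; falls back to the whole string if it is too short."""
    s = text.lower()
    return {s[i:i + k] for i in range(len(s) - k + 1)} or {s}


def _hash(seed, token):
    # Deterministic 64-bit hash of (seed, token). hashlib is used because
    # Python's built-in hash() of strings is randomized per process.
    digest = hashlib.blake2b(f"{seed}:{token}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(tokens, num_hashes):
    """One minimum per seeded hash function; similar token sets tend to agree on many positions."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def lsh_blocks(records, num_bands=8, rows_per_band=2):
    """Map each (band index, band signature) key to the ids of the records hashed there."""
    blocks = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash_signature(shingles(text), num_bands * rows_per_band)
        for b in range(num_bands):
            band = sig[b * rows_per_band:(b + 1) * rows_per_band]
            blocks[(b, band)].add(rec_id)
    return blocks


def candidate_pairs(blocks):
    """All unordered id pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs


if __name__ == "__main__":
    # Hypothetical toy records; record 2 is a misspelling of record 1.
    toy = {1: "john smith", 2: "jon smith", 3: "mary jones"}
    print(sorted(candidate_pairs(lsh_blocks(toy))))
```

Identical records always land in the same blocks, while near-duplicates do so only with some probability that depends on the number of bands and rows per band. More bands of fewer rows make collisions more likely, at the cost of larger blocks.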
A Comparison of Blocking Methods for Record Linkage
Record linkage seeks to merge databases and to remove duplicates when unique
identifiers are not available. Most approaches use blocking techniques to
reduce the computational complexity associated with record linkage. We review
traditional blocking techniques, which typically partition the records
according to a set of field attributes, and consider two variants of a method
known as locality sensitive hashing, sometimes referred to as "private
blocking." We compare these approaches in terms of their recall, reduction
ratio, and computational complexity. We evaluate these methods using different
synthetic datafiles and conclude with a discussion of privacy-related issues.
Comment: 22 pages, 2 tables, 7 figures
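To make the evaluation criteria named in the abstract above concrete, here is a small sketch of traditional key-based blocking together with reduction ratio and recall. The definitions are the ones commonly used in the record-linkage literature (reduction ratio as the fraction of all record pairs that blocking avoids comparing, recall as the fraction of true matching pairs that survive blocking). The paper's exact formulation may differ, and the function names and toy data are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def key_blocks(records, key_fn):
    """Traditional blocking: partition record ids by a key derived from field attributes."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key_fn(rec)].append(rec_id)
    return blocks


def pairs_from_blocks(blocks):
    """Unordered candidate pairs: every pair of ids that share a block."""
    return {tuple(sorted(p)) for ids in blocks.values() for p in combinations(ids, 2)}


def reduction_ratio(num_records, candidates):
    """1 - (candidate pairs / all possible pairs) for deduplicating a single file."""
    total = num_records * (num_records - 1) // 2
    return 1.0 - len(candidates) / total if total else 0.0


def recall(candidates, true_matches):
    """Fraction of true matching pairs that are still among the candidates."""
    if not true_matches:
        return 1.0
    return len(candidates & true_matches) / len(true_matches)


if __name__ == "__main__":
    # Hypothetical toy file: records 1, 2 and 3 refer to the same person.
    toy = {
        1: {"surname": "smith", "zip": "2600"},
        2: {"surname": "smith", "zip": "2600"},
        3: {"surname": "smyth", "zip": "2600"},
        4: {"surname": "jones", "zip": "2913"},
    }
    truth = {(1, 2), (1, 3)}
    for name, key_fn in [("first letter", lambda r: r["surname"][0]),
                         ("exact surname", lambda r: r["surname"])]:
        cands = pairs_from_blocks(key_blocks(toy, key_fn))
        print(name, reduction_ratio(len(toy), cands), recall(cands, truth))
```

The two blocking keys show the usual trade-off: a stricter key removes more comparisons but can drop true matches when the key field itself is noisy.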
AutoBlock: A Hands-off Blocking Framework for Entity Matching
Entity matching seeks to identify data records over one or multiple data
sources that refer to the same real-world entity. Virtually every entity
matching task on large datasets requires blocking, a step that reduces the
number of record pairs to be matched. However, most of the traditional blocking
methods are learning-free and key-based, and their successes are largely built
on laborious human effort in cleaning data and designing blocking keys.
In this paper, we propose AutoBlock, a novel hands-off blocking framework for
entity matching, based on similarity-preserving representation learning and
nearest neighbor search. Our contributions include: (a) Automation: AutoBlock
frees users from laborious data cleaning and blocking key tuning. (b)
Scalability: AutoBlock has a sub-quadratic total time complexity and can be
easily deployed for millions of records. (c) Effectiveness: AutoBlock
outperforms a wide range of competitive baselines on multiple large-scale,
real-world datasets, especially when datasets are dirty and/or unstructured.
Comment: In The Thirteenth ACM International Conference on Web Search and Data Mining (WSDM '20), February 3-7, 2020, Houston, TX, USA. ACM, Anchorage, Alaska, USA, 9 pages
Advances in knowledge discovery and data mining Part II
19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II