6 research outputs found

Noise-tolerant approximate blocking for dynamic real-time entity resolution

Author: Christen Peter
Gayler Ross
Liang Huizhi
Wang Yanzhe
Publication date: 01/01/2014

Entity resolution is the process of identifying records in one or multiple data sources that represent the same real-world entity. This process needs to deal with noisy data that contain, for example, wrong pronunciations or spelling errors. Many real-world applications require rapid responses to entity queries on dynamic datasets. This poses challenges for existing approaches, which are mainly aimed at the batch matching of records in static data. Locality sensitive hashing (LSH) is an approximate blocking approach that hashes objects within a certain distance into the same block with high probability. How to make approximate blocking approaches scalable to large datasets and effective for entity resolution in real time remains an open question. Targeting this problem, we propose a noise-tolerant approximate blocking approach that indexes records based on their distance ranges, using LSH and sorting trees within large hash blocks. Experiments conducted on both synthetic and real-world datasets show the effectiveness of the proposed approach.
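The LSH idea described in this abstract can be illustrated with a minimal sketch. It uses plain MinHash signatures over character bigrams, split into bands, so that records with similar strings tend to share at least one block. This is a generic textbook construction, not the authors' method (it has no distance ranges and no sorting trees), and the function names, the bigram size, and the band settings are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, q=2):
    """Return the set of character q-grams of a lowercased string."""
    s = text.lower()
    return {s[i:i + q] for i in range(max(len(s) - q + 1, 1))}


def minhash_signature(grams, num_hashes):
    """Build a MinHash signature: one minimum per seeded hash function."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{g}".encode()).digest()[:8], "big")
            for g in grams
        ))
    return tuple(signature)


def lsh_blocks(records, num_bands=4, rows_per_band=2):
    """Map each (band index, band values) key to the record ids hashed there.

    `records` is a dict of record id -> string (a hypothetical input shape).
    """
    blocks = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash_signature(shingles(text), num_bands * rows_per_band)
        for b in range(num_bands):
            band = sig[b * rows_per_band:(b + 1) * rows_per_band]
            blocks[(b, band)].add(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Collect every unordered pair of record ids that shares a block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = {1: "peter christen", 2: "peter christen", 3: "huizhi liang"}
pairs = candidate_pairs(lsh_blocks(records))
print((1, 2) in pairs)  # identical strings always share every band
```

Using more bands with fewer rows per band makes it more likely that noisy variants of the same name collide, at the cost of larger blocks. That growth in block size is the scalability problem the abstract refers to.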

Central Archive at the University of Reading

Crossref

A Comparison of Blocking Methods for Record Linkage

Author: A. Goldenberg
D. Vatsalan
H. Liang
L. Paulevé
M. Kuzu
P. Christen
R. Hall
S. Fortunato
T. Herzog
Publication date: 01/01/2014

Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality sensitive hashing, sometimes referred to as "private blocking." We compare these approaches in terms of their recall, reduction ratio, and computational complexity. We evaluate these methods using different synthetic datafiles and conclude with a discussion of privacy-related issues. Comment: 22 pages, 2 tables, 7 figures
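The two quality measures named in this abstract are commonly defined as follows: recall is the fraction of true matching pairs that survive blocking, and reduction ratio is the fraction of all possible record pairs that blocking removes. The sketch below assumes deduplication within a single file of `num_records` records. The function name and argument shapes are hypothetical, and the paper's exact definitions may differ in detail.

```python
def blocking_quality(candidate_pairs, true_matches, num_records):
    """Return (recall, reduction_ratio) for a blocking result.

    candidate_pairs: iterable of (id_a, id_b) pairs kept by blocking
    true_matches:    iterable of (id_a, id_b) pairs known to be matches
    num_records:     number of records in the single file being deduplicated
    """
    # Normalise pair order so (a, b) and (b, a) count as the same pair.
    candidates = {tuple(sorted(p)) for p in candidate_pairs}
    truth = {tuple(sorted(p)) for p in true_matches}

    total_pairs = num_records * (num_records - 1) // 2
    recall = len(candidates & truth) / len(truth) if truth else 1.0
    reduction_ratio = 1 - len(candidates) / total_pairs if total_pairs else 0.0
    return recall, reduction_ratio


# Four records give 6 possible pairs; blocking keeps 2, one of which is a true match.
recall, rr = blocking_quality({(1, 2), (3, 4)}, {(2, 1), (2, 3)}, num_records=4)
print(recall, round(rr, 3))  # 0.5 0.667
```

The two measures pull against each other: larger blocks raise recall but lower the reduction ratio, which is why blocking methods are compared on both.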

arXiv.org e-Print Archive

Crossref

AutoBlock: A Hands-off Blocking Framework for Entity Matching

Author: Dong Xin Luna
Faloutsos Christos
Page David
Sisman Bunyamin
Wei Hao
Zhang Wei
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 06/12/2019

Entity matching seeks to identify data records over one or multiple data sources that refer to the same real-world entity. Virtually every entity matching task on large datasets requires blocking, a step that reduces the number of record pairs to be matched. However, most of the traditional blocking methods are learning-free and key-based, and their successes are largely built on laborious human effort in cleaning data and designing blocking keys. In this paper, we propose AutoBlock, a novel hands-off blocking framework for entity matching, based on similarity-preserving representation learning and nearest neighbor search. Our contributions include: (a) Automation: AutoBlock frees users from laborious data cleaning and blocking key tuning. (b) Scalability: AutoBlock has a sub-quadratic total time complexity and can be easily deployed for millions of records. (c) Effectiveness: AutoBlock outperforms a wide range of competitive baselines on multiple large-scale, real-world datasets, especially when datasets are dirty and/or unstructured. Comment: In The Thirteenth ACM International Conference on Web Search and Data Mining (WSDM '20), February 3-7, 2020, Houston, TX, USA. ACM, Anchorage, Alaska, USA, 9 pages

arXiv.org e-Print Archive

Crossref

Noise-Tolerant Approximate Blocking for Dynamic Real-Time Entity Resolution

Author: B. Ramadan
P. Christen
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014

Crossref

Noise-tolerant approximate blocking for dynamic real-time entity resolution

Author: Christen Peter
Gayler Ross
Liang Huizhi (Elly)
Wang Yuanzhi (Derek)
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 11/12/2015

The Australian National University

Advances in knowledge discovery and data mining Part II

Author: CAO Tru
CHEUNG David Wai-Lok
HO Tu-Bao
LIM Ee Peng
MOTODA Hiroshi
ZHOU Zhi-Hua
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II

Institutional Knowledge at Singapore Management University

HKU Scholars Hub