308 research outputs found

INVESTIGATION OF TECHNIQUES FOR EFFICIENT & ACCURATE INDEXING FOR SCALABLE RECORD LINKAGE & DEDUPLICATION

Author: LAKSHMAIAH K.
YEDDULA SUNITHA
Publication venue: Institute for Project Management Pvt. Ltd
Publication date: 07/09/2020

Record linkage is the process of matching records from several databases that refer to the same entities. When applied to a single database, this process is known as deduplication. Matched data are becoming increasingly important in many application areas, because they can contain information that is not available otherwise, or that is too costly to acquire. Removing duplicate records in a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today’s databases, the complexity of the matching process becomes one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They aim to reduce the number of record pairs to be compared in the matching process by removing obvious nonmatching pairs, while at the same time maintaining high matching quality. This paper presents a survey of variations of six indexing techniques. Their complexity is analyzed, and their performance and scalability are evaluated within an experimental framework using both synthetic and real data sets. These experiments highlight that one of the most important factors for efficient and accurate indexing for record linkage and deduplication is the proper definition of blocking keys.
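To illustrate how an indexing step cuts down the number of record pairs, the following is a minimal sketch of standard blocking, one common indexing technique. It is not taken from the paper: the field names (`surname`, `postcode`), the blocking key definition, and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking key: first three letters of the surname plus the postcode."""
    return (record["surname"][:3].lower(), record["postcode"])


def candidate_pairs(records):
    """Group record ids by blocking key and return only the within-block pairs."""
    blocks = defaultdict(list)
    for rec_id, record in records.items():
        blocks[blocking_key(record)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        # Records in different blocks are never compared.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Illustrative records (invented for this sketch).
records = {
    1: {"surname": "Smith", "postcode": "2600"},
    2: {"surname": "Smithe", "postcode": "2600"},
    3: {"surname": "Jones", "postcode": "2600"},
    4: {"surname": "Smyth", "postcode": "2600"},
}

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pair(s) instead of {all_pairs}: {sorted(pairs)}")
```

With this key, only records 1 and 2 are compared, so the number of comparisons drops from six to one. Record 4 ("Smyth") lands in a different block and a possible match is missed, which shows why the definition of the blocking key affects both efficiency and matching quality.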

Interscience Research Network

Cloud-Scale Entity Resolution: Current State and Open Challenges

Author: Eike Schallehn
Gunter Saake
Xiao Chen
Publication venue: RonPub
Publication date: 01/01/2018

Entity resolution (ER) is the process of identifying records in information systems that refer to the same real-world entity. Because data volumes have grown so large over the past two decades, parallel techniques are called upon to satisfy the ER requirements of high performance and scalability. The development of parallel ER has reached a relatively mature stage and has found its way into several applications. In this work, we first comprehensively survey the state of the art of parallel ER approaches. From this overview, we then extract classification criteria for parallel ER, and classify and compare the approaches based on these criteria. Finally, we identify open research questions and challenges, and discuss potential solutions and directions for further research in this field.

RonPub -- Research Online Publishing

WEB-BASED DUPLICATE RECORDS DETECTION WITH ARABIC LANGUAGE ENHANCEMENT

Author: Abd Al-Elah Higazy, Azza
El-Tobely, Tarek E.
Sarhan, Amany M.
Publication venue: Arab Journals Platform
Publication date: 03/10/2023

Sharing data between organizations is of growing importance in many data mining projects. Data from various heterogeneous sources often have to be linked and aggregated in order to improve data quality. The importance of data accuracy and quality has increased with the explosion of data size. The first step in ensuring data accuracy is to make sure that each real-world object is represented once and only once in a given dataset, a task called Duplicate Record Detection (DRD). Data inaccuracy problems arise from several factors, including spelling, typographical, and pronunciation variations, dialects, special vowel and consonant distinctions, and other linguistic characteristics, especially in non-Latin languages such as Arabic. In this paper, an English/Arabic-enabled web-based framework is designed and implemented. It lets users add new rules, enrich the dictionary, and evaluate results, which is an important step in improving the system’s behavior. The proposed framework can process both single-language and bilingual datasets. It is implemented and verified empirically in several case studies. The comparison results showed that the proposed system offers substantial improvements over known tools.

Arab Journals Platform

Secure and scalable deduplication of horizontally partitioned health data for privacy-preserving distributed statistical computation

Author: A Beimel
A Geissbuhler
AF Karr
AL Potosky
Antonis Michalas
AS Lunde
B Pinkas
BA Malin
BA Stewart
BH Bloom
C Clifton
C Friedman
C Quantin
D Vatsalan
EA Durham
G Cormode
G Hripcsak
GM Weber
IS Kohane
J Gichoya
J Vaidya
JF Ludvigsson
JH Holmes
JL Warren
Johan Gustav Bellika
JT Finnell
K El Emam
Kassaye Yitbarek Yigzaw
L Fan
L Lenert
LH Curtis
M Kantarcioglu
MA Hailemichael
MA Hernández
MK Ross
O Goldreich
P Christen
P Paillier
P Saint-Andre
R Cramer
R Lazarus
R Schnell
RL Richesson
S Tarkoma
SC Pohlig
SM Randall
T Dimitriou
W Du
WB Lober
Y Lindell
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2017

Background: Techniques have been developed to compute statistics on distributed datasets without revealing private information except the statistical results. However, duplicate records in a distributed dataset may lead to incorrect statistical results. Therefore, to increase the accuracy of the statistical analysis of a distributed dataset, secure deduplication is an important preprocessing step.
Methods: We designed a secure protocol for the deduplication of horizontally partitioned datasets with deterministic record linkage algorithms. We provided a formal security analysis of the protocol in the presence of semi-honest adversaries. The protocol was implemented and deployed across three microbiology laboratories located in Norway, and we ran experiments on the datasets in which the number of records for each laboratory varied. Experiments were also performed on simulated microbiology datasets and data custodians connected through a local area network.
Results: The security analysis demonstrated that the protocol protects the privacy of individuals and data custodians under a semi-honest adversarial model. More precisely, the protocol remains secure with the collusion of up to N − 2 corrupt data custodians. The total runtime for the protocol scales linearly with the addition of data custodians and records. One million simulated records distributed across 20 data custodians were deduplicated within 45 s. The experimental results showed that the protocol is more efficient and scalable than previous protocols for the same problem.
Conclusions: The proposed deduplication protocol is efficient and scalable for practical uses while protecting the privacy of patients and data custodians.
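As a rough illustration of what deterministic record linkage over horizontally partitioned data means, the sketch below deduplicates records held by several custodians by exact match on a linkage key. It is not the protocol described in the abstract and provides no privacy protection at all: records are compared in the clear by a single party. The field names (`national_id`, `birth_date`), the choice of linkage key, and the sample data are assumptions made for illustration.

```python
def linkage_key(record):
    """Hypothetical deterministic linkage key: exact match on two identifier fields."""
    return (record["national_id"].strip(), record["birth_date"].strip())


def deduplicate(partitions):
    """Keep the first record seen for each linkage key across all custodians.

    NOTE: plain-text comparison, shown only to illustrate deterministic
    linkage; it offers none of the security properties of a secure protocol.
    """
    seen = set()
    unique = []
    duplicates = 0
    for custodian, records in partitions.items():
        for record in records:
            key = linkage_key(record)
            if key in seen:
                duplicates += 1
            else:
                seen.add(key)
                unique.append((custodian, record))
    return unique, duplicates


# Horizontally partitioned data: same schema, different rows per custodian.
partitions = {
    "lab_a": [
        {"national_id": "111", "birth_date": "1980-01-01"},
        {"national_id": "222", "birth_date": "1975-05-05"},
    ],
    "lab_b": [
        {"national_id": "111", "birth_date": "1980-01-01"},  # also held by lab_a
        {"national_id": "333", "birth_date": "1990-09-09"},
    ],
}

unique, duplicates = deduplicate(partitions)
print(f"{len(unique)} unique record(s), {duplicates} duplicate(s) removed")
```

In this sketch each record is looked up once in a hash set, so the work grows in proportion to the number of records. A secure variant would have to avoid exposing the linkage keys to any single party, which this example does not attempt.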

Crossref

WestminsterResearch

Flexible and Efficient Distributed Resolution of Large Entities

Author: C.I. Sidló
D. Menestrina
H. Köpcke
I. Bhattacharya
I. Fellegi
J. Dean
L. Getoor
M. Boley
M. Hernández
M. Weis
M. Yakout
O. Benjelloun
P. Christen
S. Guo
S.E. Whang
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2012

Crossref

SZTAKI Publication Repository