86 research outputs found

ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution

Author: A Elmagarmid
G Baudat
G Navarro
G Salton
IP Fellegi
J Bleiholder
L Bertossi
O Benjelloun
P Christen
P Christen
S Ceri
TM Cover
TN Herzog
V Rastogi
W Fan
Publication date: 24/08/2015

Entity resolution (ER), an important and common data cleaning problem, is about detecting duplicate data representations of the same external entities and merging them into single representations. Relatively recently, declarative rules called matching dependencies (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this work we show the process and the benefits of integrating three components of ER: (a) classifiers for duplicate/non-duplicate record pairs built using machine learning (ML) techniques; (b) MDs for supporting both the blocking phase of ML and the merge itself; and (c) the use of the declarative language LogiQL, an extended form of Datalog supported by the LogicBlox platform, for data processing and for the specification and enforcement of MDs. Comment: To appear in Proc. SUM, 201

arXiv.org e-Print Archive

Crossref

i-DATAQUEST : a Proposal for a Manufacturing Data Query System Based on a Graph

Author: A Lysenko
A Messina
B-H Yoon
G Navarro
IP Fellegi
R Pinquié
S Schabus
V Bonnici
Z Zhu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 05/07/2020

During the manufacturing product life cycle, an increasing volume of data is generated and stored in distributed resources. These data are heterogeneous, explicitly and implicitly linked, and may be structured or unstructured. The rapid, exhaustive and relevant acquisition of information from these data is a major issue for the manufacturing industry. The key challenges in this context are to transform heterogeneous data into a common searchable data model, to allow semantic search, to detect implicit links between data, and to rank results by relevance. To address this issue, the authors propose a query system based on a graph database. The graph is built from all the transformed manufacturing data and is then enriched with explicit and implicit data links. Finally, the enriched graph is queried through an extended query system defined by a knowledge graph. The authors describe a proof of concept to validate the proposal. After a partial implementation of this proof of concept, the authors obtain acceptable results but note that further effort is needed to improve the system response time. Finally, the authors open a discussion on rights management, user profiles and customization, and data updates. Chaire ENSAM-Capgemini sur le PLM du futu

Crossref

HAL Descartes

SAM : Science Arts et Métiers

A Machine Learning Trainable Model to Assess the Accuracy of Probabilistic Record Linkage

Author: CJ Burges
DF Williamson
DG Altman
DG Altman
DG Altman
DP Silveira da
HB Newcombe
IP Fellegi
JH Friedman
L Breiman
LE Raileanu
LR Dice
M Tromp
P Christen
RS Michalski
SJ Press
VI Levenshtein
X Meng
Y Siegert
Publication venue: 19th International Conference on Big Data Analytics and Knowledge Discovery (DaWaK)
Publication date: 03/08/2017

Record linkage (RL) is the process of identifying and linking data that relate to the same physical entity across multiple heterogeneous data sources. Deterministic linkage methods rely on the presence of common uniquely identifying attributes across all sources, while probabilistic approaches use non-unique attributes and calculate similarity indexes for pairwise comparisons. A key component of record linkage is accuracy assessment, the process of manually verifying and validating matched pairs to further refine linkage parameters and increase overall effectiveness. This process, however, is time-consuming and impractical when applied to large administrative data sources where millions of records must be linked. Additionally, it is potentially biased, as the gold standard used is often the reviewer’s intuition. In this paper, we present an approach for assessing and refining the accuracy of probabilistic linkage based on different supervised machine learning methods (decision trees, naïve Bayes, logistic regression, random forest, linear support vector machines and gradient boosted trees). We used data sets extracted from huge Brazilian socioeconomic and public health care data sources. These models were evaluated using receiver operating characteristic plots, sensitivity, specificity and positive predictive values collected from a 10-fold cross-validation method. Results show that logistic regression outperforms other classifiers and enables the creation of a generalized, very accurate model to validate linkage results

Crossref

UCL Discovery

Extending Naive Bayes Classifier with Hierarchy Feature Level Information for Record Linkage

Author: AK Elmagarmid
CP Campos de
CP Campos de
D Heckerman
HL Dunn
IP Fellegi
L Leitao
M Hall
M Tromp
MA Jaro
N Friedman
Y Zhou
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Using metric space indexing for complete and efficient record linkage

Author: A Reid
B Ramadan
C Li
D Hand
G Papadakis
GR Hjaltason
H Newcombe
IP Fellegi
L Bo
P Christen
P Christen
P Zezula
Q Wang
R Connor
R Connor
RC Steorts
V Levenshtein
XL Dong
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018

Record linkage is the process of identifying records that refer to the same real-world entities in situations where entity identifiers are unavailable. Records are linked on the basis of similarity between common attributes, with every pair being classified as a link or non-link depending on their similarity. Linkage is usually performed in a three-step process: first, groups of similar candidate records are identified using indexing, then pairs within the same group are compared in more detail, and finally classified. Even state-of-the-art indexing techniques, such as locality sensitive hashing, have potential drawbacks. They may fail to group together some true matching records with high similarity, or they may group records with low similarity, leading to high computational overhead. We propose using metric space indexing (MSI) to perform complete linkage, resulting in a parameter-free process combining indexing, comparison and classification into a single step delivering complete and efficient record linkage. An evaluation on real-world data from several domains shows that linkage using MSI can yield better quality than current indexing techniques, with similar execution cost, without the need for domain knowledge or trial and error to configure the process. Postprin

Crossref

University of St. Andrews - Pure

St Andrews Research Repository

How good is probabilistic record linkage to reconstruct reproductive histories? Results from the Aberdeen children of the 1950s study

Author: A Coulter
Bianca L DeStavola
CM Coeli
CR Ramsay
D Nitsch
D Whiteman
DA Leon
David A Leon
Dorothea Nitsch
G Howe
GD Batty
HB Newcombe
Heather Clark
HS Shannon
Information and Statistics Division
IP Fellegi
M Fair
M Jaro
MM Adams
R Illsley
S Gomatam
S Harlow
S Kendrick
SMB Morton
Stata Corp
Susan Morton
The West of Scotland Coronary Prevention Study Group
Y Nishiwaki
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Probabilistic record linkage is widely used in epidemiology, but studies of its validity are rare. Our aim was to validate its use to identify births to a cohort of women drawn from a large cohort of people born in Scotland in the early 1950s. METHODS: The Children of the 1950s cohort includes 5868 females born in Aberdeen 1950–56 who were in primary schools in the city in 1962. In 2001 a postal questionnaire was sent to the cohort members resident in the UK requesting information on offspring. Probabilistic record linkage (based on surname, maiden name, initials, date of birth and postcode) was used to link the females in the cohort to birth records held by the Scottish Maternity Record System (SMR 2). RESULTS: We attempted to mail a total of 5540 women; 3752 (68%) returned a completed questionnaire. Of these, 86% reported having had at least one birth. Linkage to SMR 2 was attempted for 5634 women; one or more maternity records were found for 3743. There were 2604 women who reported at least one birth in the questionnaire and who were linked to one or more SMR 2 records. When judged against the questionnaire information, the linkage correctly identified 4930 births and missed 601 others. These mostly occurred outside of Scotland (147) or prior to full coverage by SMR 2 (454). There were 134 births incorrectly linked to SMR 2. CONCLUSION: Probabilistic record linkage to routine maternity records applied to a population-based cohort, using name, date of birth and place of residence, can have high specificity, and as such may be reliably used in epidemiological research
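
The birth counts reported in the RESULTS above can be turned into summary rates. The following is a minimal sketch, not the authors' analysis: it assumes that the 4930 correctly identified births are true positives, the 601 missed births are false negatives, and the 134 incorrectly linked births are false positives. The abstract gives no count of true negatives, so specificity itself cannot be recomputed here; the function and variable names are illustrative only.

```python
# Counts taken from the abstract (births judged against questionnaire data).
correctly_linked = 4930    # assumed true positives
missed = 601               # assumed false negatives (147 + 454)
incorrectly_linked = 134   # assumed false positives


def linkage_summary(tp, fn, fp):
    """Return (sensitivity, positive predictive value) as fractions."""
    sensitivity = tp / (tp + fn)   # share of reported births that were linked
    ppv = tp / (tp + fp)           # share of linked births that were correct
    return sensitivity, ppv


sensitivity, ppv = linkage_summary(correctly_linked, missed, incorrectly_linked)
print(f"sensitivity = {sensitivity:.3f}")
print(f"PPV = {ppv:.3f}")
```

Under these assumptions, roughly nine in ten reported births were found by the linkage, and nearly all linked births were correct.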

Aberdeen University Research

Crossref

LSHTM Research Online

Springer - Publisher Connector

PubMed Central

A proficient cost reduction framework for de-duplication of records in data integration

Author: AK Elmagarmid
Asif Sohail
Data Integration Manual
E Rahm
F Bauer
F Maggi
H Köpcke
IP Fellegi
J Bleiholder
K Goiser
L Gu
L Gu
L Gu
L Jiang
L Patrick
M Michelson
M Odell
M Samwald
MA Hernandez
MG Elfeky
Muhammad Murtaza Yousaf
P Christen
P Christen
P Giang
R Baxter
S Chaudhuri
S Yan
SE Whang
SE Whang
SM Randall
T Fawcett
U Draisbach
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Investigating increased admissions to neonatal intensive care in England between 1995 and 2006: data linkage study using Hospital Episode Statistics

Author: A Macfarlane
AH Jobe
Andrei S. Morgan
C Abrahams
Elizabeth S. Draper
FL Bahadue
IP Fellegi
K Costeloe
K Moser
Kate Costeloe
KL Costeloe
L Hilder
L Oakley
N Dattani
N Dattani
N Dattani
Neil Marlow
P Contiero
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Improving Record Linkage Accuracy with Hierarchical Feature Level Information and Parsed Data

Author: AK Elmagarmid
CP Campos de
CP Campos de
D Heckerman
E Rahm
H Köpcke
HL Dunn
IP Fellegi
J. Mark Bishop
John Howroyd
L Leitão
M Hall
M Tromp
MA Jaro
Minlue Wang
N Friedman
Sebastian Danicic
T Churches
Valeriia Haberland
Y Zhou
Yun Zhou
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 10/01/2017

Probabilistic record linkage is a well-established topic in the literature. Fellegi-Sunter probabilistic record linkage and its enhanced versions are commonly used methods, which calculate match and non-match weights for each pair of records. Bayesian network classifiers, namely the naive Bayes classifier and TAN, have also been successfully used here. Recently, an extended version of TAN (called ETAN) has been developed and proved superior in classification accuracy to conventional TAN. However, no previous work has applied ETAN to record linkage and investigated the benefits of using naturally existing hierarchical feature level information and parsed fields of the datasets. In this work, we extend the naive Bayes classifier with such hierarchical feature level information. Finally, we illustrate the benefits of our method over previously proposed methods on four datasets in terms of linkage performance (F1 score). We also show that the results can be further improved by evaluating the benefit provided by additionally parsing the fields of these datasets
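
The match and non-match weights mentioned in the abstract can be illustrated with the standard Fellegi-Sunter formulation, in which each field contributes a log-likelihood ratio. This is a minimal sketch of that classical scheme, not the extended naive Bayes method the paper proposes. The field names and the m and u probabilities below are hypothetical values chosen for illustration.

```python
import math


def field_weights(m, u):
    """Agreement and disagreement weights (log base 2) for one field.

    m: probability the field agrees given the pair is a true match.
    u: probability the field agrees given the pair is a non-match.
    """
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree


def pair_score(agreements, params):
    """Sum the per-field weights for one record pair."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = params[field]
        agree_w, disagree_w = field_weights(m, u)
        total += agree_w if agrees else disagree_w
    return total


# Hypothetical (m, u) probabilities per field; real values are estimated from data.
params = {"surname": (0.95, 0.01), "birth_year": (0.90, 0.05)}

both_agree = pair_score({"surname": True, "birth_year": True}, params)
one_agrees = pair_score({"surname": True, "birth_year": False}, params)
print(both_agree, one_agrees)
```

A pair is then classified by comparing its total score with thresholds; choosing those thresholds is outside this sketch.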

Goldsmiths Research Online

Crossref

Explore Bristol Research

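Parameter estimates in the linkage between mortality and hospitalization databases, according to the quality of the recording of the underlying cause of death (Estimativas de parâmetros no linkage entre os bancos de mortalidade e de hospitalização, segundo a qualidade do registro da causa básica do óbito)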

Author: Alexandre dos Santos Brito
Brenner H
Camargo Jr. KR
Camargo Jr. KR
Cláudia Medina Coeli
Fellegi IP
Flávia dos Santos Barbosa
Herzog TN
Junger WL
Katia Vergetti Bloch
Kenneth Rochel de Camargo Jr.
Pinheiro RS
Rejane Sobrino Pinheiro
Roberto de Andrade Medronho
Teixeira CLS
Winkler WE
Publication venue: 'FapUNIFESP (SciELO)'

Crossref