Reasoning about Record Matching Rules
To accurately match records it is often necessary to utilize the semantics of the data. Functional dependencies (FDs) have proven useful in identifying tuples in a clean relation, based on the semantics of the data. For all the reasons that FDs and their inference are needed, it is also important to develop dependencies and their reasoning techniques for matching tuples from unreliable data sources. This paper investigates dependencies and their reasoning for record matching. (a) We introduce a class of matching dependencies (MDs) for specifying the semantics of data in unreliable relations, defined in terms of similarity metrics and a dynamic semantics. (b) We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring MDs, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. (d) We provide an O(n^2)-time algorithm for inferring MDs, and an effective algorithm for deducing a set of RCKs from MDs. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking or windowing, and that the techniques effectively improve both the quality and efficiency of various record matching methods.
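The core idea of a matching dependency — comparing a chosen set of attributes under similarity metrics and identifying two records when every comparison passes — can be sketched as follows. This is an illustrative sketch only: the `sim` function, the 0.8 threshold, and the sample records are hypothetical stand-ins, not the paper's actual algorithms.

```python
from difflib import SequenceMatcher

def sim(a, b):
    # Simple string similarity in [0, 1]; stands in for any similarity metric.
    return SequenceMatcher(None, a, b).ratio()

def md_match(r1, r2, lhs, threshold=0.8):
    # Matching dependency: if every attribute in the LHS list is sufficiently
    # similar across the two records, the MD asserts they are the same entity.
    return all(sim(r1[a], r2[a]) >= threshold for a in lhs)

r1 = {"name": "J. Smith",   "addr": "10 Main Street", "phone": "555-0101"}
r2 = {"name": "John Smith", "addr": "10 Main St",     "phone": "555-0101"}

# Comparing on (addr, phone) -- in the paper's terms, a relative candidate key
# would tell us which attributes suffice -- matches the pair even though a
# comparison on the error-prone name attribute alone would fail.
print(md_match(r1, r2, ["addr", "phone"]))  # True
print(md_match(r1, r2, ["name"]))           # False
```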
Are routinely collected NHS administrative records suitable for endpoint identification in clinical trials? Evidence from the West of Scotland coronary prevention study
Background: Routinely collected electronic patient records are already widely used in epidemiological research. In this work we investigated the potential for using them to identify endpoints in clinical trials.

Methods: The events recorded in the West of Scotland Coronary Prevention Study (WOSCOPS), a large clinical trial of pravastatin in middle-aged hypercholesterolaemic men in the 1990s, were compared with those in the record-linked deaths and hospitalisations records routinely collected in Scotland.

Results: We matched 99% of fatal study events by date. We showed excellent matching (97%) of the causes of fatal endpoint events and good matching (>80% for first events) of the causes of nonfatal endpoint events, with record linkage showing a slightly lower rate of mismatching than study events (19% of first study myocardial infarctions (MIs) and 4% of first record linkage MIs were not matched as MI). We also investigated the matching of non-endpoint events and showed a good level of matching, with >78% of first stroke/TIA events being matched as stroke/TIA. The primary reasons for mismatches were: record linkage data recording readmissions for procedures or previous events; differences between the diagnoses in the routinely collected data and the conclusions of the clinical trial expert adjudication committee; events occurring outside Scotland and therefore being missed by record linkage data; miscoding of cardiac events in hospitalisations data as 'unspecified chest pain'; some general miscoding in the record linkage data; and some record linkage errors.

Conclusions: We conclude that routinely collected data could be used for recording cardiovascular endpoints in clinical trials and would give very similar results to rigorously collected clinical trial data, in countries with unified health systems such as Scotland. The endpoint types would need to be carefully thought through and an expert endpoint adjudication committee should be involved.
A hierarchical Bayesian approach to record linkage and population size problems
We propose and illustrate a hierarchical Bayesian approach for matching
statistical records observed on different occasions. We show how this model can
be profitably adopted both in record linkage problems and in capture--recapture
setups, where the size of a finite population is the real object of interest.
There are at least two important differences between the proposed model-based
approach and the current practice in record linkage. First, the statistical
model is built up on the actually observed categorical variables and no
reduction (to 0--1 comparisons) of the available information takes place.
Second, the hierarchical structure of the model allows a two-way propagation of
the uncertainty between the parameter estimation step and the matching
procedure so that no plug-in estimates are used and the correct uncertainty is
accounted for both in estimating the population size and in performing the
record linkage. We illustrate and motivate our proposal through a real data
example and simulations.

Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics. DOI: http://dx.doi.org/10.1214/10-AOAS447
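The difference the authors highlight — modeling the observed categorical values directly instead of reducing record pairs to 0-1 agreement vectors — can be illustrated with a toy sketch. The surnames and the frequency table below are invented for illustration; they are not from the paper.

```python
import math

def comparison_vector(r1, r2):
    # The traditional reduction: each field pair collapses to agree (1) or
    # disagree (0), discarding *which* values agreed.
    return [int(a == b) for a, b in zip(r1, r2)]

# Two record pairs, each as (fields of record 1, fields of record 2):
pair_common = (("Smith", "M"), ("Smith", "M"))    # agree on a frequent surname
pair_rare   = (("Zyzmor", "M"), ("Zyzmor", "M"))  # agree on a rare surname

# After the 0-1 reduction, both pairs look identical...
print(comparison_vector(*pair_common))  # [1, 1]
print(comparison_vector(*pair_rare))    # [1, 1]

# ...yet agreement on a rare value is much stronger evidence of a true match.
# A model built on the observed categories can exploit that, e.g. by weighting
# agreement by the rarity of the shared value (hypothetical frequencies):
surname_freq = {"Smith": 0.01, "Zyzmor": 0.00001}

def agreement_weight(name):
    return math.log(1.0 / surname_freq[name])

print(agreement_weight("Zyzmor") > agreement_weight("Smith"))  # True
```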
A method and a tool for geocoding and record linkage
For many years, researchers have treated the geocoding of postal addresses as a challenge, and several research efforts have been devoted to the geocoding process. This paper presents theoretical and technical aspects of geolocalization, geocoding, and record linkage. It shows the possibilities and limitations of existing methods and commercial software, identifying areas for further research. In particular, we present a methodology and a computing tool for the correction and geocoding of mailing addresses. The methodology has two main steps: the first, preliminary step is address correction (address matching), while the second carries out the geocoding of the identified addresses. Additionally, we present some results from the processing of real data sets. Finally, in the discussion, areas for further research are identified.

Keywords: address correction; geocoding; matching; data management; record linkage
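The two-step pipeline described above (address correction, then geocoding) can be sketched minimally. The in-memory gazetteer, the abbreviation table, and the sample address below are hypothetical stand-ins, not the authors' actual tool or reference data:

```python
import re

# Hypothetical reference table mapping canonical addresses to coordinates;
# a real tool would query a gazetteer or street database.
GAZETTEER = {"10 main street lyon": (45.7640, 4.8357)}

# A few common abbreviation expansions (illustrative, not exhaustive).
ABBREV = {"st": "street", "ave": "avenue", "rd": "road"}

def correct_address(raw):
    # Step 1: address correction -- normalise case and punctuation, then
    # expand abbreviations so the address matches the canonical form.
    tokens = re.sub(r"[.,]", " ", raw.lower()).split()
    tokens = [ABBREV.get(t, t) for t in tokens]
    return " ".join(tokens)

def geocode(raw):
    # Step 2: geocoding -- look the corrected address up in the reference
    # data; a real system would fall back to fuzzy matching here.
    return GAZETTEER.get(correct_address(raw))  # None if unresolved

print(geocode("10 Main St., Lyon"))  # (45.764, 4.8357)
```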
Record Linkage Based on Entities' Behavior
Record linkage is the problem of identifying similar records across different data sources. Traditional record linkage techniques focus on using simple database attributes in a textual similarity comparison to decide on matched and non-matched records. More recently, record linkage techniques have incorporated extracted knowledge and domain information to help enhance matching accuracy. In this paper, we present a new record linkage technique based on an entity's behavior, which can be extracted from a transaction log. In the matching process, we measure the improvement in identifying a behavior when two entities' transaction logs are merged. To do so, we use two matching phases: first, a candidate generation phase, which is fast and produces almost no false negatives, but at low precision; and second, an accurate matching phase, which improves the precision of the matching at a high run-time cost. In the candidate generation phase, behavior is represented by points in the complex plane, where we perform approximate evaluations. In the accurate matching phase, we use a heuristic called compressibility, whereby correctly identified behaviors are more compressible. Our experiments show that the proposed technique enhances record linkage quality while remaining practical for large logs. We also perform an extensive sensitivity analysis of the technique's accuracy and performance.
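The intuition behind a compressibility heuristic — the merged logs of two records that represent the same entity repeat the same behavioral patterns and therefore compress better than unrelated logs — can be sketched with a general-purpose compressor. The toy logs below and the use of zlib are illustrative assumptions, not the paper's implementation:

```python
import zlib

def compressed_size(text):
    return len(zlib.compress(text.encode()))

def compressibility_gain(log_a, log_b):
    # How many bytes are saved by compressing the merged log instead of the
    # two logs separately; a larger gain suggests shared behavior patterns,
    # i.e. the two records are more likely the same entity.
    separate = compressed_size(log_a) + compressed_size(log_b)
    merged = compressed_size(log_a + log_b)
    return separate - merged

# Toy transaction logs (invented): one repeated behavior per entity.
same  = "buy:milk buy:bread buy:milk " * 20
other = "view:tv view:phone view:tv " * 20

gain_same  = compressibility_gain(same, same)    # same behavior merged
gain_other = compressibility_gain(same, other)   # unrelated behaviors merged
print(gain_same > gain_other)  # True
```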
Recursive proof of the Bell-Kochen-Specker theorem in any dimension
We present a method to obtain sets of vectors proving the Bell-Kochen-Specker theorem in a given dimension from a similar set in a lower dimension. As an application of the method, we find the smallest proofs known in dimension five (29 vectors), six (31), and seven (34), and different sets matching the current record (36) in dimension eight.

Comment: LaTeX, 7 pages