Reasoning about Record Matching Rules
To accurately match records it is often necessary to utilize the semantics of the data. Functional dependencies (FDs) have proven useful in identifying tuples in a clean relation, based on the semantics of the data. For all the reasons that FDs and their inference are needed, it is also important to develop dependencies and their reasoning techniques for matching tuples from unreliable data sources. This paper investigates dependencies and their reasoning for record matching. (a) We introduce a class of matching dependencies (MDs) for specifying the semantics of data in unreliable relations, defined in terms of similarity metrics and a dynamic semantics. (b) We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring MDs, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. (d) We provide an O(n^2)-time algorithm for inferring MDs, and an effective algorithm for deducing a set of RCKs from MDs. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking or windowing, and that the techniques effectively improve both the quality and efficiency of various record matching methods.
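The core idea of a matching dependency — comparing a chosen set of attributes under similarity metrics and identifying two records when every comparison passes — can be sketched as follows. This is an illustrative sketch only: the `sim` function, the 0.8 threshold, and the sample records are hypothetical stand-ins, not the paper's actual algorithms.

```python
from difflib import SequenceMatcher

def sim(a, b):
    # Simple string similarity in [0, 1]; stands in for any similarity metric.
    return SequenceMatcher(None, a, b).ratio()

def md_match(r1, r2, lhs, threshold=0.8):
    # Matching dependency: if every attribute in the LHS list is sufficiently
    # similar across the two records, the MD asserts they are the same entity.
    return all(sim(r1[a], r2[a]) >= threshold for a in lhs)

r1 = {"name": "J. Smith",   "addr": "10 Main Street", "phone": "555-0101"}
r2 = {"name": "John Smith", "addr": "10 Main St",     "phone": "555-0101"}

# Comparing on (addr, phone) -- in the paper's terms, a relative candidate key
# would tell us which attributes suffice -- matches the pair even though a
# comparison on the error-prone name attribute alone would fail.
print(md_match(r1, r2, ["addr", "phone"]))  # True
print(md_match(r1, r2, ["name"]))           # False
```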
Are routinely collected NHS administrative records suitable for endpoint identification in clinical trials? Evidence from the West of Scotland coronary prevention study
Background: Routinely collected electronic patient records are already widely used in epidemiological research. In this work we investigated the potential for using them to identify endpoints in clinical trials.

Methods: The events recorded in the West of Scotland Coronary Prevention Study (WOSCOPS), a large clinical trial of pravastatin in middle-aged hypercholesterolaemic men in the 1990s, were compared with those in the record-linked deaths and hospitalisations records routinely collected in Scotland.

Results: We matched 99% of fatal study events by date. We showed excellent matching (97%) of the causes of fatal endpoint events and good matching (>80% for first events) of the causes of nonfatal endpoint events, with record linkage showing a slightly lower rate of mismatching than study events (19% of first study myocardial infarctions (MIs) and 4% of first record linkage MIs were not matched as MI). We also investigated the matching of non-endpoint events and showed a good level of matching, with >78% of first stroke/TIA events being matched as stroke/TIA. The primary reasons for mismatches were: record linkage data recording readmissions for procedures or previous events; differences between the diagnoses in the routinely collected data and the conclusions of the clinical trial expert adjudication committee; events occurring outside Scotland and therefore being missed by record linkage data; miscoding of cardiac events in hospitalisations data as 'unspecified chest pain'; some general miscoding in the record linkage data; and some record linkage errors.

Conclusions: We conclude that routinely collected data could be used for recording cardiovascular endpoints in clinical trials and would give very similar results to rigorously collected clinical trial data, in countries with unified health systems such as Scotland. The endpoint types would need to be carefully thought through and an expert endpoint adjudication committee should be involved.
A hierarchical Bayesian approach to record linkage and population size problems
We propose and illustrate a hierarchical Bayesian approach for matching
statistical records observed on different occasions. We show how this model can
be profitably adopted both in record linkage problems and in capture--recapture
setups, where the size of a finite population is the real object of interest.
There are at least two important differences between the proposed model-based
approach and the current practice in record linkage. First, the statistical
model is built up on the actually observed categorical variables and no
reduction (to 0--1 comparisons) of the available information takes place.
Second, the hierarchical structure of the model allows a two-way propagation of
the uncertainty between the parameter estimation step and the matching
procedure so that no plug-in estimates are used and the correct uncertainty is
accounted for both in estimating the population size and in performing the
record linkage. We illustrate and motivate our proposal through a real data
example and simulations.

Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics. DOI: http://dx.doi.org/10.1214/10-AOAS447
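The difference the authors highlight — modeling the observed categorical values directly instead of reducing record pairs to 0-1 agreement vectors — can be illustrated with a toy sketch. The surnames and the frequency table below are invented for illustration; they are not from the paper.

```python
import math

def comparison_vector(r1, r2):
    # The traditional reduction: each field pair collapses to agree (1) or
    # disagree (0), discarding *which* values agreed.
    return [int(a == b) for a, b in zip(r1, r2)]

# Two record pairs, each as (fields of record 1, fields of record 2):
pair_common = (("Smith", "M"), ("Smith", "M"))    # agree on a frequent surname
pair_rare   = (("Zyzmor", "M"), ("Zyzmor", "M"))  # agree on a rare surname

# After the 0-1 reduction, both pairs look identical...
print(comparison_vector(*pair_common))  # [1, 1]
print(comparison_vector(*pair_rare))    # [1, 1]

# ...yet agreement on a rare value is much stronger evidence of a true match.
# A model built on the observed categories can exploit that, e.g. by weighting
# agreement by the rarity of the shared value (hypothetical frequencies):
surname_freq = {"Smith": 0.01, "Zyzmor": 0.00001}

def agreement_weight(name):
    return math.log(1.0 / surname_freq[name])

print(agreement_weight("Zyzmor") > agreement_weight("Smith"))  # True
```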
A method and a tool for geocoding and record linkage
For many years, researchers have treated the geocoding of postal addresses as a challenge, and several research efforts have been devoted to the geocoding process. This paper presents theoretical and technical aspects of geolocalization, geocoding, and record linkage. It shows the possibilities and limitations of existing methods and commercial software, identifying areas for further research. In particular, we present a methodology and a computing tool for the correction and geocoding of mailing addresses. The methodology has two main steps: the first, preliminary step is address correction (address matching), while the second carries out the geocoding of the identified addresses. Additionally, we present some results from the processing of real data sets. Finally, in the discussion, areas for further research are identified.

Keywords: address correction; geocoding; matching; data management; record linkage
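The two-step pipeline described above (address correction, then geocoding) can be sketched minimally. The in-memory gazetteer, the abbreviation table, and the sample address below are hypothetical stand-ins, not the authors' actual tool or reference data:

```python
import re

# Hypothetical reference table mapping canonical addresses to coordinates;
# a real tool would query a gazetteer or street database.
GAZETTEER = {"10 main street lyon": (45.7640, 4.8357)}

# A few common abbreviation expansions (illustrative, not exhaustive).
ABBREV = {"st": "street", "ave": "avenue", "rd": "road"}

def correct_address(raw):
    # Step 1: address correction -- normalise case and punctuation, then
    # expand abbreviations so the address matches the canonical form.
    tokens = re.sub(r"[.,]", " ", raw.lower()).split()
    tokens = [ABBREV.get(t, t) for t in tokens]
    return " ".join(tokens)

def geocode(raw):
    # Step 2: geocoding -- look the corrected address up in the reference
    # data; a real system would fall back to fuzzy matching here.
    return GAZETTEER.get(correct_address(raw))  # None if unresolved

print(geocode("10 Main St., Lyon"))  # (45.764, 4.8357)
```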
Record Linkage Based on Entities' Behavior
Record linkage is the problem of identifying similar records across different data sources. Traditional record linkage techniques focus on using simple database attributes in a textual similarity comparison to decide on matched and non-matched records. More recently, record linkage techniques have incorporated extracted knowledge and domain information to help enhance matching accuracy. In this paper, we present a new record linkage technique based on an entity's behavior, which can be extracted from a transaction log. In the matching process, we measure the improvement in identifying a behavior when two entities' transaction logs are merged. To do so, we use two matching phases: first, a candidate generation phase, which is fast and produces almost no false negatives, but at low precision; and second, an accurate matching phase, which improves the precision of the matching at a high run-time cost. In the candidate generation phase, behavior is represented by points in the complex plane, where we perform approximate evaluations. In the accurate matching phase, we use a heuristic called compressibility, whereby correctly identified behaviors are more compressible. Our experiments show that the proposed technique enhances record linkage quality while remaining practical for large logs. We also perform an extensive sensitivity analysis of the technique's accuracy and performance.
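The intuition behind a compressibility heuristic — the merged logs of two records that represent the same entity repeat the same behavioral patterns and therefore compress better than unrelated logs — can be sketched with a general-purpose compressor. The toy logs below and the use of zlib are illustrative assumptions, not the paper's implementation:

```python
import zlib

def compressed_size(text):
    return len(zlib.compress(text.encode()))

def compressibility_gain(log_a, log_b):
    # How many bytes are saved by compressing the merged log instead of the
    # two logs separately; a larger gain suggests shared behavior patterns,
    # i.e. the two records are more likely the same entity.
    separate = compressed_size(log_a) + compressed_size(log_b)
    merged = compressed_size(log_a + log_b)
    return separate - merged

# Toy transaction logs (invented): one repeated behavior per entity.
same  = "buy:milk buy:bread buy:milk " * 20
other = "view:tv view:phone view:tv " * 20

gain_same  = compressibility_gain(same, same)    # same behavior merged
gain_other = compressibility_gain(same, other)   # unrelated behaviors merged
print(gain_same > gain_other)  # True
```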
Recursive proof of the Bell-Kochen-Specker theorem in any dimension
We present a method to obtain sets of vectors proving the Bell-Kochen-Specker theorem in a given dimension from a similar set in a lower dimension. As an application of the method, we find the smallest proofs known in dimension five (29 vectors), six (31), and seven (34), and different sets matching the current record (36) in dimension eight.

Comment: LaTeX, 7 pages