
    Reasoning about Record Matching Rules

    To accurately match records it is often necessary to utilize the semantics of the data. Functional dependencies (FDs) have proven useful in identifying tuples in a clean relation, based on the semantics of the data. For all the reasons that FDs and their inference are needed, it is also important to develop dependencies and their reasoning techniques for matching tuples from unreliable data sources. This paper investigates dependencies and their reasoning for record matching. (a) We introduce a class of matching dependencies (MDs) for specifying the semantics of data in unreliable relations, defined in terms of similarity metrics and a dynamic semantics. (b) We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring MDs, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. (d) We provide an O(n^2)-time algorithm for inferring MDs, and an effective algorithm for deducing a set of RCKs from MDs. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking or windowing, and that the techniques effectively improve both the quality and efficiency of various record matching methods.
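
    The gist of an MD can be sketched in a few lines. Below is a minimal illustration, not the paper's formalism: the attribute names (addr, zip, phone), the difflib-based similarity metric, and the 0.75 threshold are all invented for the example. The final step reflects the dynamic semantics, in which a successful match updates the data so the identified attributes become equal.

        from difflib import SequenceMatcher

        def sim(a: str, b: str) -> float:
            # Stand-in similarity metric in [0, 1]; the paper allows any metric.
            return SequenceMatcher(None, a.lower(), b.lower()).ratio()

        # Illustrative MD: if two records are similar on addr and equal on zip,
        # then identify their phone values.
        def md_applies(r1: dict, r2: dict) -> bool:
            return sim(r1["addr"], r2["addr"]) >= 0.75 and r1["zip"] == r2["zip"]

        r1 = {"addr": "10 Oak Street", "zip": "G12 8QQ", "phone": "0141 555 0101"}
        r2 = {"addr": "10 Oak St.",    "zip": "G12 8QQ", "phone": None}

        if md_applies(r1, r2):
            # Dynamic semantics: matching *updates* the instance so that the
            # identified attributes agree afterwards.
            r2["phone"] = r2["phone"] or r1["phone"]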

    Are routinely collected NHS administrative records suitable for endpoint identification in clinical trials? Evidence from the West of Scotland coronary prevention study

    Background: Routinely collected electronic patient records are already widely used in epidemiological research. In this work we investigated the potential for using them to identify endpoints in clinical trials.

    Methods: The events recorded in the West of Scotland Coronary Prevention Study (WOSCOPS), a large clinical trial of pravastatin in middle-aged hypercholesterolaemic men in the 1990s, were compared with those in the record-linked deaths and hospitalisations records routinely collected in Scotland.

    Results: We matched 99% of fatal study events by date. We showed excellent matching (97%) of the causes of fatal endpoint events and good matching (>80% for first events) of the causes of nonfatal endpoint events, with a slightly lower rate of mismatching for record linkage than for study events (19% of first study myocardial infarctions (MI) and 4% of first record linkage MIs were not matched as MI). We also investigated the matching of non-endpoint events and showed a good level of matching, with >78% of first stroke/TIA events being matched as stroke/TIA. The primary reasons for mismatches were: record linkage data recording readmissions for procedures or previous events; differences between the diagnoses in the routinely collected data and the conclusions of the clinical trial expert adjudication committee; events occurring outside Scotland and therefore being missed by record linkage data; miscoding of cardiac events in hospitalisations data as ‘unspecified chest pain’; some general miscoding in the record linkage data; and some record linkage errors.

    Conclusions: We conclude that routinely collected data could be used for recording cardiovascular endpoints in clinical trials and would give very similar results to rigorously collected clinical trial data, in countries with unified health systems such as Scotland. The endpoint types would need to be carefully thought through and an expert endpoint adjudication committee should be involved.
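
    The matching the abstract describes (align each adjudicated trial event with a routinely collected record by date, then compare the recorded causes) can be sketched as follows. The field names, the exact-date window, and the cause coding are assumptions for illustration, not the study's actual protocol.

        from datetime import date

        def match_endpoint(trial_event, linked_records, window_days=0):
            # Find a linked record whose date falls within the window of the
            # adjudicated event date, then check whether the causes agree.
            for rec in linked_records:
                if abs((rec["date"] - trial_event["date"]).days) <= window_days:
                    return rec, rec["cause"] == trial_event["cause"]
            return None, False

        trial_event = {"date": date(1993, 4, 2), "cause": "MI"}
        linked = [
            {"date": date(1993, 4, 2), "cause": "MI"},
            {"date": date(1994, 1, 7), "cause": "unspecified chest pain"},
        ]
        rec, cause_agrees = match_endpoint(trial_event, linked)  # match found, causes agree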

    A hierarchical Bayesian approach to record linkage and population size problems

    We propose and illustrate a hierarchical Bayesian approach for matching statistical records observed on different occasions. We show how this model can be profitably adopted both in record linkage problems and in capture-recapture setups, where the size of a finite population is the real object of interest. There are at least two important differences between the proposed model-based approach and the current practice in record linkage. First, the statistical model is built on the actually observed categorical variables, and no reduction (to 0-1 comparisons) of the available information takes place. Second, the hierarchical structure of the model allows a two-way propagation of uncertainty between the parameter estimation step and the matching procedure, so that no plug-in estimates are used and the correct uncertainty is accounted for both in estimating the population size and in performing the record linkage. We illustrate and motivate our proposal through a real data example and simulations.

    Comment: Published at http://dx.doi.org/10.1214/10-AOAS447 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
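
    The first difference the authors highlight, keeping the observed categorical values rather than reducing them to 0-1 agreement indicators, can be illustrated with a toy comparison weight. The probabilities and the frequency table below are invented for the example and are not taken from the paper:

        import math

        def binary_weight(v1, v2, m=0.95, u=0.10):
            # 0-1 reduction: only agree/disagree enters the likelihood ratio.
            return math.log(m / u) if v1 == v2 else math.log((1 - m) / (1 - u))

        def categorical_weight(v1, v2, freq, m=0.95):
            # Keep the value itself: under a non-match, agreement on value v
            # occurs roughly with probability freq[v], so agreement on a rare
            # value is much stronger evidence for a true match.
            if v1 != v2:
                return math.log((1 - m) / 0.9)  # toy disagreement odds
            return math.log(m / freq[v1])

        freq = {"Smith": 0.08, "Zbigniew": 0.0005}
        # The 0-1 reduction scores both agreements identically ...
        print(binary_weight("Smith", "Smith") == binary_weight("Zbigniew", "Zbigniew"))  # True
        # ... while keeping the category separates common from rare agreement.
        print(categorical_weight("Smith", "Smith", freq))        # ~ 2.5
        print(categorical_weight("Zbigniew", "Zbigniew", freq))  # ~ 7.5

    In the paper itself such quantities are not plugged in as fixed constants: they carry priors within the hierarchical model, so uncertainty propagates both ways between parameter estimation and matching, which is the second difference the abstract stresses.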

    A method and a tool for geocoding and record linkage

    For many years, researchers have presented the geocoding of postal addresses as a challenge, and several research efforts have been devoted to the geocoding process. This paper presents theoretical and technical aspects of geolocalization, geocoding, and record linkage. It shows the possibilities and limitations of existing methods and commercial software, identifying areas for further research. In particular, we present a methodology and a computing tool for the correction and geocoding of mailing addresses. The paper presents the two main steps of the methodology: the first, preliminary step is address correction (address matching), while the second carries out the geocoding of the identified addresses. Additionally, we present some results from the processing of real data sets. Finally, in the discussion, areas for further research are identified.

    Keywords: address correction; geocoding; matching; data management; record linkage
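
    A minimal sketch of the two-step pipeline, assuming a toy in-memory gazetteer and difflib fuzzy matching in place of the authors' actual correction rules and reference data:

        import difflib

        # Toy reference data: corrected address -> (lat, lon). A real tool
        # would use a street gazetteer or a commercial geocoding database.
        GAZETTEER = {"10 OAK STREET GLASGOW": (55.8721, -4.2891)}

        def correct_address(raw):
            # Step 1: address correction -- normalise the raw mailing address
            # and fuzzy-match it against the reference addresses.
            key = " ".join(raw.upper().replace(",", " ").replace(".", " ").split())
            hits = difflib.get_close_matches(key, list(GAZETTEER), n=1, cutoff=0.8)
            return hits[0] if hits else None

        def geocode(raw):
            # Step 2: geocoding -- look up coordinates for the corrected address.
            corrected = correct_address(raw)
            return GAZETTEER.get(corrected) if corrected else None

        print(geocode("10 Oak St., Glasgow"))  # (55.8721, -4.2891) if matched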

    Record Linkage Based on Entities' Behavior

    Record linkage is the problem of identifying similar records across different data sources. Traditional record linkage techniques focus on comparing simple database attributes by textual similarity to decide on matched and non-matched records. Recently, record linkage techniques have also exploited extracted knowledge and domain information to help enhance matching accuracy. In this paper, we present a new technique for record linkage that is based on an entity's behavior, which can be extracted from a transaction log. In the matching process, we measure the improvement in identifying a behavior when the transaction logs of two entities are merged. To do so, we use two matching phases: first, a candidate generation phase, which is fast and produces almost no false negatives but low precision; second, an accurate matching phase, which improves the precision of the matching at a higher runtime cost. In the candidate generation phase, behavior is represented by points in the complex plane, where we perform approximate evaluations. In the accurate matching phase, we use a compressibility heuristic: genuinely shared behaviors make the merged log more compressible. Our experiments show that the proposed technique enhances record linkage quality while remaining practical for large logs. We also perform an extensive sensitivity analysis of the technique's accuracy and performance.
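
    The second phase's compressibility heuristic lends itself to a small sketch. The zlib-based compressor and the scoring function below are assumptions standing in for the paper's exact formulation; the idea is only that two logs generated by the same entity share behaviour, so their concatenation compresses better than the logs taken separately:

        import zlib

        def csize(log: bytes) -> int:
            return len(zlib.compress(log))

        def compressibility_gain(log_a: bytes, log_b: bytes) -> float:
            # If both logs come from one entity, the merged log exposes the
            # shared behaviour pattern and compresses disproportionately well.
            separate = csize(log_a) + csize(log_b)
            return 1.0 - csize(log_a + log_b) / separate  # larger => likelier match

        same_entity = compressibility_gain(b"buy:milk,bread;" * 40, b"buy:milk,bread;" * 30)
        different = compressibility_gain(b"buy:milk,bread;" * 40, b"sell:stock;" * 40)
        assert same_entity > different  # shared behaviour compresses better when merged

    The fast candidate phase would run first, e.g. mapping each log to points in the complex plane and pruning pairs whose points lie far apart, leaving this costlier compression test for the surviving candidate pairs.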

    Recursive proof of the Bell-Kochen-Specker theorem in any dimension n > 3

    We present a method to obtain sets of vectors proving the Bell-Kochen-Specker theorem in dimension n from a similar set in dimension d, where 3 ≤ d < n ≤ 2d. As an application of the method we find the smallest proofs known in dimension five (29 vectors), six (31), and seven (34), and different sets matching the current record (36) in dimension eight.

    Comment: LaTeX, 7 pages
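
    As a quick arithmetic check on the dimension condition (my instantiation, not stated in the abstract), a single source dimension d = 4 already covers all four target dimensions reported, and the same condition suggests how the recursion can reach every n > 3:

        % Instantiating 3 <= d < n <= 2d with d = 4:
        \[
          d = 4 \;\Longrightarrow\; n \in \{5, 6, 7, 8\},
        \]
        % consistent with the reported proofs in dimensions five (29 vectors),
        % six (31), seven (34) and eight (36). More generally, for any n >= 5
        % the choice d = \lceil n/2 \rceil satisfies 3 <= d < n <= 2d, and
        % n = 4 is reachable from d = 3, so induction from the known three-
        % and four-dimensional proofs covers every dimension n > 3.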