
    Record Linkage Techniques: Exploring and developing data matching methods to create national record linkage infrastructure to support population level research

    In a world where digital information and systems continue to expand, researchers have access to unprecedented amounts of data. These large and complex data reservoirs require creative, innovative and scalable tools to unlock the potential of this ‘big data’. Record linkage is a powerful tool in the ‘big data’ arsenal. This thesis demonstrates the value of national record linkage infrastructure and how it has been achieved for the Australian research community.

    The effect of data cleaning on record linkage quality

    Background: Within the field of record linkage, numerous data cleaning and standardisation techniques are employed to ensure the highest quality of links. While these facilities are common in record linkage software packages and are regularly deployed across record linkage units, little work has been published demonstrating the impact of data cleaning on linkage quality. Methods: A range of cleaning techniques was applied to both a synthetically generated dataset and a large administrative dataset previously linked to a high standard. The effect of these changes on linkage quality was measured using the pairwise F-measure. Results: Data cleaning made little difference to overall linkage quality, and heavy cleaning led to a decrease in quality. Further examination showed that these decreases arose because cleaning typically reduces variability in the data: although correct record pairs became more likely to match, incorrect pairs also became more likely to match, and the additional incorrect matches outweighed the additional correct ones, reducing quality overall. Conclusions: Data cleaning techniques have minimal effect on linkage quality, and care should be taken during the data cleaning process.
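
    As a concrete illustration of the quality metric used in this study, the following minimal Python sketch computes pairwise precision, recall and F-measure from sets of record pairs. The function name and the representation of links as consistently ordered (id, id) tuples are illustrative assumptions, not details taken from the paper.

    # Minimal sketch: pairwise linkage quality against a known truth set.
    def pairwise_f_measure(predicted_links, true_links):
        """Return (precision, recall, F-measure) for two sets of record pairs."""
        true_positives = len(predicted_links & true_links)
        precision = true_positives / len(predicted_links) if predicted_links else 0.0
        recall = true_positives / len(true_links) if true_links else 0.0
        if precision + recall == 0:
            return precision, recall, 0.0
        return precision, recall, 2 * precision * recall / (precision + recall)

    # Cleaning can add correct matches but also new incorrect ones, so the
    # F-measure can fall even as recall rises.
    predicted = {(1, 2), (3, 4), (5, 6)}
    truth = {(1, 2), (3, 4), (7, 8)}
    print(pairwise_f_measure(predicted, truth))  # approx (0.667, 0.667, 0.667)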

    Accuracy and completeness of patient pathways – the benefits of national data linkage in Australia

    Background: The technical challenges associated with national data linkage, and the extent of cross-border population movements, are explored as part of a pioneering research project. The project involved linking state-based hospital admission records and death registrations across Australia for a national study of hospital-related deaths. Methods: The project linked over 44 million morbidity and mortality records from four Australian states between 1 July 1999 and 31 December 2009 using probabilistic methods. The accuracy of the linkage was measured through comparison with jurisdictional keys sourced from the individual states. The extent of cross-border population movement between these states was also assessed. Results: Data matching identified almost twelve million individuals across the four Australian states. The percentage of individuals from one state with records found in another ranged from 3-5%. Using jurisdictional keys to measure linkage quality, the results indicate high matching efficiency (F-measure of 97-99%), with linkage processing taking only a matter of days. Conclusions: The results demonstrate the feasibility and accuracy of cross-jurisdictional linkage for national research. The benefits are substantial, particularly in relation to capturing the full complement of records in patient pathways that result from cross-border population movements. The project identified a sizeable ‘mobile’ population with hospital records in more than one state; research studies that focus on a single jurisdiction will under-enumerate the extent of hospital usage by individuals in the population, and it is important that researchers are aware of the impact of this missing hospital activity on their studies. The project highlights the need for an efficient and accurate data linkage system to support national research across Australia.

    Use of graph theory measures to identify errors in record linkage

    Ensuring high linkage quality is important in many record linkage applications, yet current methods for ensuring quality are manual and resource intensive. This paper seeks to determine the effectiveness of graph theory techniques in identifying record linkage errors. A range of graph theory techniques was applied to two linked datasets with known truth sets, and the ability of these techniques to identify groups containing errors was compared with that of a widely used threshold-setting technique. The methodology shows promise; however, further investigation of graph theory techniques is required. The development of more efficient and effective methods of improving linkage quality will result in higher quality datasets that can be delivered to researchers in shorter timeframes.
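
    The abstract does not enumerate the exact graph measures used, so the sketch below assumes one plausible candidate: the density of each linked group. In a correctly linked group of n records, all n*(n-1)/2 pairwise links should be present (density 1.0), so sparse groups are flagged for clerical review. The function name and threshold are hypothetical.

    # Illustrative sketch using networkx; links are pairs accepted by the linkage.
    import networkx as nx

    def flag_low_density_groups(links, min_density=0.8):
        graph = nx.Graph()
        graph.add_edges_from(links)
        flagged = []
        for component in nx.connected_components(graph):
            if len(component) > 2 and nx.density(graph.subgraph(component)) < min_density:
                flagged.append(sorted(component))
        return flagged

    links = [(1, 2), (2, 3), (1, 3),       # complete triangle: density 1.0, kept
             (4, 5), (5, 6), (6, 7)]       # chain of four: density 0.5, flagged
    print(flag_low_density_groups(links))  # [[4, 5, 6, 7]]

    Other group-level measures, such as bridge counts or minimum cuts, could be substituted inside the same loop.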

    A simple sampling method for estimating the accuracy of large scale record linkage projects

    Background: Record linkage techniques allow different data collections to be brought together to provide a wider picture of the health status of individuals. Ensuring high linkage quality is important to guarantee the quality and integrity of research. Current methods for measuring linkage quality typically focus on precision (the proportion of accepted links that are correct), given the difficulty of measuring the proportion of false negatives. Objectives: The aim of this work is to introduce and evaluate a sampling-based method to estimate both precision and recall following record linkage. Methods: In the sampling-based method, record pairs from each threshold band (including those below the identified cut-off for acceptance) are sampled and clerically reviewed. These results are then extrapolated to the entire set of record pairs, providing estimates of false positives and false negatives. The method was evaluated on a synthetically generated dataset where the true match status (which records belonged to the same person) was known. Results: The sampled estimates of linkage quality were close to the actual linkage quality metrics calculated for the whole synthetic dataset. The precision and recall measures from seven reviewers were very consistent, with little variation in the clerical assessment results (overall agreement using Fleiss' kappa statistic was 0.601). Conclusions: This method offers a possible means of accurately estimating matching quality and refining linkages in population-level linkage studies. The sampling approach is especially important for large linkage projects, where the number of record pairs produced may be very large, often running into the millions.
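
    A minimal sketch of the sampling idea follows, with hypothetical names throughout. Candidate record pairs are binned by comparison score, a fixed-size sample from each bin is clerically reviewed, and the observed match rate is scaled up to the whole bin. Bins below the acceptance cut-off contribute to the false-negative estimate, which is what makes recall estimable.

    import random

    def estimate_precision_recall(bins, cutoff, sample_size, review):
        """bins: dict mapping score -> non-empty list of record pairs in that band.
        review(pair) -> True if a clerical reviewer judges the pair a true match.
        Assumes at least one band at or above the cut-off and one true match."""
        est_tp = est_fp = est_fn = 0.0
        for score, pairs in bins.items():
            sample = random.sample(pairs, min(sample_size, len(pairs)))
            match_rate = sum(review(p) for p in sample) / len(sample)
            if score >= cutoff:                  # accepted links
                est_tp += match_rate * len(pairs)
                est_fp += (1 - match_rate) * len(pairs)
            else:                                # rejected pairs
                est_fn += match_rate * len(pairs)
        return est_tp / (est_tp + est_fp), est_tp / (est_tp + est_fn)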

    Identifying information from data using an organizational goals ontology: a case of the Australian economy / Tengku Adil Tengku Izhar and Torab Torabi.

    Organizational data is important in assisting domain experts and entrepreneurs with decision-making processes related to organizational goals. The trustworthiness of organizational data in relation to achieving those goals is often questioned because of the vast amount of data organizations hold. This paper proposes a methodology for evaluating organizational data in relation to organizational goals, helping organizations identify the relevant data among vast datasets for goal-related decision-making. The aim of this paper is to evaluate the level of organizational goal achievement. To achieve this aim, we identify the dependency relationships between organizational data and organizational goals, and based on these relationships we define metrics for determining which organizational data are relevant to goal achievement. A case study is presented to test the applicability of the methodology by measuring the level of the Australian economy. The results show the dependency relationship between the case study goal and its variables. The contribution of this paper is a new approach to evaluating the level of organizational goal achievement.

    Estimating parameters for probabilistic linkage of privacy-preserved datasets.

    Background: Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimated through investigations into data quality and manual inspection. However, as privacy-preserved datasets comprise encrypted data, such methods are not possible. In this paper, we present a method for estimating the probabilities and threshold values for probabilistic privacy-preserved record linkage using Bloom filters. Methods: Our method was tested through a simulation study using synthetic data, followed by an application using real-world administrative data. Synthetic datasets were generated with error rates ranging from zero to 20%. Our method was used to estimate parameters (probabilities and thresholds) for de-duplication linkages, with linkage quality determined by the F-measure. Each dataset was privacy-preserved using separate Bloom filters for each field. Match probabilities were estimated using the expectation-maximisation (EM) algorithm on the privacy-preserved data, and threshold cut-off values were determined by an extension to the EM algorithm that allows linkage quality to be estimated for each possible threshold. De-duplication linkages of each privacy-preserved dataset were performed using both estimated and calculated probabilities, and linkage quality using the F-measure at the estimated threshold values was compared to the highest achievable F-measure. Three large administrative datasets were used to demonstrate the applicability of the probability and threshold estimation technique on real-world data. Results: Linkage of the synthetic datasets using the estimated probabilities produced an F-measure comparable to that obtained using calculated probabilities, even with up to 20% error. Linkage of the administrative datasets using estimated probabilities produced an F-measure higher than that obtained using calculated probabilities. Further, the threshold estimation yielded F-measure results only slightly below the highest possible for those probabilities. Conclusions: The method appears highly accurate across a spectrum of datasets with varying degrees of error. As there are few alternatives for parameter estimation, the approach is a major step towards providing a complete operational approach for probabilistic linkage of privacy-preserved datasets.
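
    The sketch below illustrates the core of the parameter estimation step: expectation-maximisation over binary field-agreement patterns, as in the Fellegi-Sunter model. It simplifies the paper's setting by assuming each Bloom-filter field comparison has already been reduced to agree/disagree (for example, Dice similarity above a cut-off), and it does not reproduce the threshold-estimation extension; all names are assumptions.

    def em_match_probabilities(patterns, n_iter=50, p_match=0.1):
        """patterns: list of 0/1 tuples, one agreement value per field per pair."""
        n_fields = len(patterns[0])
        m = [0.9] * n_fields   # P(field agrees | records are a true match)
        u = [0.1] * n_fields   # P(field agrees | records are a non-match)
        for _ in range(n_iter):
            # E-step: posterior probability that each pair is a true match.
            weights = []
            for g in patterns:
                pm, pu = p_match, 1 - p_match
                for j, agree in enumerate(g):
                    pm *= m[j] if agree else 1 - m[j]
                    pu *= u[j] if agree else 1 - u[j]
                weights.append(pm / (pm + pu))
            # M-step: re-estimate m, u and the overall match proportion.
            total_w = sum(weights)
            p_match = total_w / len(patterns)
            for j in range(n_fields):
                m[j] = sum(w * g[j] for w, g in zip(weights, patterns)) / total_w
                u[j] = sum((1 - w) * g[j] for w, g in zip(weights, patterns)) / (len(patterns) - total_w)
        return m, u, p_match

    The estimated m and u probabilities then feed into the usual log-likelihood match weights from which threshold cut-offs are chosen.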

    Sixth Annual Users' Conference

    Conference papers and presentation outlines addressing the use of the Transportable Applications Executive (TAE) and its various application programs are compiled. Emphasis is given to the design of the user interface and the image processing workstation in general. Alternate ports of TAE and TAE subsystems are also covered.

    Sociodemographic differences in linkage error: An examination of four large-scale datasets

    Background: Record linkage is an important tool for epidemiologists and health planners. Record linkage studies will generally contain some level of residual linkage error, where individual records are either incorrectly marked as belonging to the same individual or incorrectly marked as belonging to separate individuals. A key question is whether these errors are distributed evenly throughout the population or whether certain subgroups exhibit higher rates of error. Previous investigations of this issue have typically compared linked and unlinked records, which can conflate bias caused by record linkage error with bias caused by missing records (data capture errors). Methods: Four large administrative datasets were individually de-duplicated and the results compared to an available 'gold-standard' benchmark, allowing us to avoid the methodological issues of comparing linked and unlinked records. Results were compared by gender, age, geographic remoteness (major cities, regional or remote) and socioeconomic status. Results: Results varied between datasets and by sociodemographic characteristic. The most consistent findings were worse linkage quality for younger individuals (seen in all four datasets) and for those living in remote areas (seen in three of the four datasets). The linkage quality within sociodemographic categories varied between datasets, with some associations with linkage error reversed across datasets owing to quirks of the specific data collection mechanisms and data sharing practices. Conclusions: These results suggest caution should be taken both when linking younger individuals and those in remote areas, and when analysing linked data from these subgroups. Further research is required to determine the ramifications of worse linkage quality in these subpopulations on research outcomes.
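
    A minimal sketch of this style of analysis follows, assuming a table with one row per record that carries both the linkage-assigned group id and a gold-standard person id alongside demographic columns; all column names are assumptions. A record counts as in error when the group it was linked into differs from its gold-standard group.

    import pandas as pd

    def error_rate_by_group(df, demographic_col):
        linked = df.groupby("linkage_id")["record_id"].apply(frozenset)
        truth = df.groupby("gold_id")["record_id"].apply(frozenset)
        in_error = df.apply(
            lambda r: linked[r["linkage_id"]] != truth[r["gold_id"]], axis=1)
        return in_error.groupby(df[demographic_col]).mean()

    df = pd.DataFrame({
        "record_id": [1, 2, 3, 4],
        "linkage_id": ["a", "a", "b", "c"],   # records 3 and 4 were not linked
        "gold_id": ["p1", "p1", "p2", "p2"],
        "age_band": ["0-24", "0-24", "0-24", "25+"],
    })
    print(error_rate_by_group(df, "age_band"))  # 0-24: 0.33..., 25+: 1.0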