22 research outputs found

    Finding Maternal Siblings in Birth Registration Data to form a Pregnancy Spine – Data Linkage & Graph Based Methods for Unknown Cluster Sizes

    Introduction: We have developed an innovative methodology to link maternal siblings within 2000-2005 England and Wales Birth Registration data to form a Pregnancy Spine, a unification of all births to each unique mother. Key challenges in this many-to-many linkage scenario are blocking (reducing the number of record pair comparisons) and cluster resolution. Objectives and Approach: Probabilistic data linkage (Python) was followed by generation of clusters (using igraph in R) and graph theory community detection techniques. To optimise geographical blocking and increase accuracy, we incorporated Internal Migration data to map the likely geographic movement of mothers between births. Maternal sibling clusters were modelled as a graph, and the structure of clusters was optimised using community detection methods to link, split and evaluate sibling groups. We also incorporated childhood statistics data relating to child date of birth to evaluate the likely accuracy of sibling pairs and remove false edges (links). Results: Our development has resulted in a new blocking method and a new cluster resolution method. In addition, we developed new ways to assess and measure the accuracy of sibling groups, beyond traditional classifier metrics, and to infer error rates. For quality assurance, we applied our method to Registration Data used in earlier studies. Using this, and by comparing against other statistics on maternal sibling composition, we will present results showing that a high degree of accuracy was obtained for precision, recall and the new evaluation metrics. Conclusion/Implications: These methods will improve other linkage projects with unknown cluster sizes, whether de-duplicating datasets, linking multiple datasets, or incorporating data from a longer time period through longitudinal linkage. Researchers can now append and link other data sources to this Spine to answer questions about maternal and child health outcomes.
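
The cluster-formation step described in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract names igraph in R and community detection, whereas this sketch uses plain connected components in Python, which is a simpler stand-in. The function name, the score threshold and the 270-day minimum gap between non-multiple births are all hypothetical values chosen for illustration.

```python
from collections import defaultdict
from datetime import date


def sibling_clusters(births, scored_pairs, threshold=0.9, min_gap_days=270):
    """Group birth records into candidate maternal-sibling clusters.

    births: {record_id: date of birth}
    scored_pairs: iterable of (record_a, record_b, link_score)
    threshold and min_gap_days are illustrative assumptions only.
    """
    parent = {rec: rec for rec in births}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score < threshold:
            continue  # weak link: not accepted as an edge
        gap = abs((births[a] - births[b]).days)
        # Gaps of 0 or 1 day are allowed (multiple births); any other gap
        # shorter than min_gap_days is treated as a false edge and dropped.
        if 1 < gap < min_gap_days:
            continue
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for rec in births:
        clusters[find(rec)].add(rec)
    return sorted(sorted(members) for members in clusters.values())
```

A real pipeline would replace the connected-components step with a community detection algorithm, so that large clusters held together by a single weak edge can be split.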

    Finding Maternal Siblings in Birth Registration Data to form a Pregnancy Spine – Data Linkage & Graph Based Methods for Unknown Cluster Sizes

    We have developed an innovative methodology to link maternal siblings within 2000-2005 England and Wales Birth Registration data to form a Pregnancy Spine, a unification of all births to each unique mother. Key challenges were blocking and cluster resolution. To optimise geographic blocking, Internal Migration data were incorporated to map the likely geographic movement of mothers between births. Following probabilistic linkage, sibling clusters were modelled as a graph and their structure optimised using community detection methods. Childhood statistics data relating to child date of birth were incorporated to evaluate accuracy and remove false links. Our development has resulted in a new blocking method and a new cluster resolution method. We developed new ways to assess sibling group accuracy, beyond traditional classifier metrics, and to infer error rates. For quality assurance, we applied our method to Registration Data used in earlier studies. Using this, and other maternal sibling composition statistics, we present results showing that a high degree of accuracy was obtained for standard and new evaluation metrics. These methods will improve other linkage projects involving unknown cluster sizes, multiple datasets, or longitudinal linkage over longer time periods. Researchers can append and link other data sources to this Spine to answer questions about maternal and child health outcomes.
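
One way to picture migration-aware geographic blocking is the sketch below. It is an assumption-laden illustration and not the published blocking method: records are compared only if they share an area code or if their two areas are connected by a known migration flow. The function name, the area codes and the representation of migration flows as pairs of areas are hypothetical.

```python
from collections import defaultdict
from itertools import combinations, product


def candidate_pairs(records, migration_links):
    """Return record pairs worth comparing under geographic blocking.

    records: {record_id: area_code}
    migration_links: iterable of (area_a, area_b) pairs between which
        mothers are assumed likely to move (hypothetical input format).
    """
    by_area = defaultdict(list)
    for rec_id, area in records.items():
        by_area[area].append(rec_id)

    pairs = set()
    # Standard blocking: compare records registered in the same area.
    for ids in by_area.values():
        pairs.update(tuple(sorted(p)) for p in combinations(ids, 2))
    # Extension: also compare records across areas linked by migration.
    for area_a, area_b in migration_links:
        if area_a == area_b:
            continue  # same-area pairs are already covered above
        for p in product(by_area.get(area_a, []), by_area.get(area_b, [])):
            pairs.add(tuple(sorted(p)))
    return pairs
```

With four records in areas A, A, B and C and a single migration link between A and B, this yields three candidate pairs instead of the six produced by comparing every record with every other.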

    COVID-19 transmission and infection: linkage of COVID-19 Infection Survey, Test and Trace, and Patient Demographics Survey.

    Objectives: Data linkage was conducted between the Office for National Statistics' Covid Infection Survey (CIS), the Department of Health and Social Care's Test and Trace (T&T) and the NHS Personal Demographics Service (PDS) datasets. Linked data were required to provide reliable estimates of rates of COVID-19 transmission and infection, used to inform policy regarding the ongoing pandemic. Approach: The CIS was created to track infection rates in the UK population. Linking CIS participants to positive tests in T&T helped improve these estimates. Linkage to PDS was required to attach NHS numbers to these datasets, facilitating further linkages that could also be used to inform Government about the spread of the virus. Multiple approaches were used to link the data. Initially, T&T was linked to itself via a series of strict matchkeys to cluster records belonging to the same individual and so create a person-level identifier. Subsequent linkage of CIS-PDS, T&T-PDS and CIS-T&T involved deterministic linkages with matchkeys designed and applied independently. A probabilistic (Fellegi-Sunter scoring) method was used to link CIS-PDS and CIS-T&T. Additionally, associative links were created between CIS and T&T records that had matched to the same PDS record but had not matched to each other. Results: The accuracy of the CIS-PDS and CIS-T&T linkages was high (recall and precision >98%; lower bounds of all 95% confidence intervals >93%). A quality assessment of T&T-PDS is underway, as are relevant bias analyses. Conclusion: As a result of this linkage, COVID-19 analysts have access to enriched, linked datasets in which previously separate variables can be compared, with confidence that the linkage method met the required quality standards. The linked data have been used to provide crucial evidence to Government on infection and re-infection rates. Subsequent linkages have enabled analysts to explore risk factors associated with different variants of the virus, vaccination status and hospital episodes. Improvements continue to be made.
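
For readers unfamiliar with Fellegi-Sunter scoring, the following sketch shows the standard match-weight calculation for a single record pair. The field names and the m- and u-probabilities are hypothetical; the abstract does not report the parameters or fields actually used.

```python
import math


def match_weight(agreements, m_probs, u_probs):
    """Fellegi-Sunter match weight for one record pair.

    agreements: {field: True if the two records agree on that field}
    m_probs: {field: P(agree | records are a true match)}
    u_probs: {field: P(agree | records are not a match)}
    All probabilities here are illustrative assumptions.
    """
    total = 0.0
    for field, agrees in agreements.items():
        m, u = m_probs[field], u_probs[field]
        if agrees:
            total += math.log2(m / u)  # agreement adds evidence for a match
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts
    return total


# Hypothetical parameters for two linkage fields.
m = {"surname": 0.90, "dob": 0.95}
u = {"surname": 0.01, "dob": 0.005}
both_agree = match_weight({"surname": True, "dob": True}, m, u)
dob_only = match_weight({"surname": False, "dob": True}, m, u)
```

Pairs whose total weight exceeds a chosen upper threshold are accepted as links, those below a lower threshold are rejected, and those in between are typically sent for clerical review.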

    Sampling procedures for assessing accuracy of record linkage

    The use of administrative datasets as a data source in official statistics has become much more common, driven by the need to produce more outputs more efficiently. Many outputs rely on linkage between two or more datasets, and this is often undertaken in a number of phases with different methods and rules. In these situations we would like to be able to assess the quality of the linkage, which involves some re-assessment of both links and non-links. In this paper we discuss sampling approaches to obtain estimates of false negatives and false positives with reasonable control of both the accuracy of the estimates and the cost. Approaches to stratifying the links and non-links to be sampled are evaluated using information from the 2011 England and Wales population census.
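
A stratified sample of links can be turned into an overall false-positive estimate as sketched below. This uses the textbook estimator for a proportion under stratified simple random sampling; the abstract does not state which estimator or stratification the paper adopts, and the strata and counts in the example are invented for illustration.

```python
import math


def stratified_error_rate(strata):
    """Estimate an overall error rate from a stratified clerical sample.

    strata: list of (N_h, n_h, errors_h), where N_h is the number of links
        in stratum h, n_h the number sampled and checked (n_h >= 2), and
        errors_h the number of sampled links found to be wrong.
    Returns (estimate, standard_error).
    """
    total = sum(N for N, _, _ in strata)
    estimate = 0.0
    variance = 0.0
    for N, n, errors in strata:
        weight = N / total  # share of all links in this stratum
        p = errors / n      # sample error rate within the stratum
        estimate += weight * p
        fpc = 1 - n / N     # finite population correction
        variance += weight**2 * fpc * p * (1 - p) / (n - 1)
    return estimate, math.sqrt(variance)


# Hypothetical strata: 9,000 high-score links and 1,000 borderline links,
# with 100 links checked in each.
rate, se = stratified_error_rate([(9000, 100, 1), (1000, 100, 20)])
```

Sampling the borderline stratum at a higher rate than its share of links concentrates checking effort where errors are most likely, which is the cost-versus-accuracy trade-off the abstract refers to.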