22 research outputs found

    The effect of data cleaning on record linkage quality

    Background: Within the field of record linkage, numerous data cleaning and standardisation techniques are employed to ensure the highest quality of links. While these facilities are common in record linkage software packages and are regularly deployed across record linkage units, little work has been published demonstrating the impact of data cleaning on linkage quality. Methods: A range of cleaning techniques was applied to both a synthetically generated dataset and a large administrative dataset previously linked to a high standard. The effect of these changes on linkage quality was investigated using pairwise F-measure to determine quality. Results: Data cleaning made little difference to the overall linkage quality, with heavy cleaning leading to a decrease in quality. Further examination showed that decreases in linkage quality were due to cleaning techniques typically reducing the variability – although correct records were now more likely to match, incorrect records were also more likely to match, and these incorrect matches outweighed the correct matches, reducing quality overall. Conclusions: Data cleaning techniques have minimal effect on linkage quality. Care should be taken during the data cleaning process
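The pairwise F-measure named in this abstract can be illustrated with a minimal sketch. It assumes that both the reference linkage and the linkage under test are given as groups of record identifiers, that every pair of records within a group counts as a link, and that the F-measure is the harmonic mean of pairwise precision and recall. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def pairs_from_clusters(clusters):
    """Return the set of unordered record pairs implied by a grouping."""
    pairs = set()
    for cluster in clusters:
        # Sorting makes (a, b) and (b, a) collapse to a single pair.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f_measure(true_clusters, predicted_clusters):
    """Harmonic mean of pairwise precision and recall (assumed definition)."""
    true_pairs = pairs_from_clusters(true_clusters)
    pred_pairs = pairs_from_clusters(predicted_clusters)
    true_positives = len(true_pairs & pred_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_pairs)
    recall = true_positives / len(true_pairs)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: record 3 is linked to the wrong group.
reference = [{1, 2, 3}, {4, 5}]
candidate = [{1, 2}, {3, 4, 5}]
score = pairwise_f_measure(reference, candidate)
```

In this toy case two of the four predicted pairs are correct and two of the four true pairs are found, so precision, recall and the F-measure are all 0.5. The example also shows the trade-off the abstract describes: merging more records raises the number of correct pairs found but can add even more incorrect pairs.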

    Automated Cryptanalysis of Bloom Filter Encryptions of Health Records

    Privacy-preserving record linkage with Bloom filters has become increasingly popular in medical applications, since Bloom filters allow for probabilistic linkage of sensitive personal data. However, since evidence indicates that Bloom filters are not secure enough where strong security guarantees are required, several suggestions for their improvement have been made in the literature. One of those improvements proposes the storage of several identifiers in one single Bloom filter. In this paper we present an automated cryptanalysis of this Bloom filter variant. The three steps of this procedure constitute our main contributions: (1) a new method for the detection of Bloom filter encryptions of bigrams (so-called atoms), (2) the use of an optimization algorithm for the assignment of atoms to bigrams, (3) the reconstruction of the original attribute values by linkage against bigram sets obtained from lists of frequent attribute values in the underlying population. To sum up, our attack provides the first convincing attack on Bloom filter encryptions of records built from more than one identifier. Comment: Contribution to the 8th International Conference on Health Informatics, Lisbon 201
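For readers unfamiliar with the encoding under attack, the following is a minimal sketch of how several identifiers might be stored as bigrams in one Bloom filter and compared. Everything specific here is an assumption for illustration: the padding character, the filter size, the number of hash functions, and the use of unkeyed SHA-256 (practical schemes typically use keyed hashes). It does not reproduce the paper's encoding or its attack.

```python
import hashlib


def bigrams(value):
    """Split a padded, lower-cased string into its set of bigrams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(values, size=64, num_hashes=2):
    """Set bits for every bigram of every identifier in one shared filter.

    size and num_hashes are illustrative assumptions, not recommended values.
    """
    bits = [0] * size
    for value in values:
        for gram in bigrams(value):
            for k in range(num_hashes):
                # Unkeyed hash for illustration only.
                digest = hashlib.sha256(f"{k}:{gram}".encode("utf-8")).hexdigest()
                bits[int(digest, 16) % size] = 1
    return bits


def dice(a, b):
    """Dice coefficient of two bit vectors, a common similarity measure."""
    common = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * common / total if total else 0.0


# Hypothetical records: first name and surname share one filter.
record_a = bloom_encode(["john", "smith"])
record_b = bloom_encode(["jon", "smith"])
similarity = dice(record_a, record_b)
```

Because each bigram always sets the same bit positions, similar strings yield similar filters, which is what makes probabilistic linkage possible. The same determinism is what allows frequent bit patterns to be associated with frequent bigrams.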

    Record linkage under suboptimal conditions for data-intensive evaluation of primary care in Rio de Janeiro, Brazil

    Background Linking Brazilian databases demands the development of algorithms and processes to deal with various challenges including the large size of the databases, the low number and poor quality of personal identifiers available to be compared (national security number not mandatory), and some characteristics of Brazilian names that make the linkage process prone to errors. This study aims to describe and evaluate the quality of the processes used to create an individual-linked database for data-intensive research on the impacts on health indicators of the expansion of primary care in Rio de Janeiro City, Brazil. Methods We created an individual-level dataset linking social benefits recipients, primary health care, hospital admission and mortality data. The databases were pre-processed, and we adopted a multiple-approach strategy combining deterministic and probabilistic record linkage techniques, and an extensive clerical review of the potential matches. Relying on manual review as the gold standard, we estimated the false match (false-positive) proportion of each approach (deterministic, probabilistic, clerical review) and the missed match (false-negative) proportion of the clerical review approach. To assess the sensitivity (recall) of identifying social benefits recipients' deaths, we used their vital status registered on the primary care database as the gold standard. Results In all linkage processes, the deterministic approach identified most of the matches. However, the proportion of matches identified in each approach varied. The false match proportion was around 1% or less in almost all approaches. The missed match proportion in the clerical review approach was under 3% in all linkage processes. We estimated a recall of 93.6% (95% CI 92.8–94.3) for the linkage between social benefits recipients and mortality data. Conclusion The adoption of a linkage strategy combining pre-processing routines, deterministic and probabilistic strategies, as well as an extensive clerical review approach minimized linkage errors in the context of suboptimal data quality

    Probabilistic integration of large Brazilian socioeconomic and clinical databases

    The integration of disparate large and heterogeneous socioeconomic and clinical databases is considered essential to capture and model longitudinal and social aspects of diseases. However, such integration is challenging: databases are stored in disparate locations, make use of different identifiers, have variable data quality, record information in bespoke purpose-specific formats and have different levels of metadata. Novel computational methods are required to integrate them and enable their statistical analyses for epidemiological research purposes. In this paper, we describe a probabilistic approach for constructing a very large population-based cohort comprised of 114 million individuals using linkages between clinical databases from the National Health System and administrative databases from governmental social programmes. We present our data integration model for creating data marts (epidemiological data) and discuss our evaluation results in controlled and uncontrolled scenarios, which demonstrate that our model and tools achieve high accuracy (minimum of 91%) in different probabilistic data integration scenarios

    Awareness of automated external defibrillators in the community: a local study

    Automated external defibrillators (AEDs) are an important part of the chain of survival and provide a valuable first response to cardiac arrest. The National Defibrillator Programme instigated the installation of AEDs across England, but there is a need for greater local evidence concerning their installation. The aim of this study was to investigate the current status of AED provision within a single district in a county located in southwest UK. A mixed-methods study was undertaken including a quantitative survey and qualitative interviews. In total, 182 surveys were completed and seven interviews were undertaken with participants representing local organisations. Less than one third of organisations had installed AEDs and people were not clear about where the nearest AED was situated. Further awareness must be raised in order to develop public knowledge and confidence concerning the location, role and use of community AEDs

    The quality of record linkage between population-based birth and children’s early child development and school test result (NAPLAN) records in New South Wales, Australia

    This study aimed to describe the utility of probabilistic record linkage of development and school performance data to a large population-based birth cohort and other administrative health datasets, and to assess whether any systematic differences exist between the records that did and did not link in each datase