Assessing record linkage between health care and Vital Statistics databases using deterministic methods
BACKGROUND: We assessed the linkage rate and the correct linkage rate achieved by deterministic record linkage among three commonly used Canadian databases: the population registry, hospital discharge data and the Vital Statistics registry. METHODS: Three combinations of four personal identifiers (surname, first name, sex and date of birth) were compared to determine the optimal combination. The correct linkage rate was assessed using a unique personal health number available in all three databases. RESULTS: Among the three combinations, surname, sex and date of birth together gave the highest linkage rates (88.0% and 93.1%) and the second highest correct linkage rates (96.9% and 98.9%) for linkage between the population registry and the Vital Statistics registry, and between the hospital discharge data and the Vital Statistics registry, respectively, in 2001. Adding the first name to these three identifiers increased the correct linkage rate by less than 1%, but at the cost of lowering the linkage rate by almost 10%. CONCLUSION: Our findings suggest that the combination of surname, sex and date of birth is optimal for deterministic linkage. The linkage and correct linkage rates appear to vary by age and the type of database, but not by sex.
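The optimal strategy described above amounts to an exact-match join on a composite key, with the shared personal health number used to audit correctness. A minimal sketch in Python, assuming pandas DataFrames with hypothetical columns surname, sex, dob and phn (not the study's actual schema):

```python
import pandas as pd

def deterministic_link(left: pd.DataFrame, right: pd.DataFrame,
                       keys=("surname", "sex", "dob")) -> pd.DataFrame:
    """Exact-match (deterministic) linkage on the chosen identifier combination."""
    return left.merge(right, on=list(keys), suffixes=("_l", "_r"))

def linkage_rates(left: pd.DataFrame, right: pd.DataFrame,
                  keys=("surname", "sex", "dob")) -> tuple[float, float]:
    """Return (linkage rate, correct linkage rate), the latter judged by
    agreement of the personal health number present in both sources."""
    links = deterministic_link(left, right, keys)
    linkage_rate = links["phn_l"].nunique() / left["phn"].nunique()
    correct_rate = (links["phn_l"] == links["phn_r"]).mean()
    return linkage_rate, correct_rate
```

Adding the first name simply means passing keys=("surname", "first_name", "sex", "dob"), which tightens matching (a slightly higher correct linkage rate) at the cost of more unlinked records, as the abstract reports.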
A review for clinical outcomes research: hypothesis generation, data strategy, and hypothesis-driven statistical analysis
In recent years, an increasing number of large, population-level databases have become available for clinical research. The size and complexity of these databases often present a methodological challenge for investigators. We propose that a “protocol” may facilitate the research process when using these databases. In addition, much as the structured History and Physical (H&P) helps the audience appreciate the details of a patient case more systematically, a formal outcomes research protocol can help in the systematic evaluation of an outcomes research manuscript.
Technical challenges of providing record linkage services for research
Background: Record linkage techniques are widely used to enable health researchers to gain event-based longitudinal information for entire populations. The task of record linkage is increasingly being undertaken by specialised linkage units (SLUs). In addition to the complexity of undertaking probabilistic record linkage, these units face additional technical challenges in providing record linkage ‘as a service’ for research. The extent of this functionality, and approaches to solving these issues, have received little attention in the record linkage literature. Few, if any, of the record linkage packages or systems currently used by SLUs include the full range of functions required. Methods: This paper identifies and discusses some of the functions that are required or undertaken by SLUs in the provision of record linkage services. These include managing routine, ongoing linkage; storing and handling changing data; handling different linkage scenarios; and accommodating ever-increasing datasets. Automated linkage processes are one way of ensuring consistency of results and scalability of service. Results: Alternative solutions to some of these challenges are presented. By maintaining a full history of links and storing pairwise information, many of the challenges around handling ‘open’ records and providing automated, managed extractions are solved. A number of these solutions were implemented as part of the development of the National Linkage System (NLS) by the Centre for Data Linkage (part of the Population Health Research Network) in Australia. Conclusions: The demand for, and complexity of, linkage services are growing. This presents a challenge to SLUs as they seek to service the varying needs of dozens of research projects annually. Linkage units need to be both flexible and scalable to meet this demand. It is hoped the solutions presented here can help mitigate these difficulties.
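One of the solutions mentioned above, retaining a full history of pairwise links so that 'open' records can be regrouped and managed extractions re-run automatically, can be illustrated with a small link store. This is a hypothetical sketch only, not the NLS schema:

```python
import sqlite3

def init_link_store(path: str = "links.db") -> sqlite3.Connection:
    """Create a pairwise link table that is appended to, never overwritten,
    so every historical linkage decision remains queryable."""
    con = sqlite3.connect(path)
    con.execute("""
        CREATE TABLE IF NOT EXISTS pairwise_link (
            record_a   TEXT NOT NULL,
            record_b   TEXT NOT NULL,
            weight     REAL NOT NULL,   -- match score from the linkage run
            run_id     TEXT NOT NULL,   -- which routine linkage produced the pair
            valid_from TEXT NOT NULL,   -- when the link became current
            valid_to   TEXT             -- NULL while the link is still current
        )""")
    return con

def current_links(con: sqlite3.Connection):
    """Pairs still considered valid; grouping pairs into persons (and hence
    project extractions) is rebuilt downstream from this table."""
    return con.execute(
        "SELECT record_a, record_b, weight FROM pairwise_link "
        "WHERE valid_to IS NULL"
    ).fetchall()
```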
Medical record linkage in health information systems by approximate string matching and clustering
BACKGROUND: The multiplication of data sources within heterogeneous healthcare information systems always results in redundant information split among multiple databases. Our objective is to detect exact and approximate duplicates within identity records, in order to attain a better quality of information and to permit cross-linkage among stand-alone and clustered databases. Furthermore, we need to assist human decision making by computing a value reflecting identity proximity. METHODS: The proposed method has three steps. The first is to standardise and index the elementary identity fields, using blocking variables, in order to speed up information analysis. The second is to match similar record pairs, relying on a global similarity value taken from the Porter-Jaro-Winkler algorithm. The third is to create clusters of coherent related records, using graph drawing, agglomerative clustering methods and partitioning methods. RESULTS: The batch analysis of 300,000 "supposedly" distinct identities isolates 240,000 true unique records, 24,000 duplicates (clusters composed of 2 records) and 3,000 clusters whose size is greater than or equal to 3 records. CONCLUSION: Duplicate-free databases, used in conjunction with relevant indexes and similarity values, allow immediate (i.e. real-time) proximity detection when inserting a new identity.
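The three steps map naturally onto a small pipeline: a blocking key, a field-averaged similarity, and a grouping pass. A minimal sketch, using difflib's ratio as a stand-in for the Porter-Jaro-Winkler measure and union-find as a stand-in for the graph-based clustering; the field names are assumptions:

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def block_key(record: dict) -> str:
    # Step 1: standardise and index on a crude blocking variable.
    return (record["surname"][:2] + record["dob"][:4]).upper()

def similarity(a: dict, b: dict) -> float:
    # Step 2: a global similarity value averaged over a few identity fields.
    fields = ("surname", "first_name", "dob")
    return sum(SequenceMatcher(None, a[f], b[f]).ratio() for f in fields) / len(fields)

def cluster(records: list[dict], threshold: float = 0.9) -> list[set[int]]:
    # Step 3: group records whose pairwise similarity clears the threshold.
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)
    for members in blocks.values():
        for i, j in combinations(members, 2):
            if similarity(records[i], records[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = defaultdict(set)
    for idx in range(len(records)):
        groups[find(idx)].add(idx)
    # Singletons are unique identities; larger sets are duplicate clusters.
    return list(groups.values())
```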
A Machine Learning Trainable Model to Assess the Accuracy of Probabilistic Record Linkage
Record linkage (RL) is the process of identifying and linking data that relate to the same physical entity across multiple heterogeneous data sources. Deterministic linkage methods rely on the presence of common, uniquely identifying attributes across all sources, while probabilistic approaches use non-unique attributes and calculate similarity indexes for pairwise comparisons. A key component of record linkage is accuracy assessment: the process of manually verifying and validating matched pairs to further refine linkage parameters and increase overall effectiveness. This process, however, is time-consuming and impractical when applied to large administrative data sources where millions of records must be linked. It is also potentially biased, as the gold standard used is often the reviewer’s intuition. In this paper, we present an approach for assessing and refining the accuracy of probabilistic linkage based on different supervised machine learning methods (decision trees, naïve Bayes, logistic regression, random forest, linear support vector machines and gradient boosted trees). We used data sets extracted from large Brazilian socioeconomic and public health care data sources. These models were evaluated using receiver operating characteristic plots, sensitivity, specificity and positive predictive values obtained from 10-fold cross-validation. Results show that logistic regression outperforms the other classifiers and enables the creation of a generalized, very accurate model to validate linkage results.
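The validation idea, training a classifier on similarity features of manually reviewed pairs and scoring it with 10-fold cross-validation, can be sketched with scikit-learn. A hedged example, not the authors' pipeline; the feature matrix and labels are assumed to come from a prior probabilistic linkage run:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

def evaluate_linkage_classifier(X: np.ndarray, y: np.ndarray) -> dict:
    """X: one row per candidate pair, columns are per-field similarity scores
    (e.g. name, date of birth, municipality); y: 1 for a verified true match."""
    clf = LogisticRegression(max_iter=1000)
    scores = cross_validate(
        clf, X, y, cv=10,
        scoring=("roc_auc", "recall", "precision"),  # recall ~ sensitivity, precision ~ PPV
    )
    return {name: values.mean() for name, values in scores.items()}
```

Swapping LogisticRegression for RandomForestClassifier, GaussianNB or GradientBoostingClassifier reproduces the kind of comparison across classifier families described above.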
Accuracy and completeness of patient pathways – the benefits of national data linkage in Australia
Background - The technical challenges associated with national data linkage, and the extent of cross-border population movements, are explored as part of a pioneering research project. The project involved linking state-based hospital admission records and death registrations across Australia for a national study of hospital-related deaths. Methods - The project linked over 44 million morbidity and mortality records from four Australian states between 1st July 1999 and 31st December 2009 using probabilistic methods. The accuracy of the linkage was measured through comparison with jurisdictional keys sourced from the individual states. The extent of cross-border population movement between these states was also assessed. Results - Data matching identified almost twelve million individuals across the four Australian states. The percentage of individuals from one state with records found in another ranged from 3% to 5%. Using jurisdictional keys to measure linkage quality, the results indicate high matching efficiency (F-measure 97 to 99%), with linkage processing taking only a matter of days. Conclusions - The results demonstrate the feasibility and accuracy of undertaking cross-jurisdictional linkage for national research. The benefits are substantial, particularly in relation to capturing the full complement of records in patient pathways that result from cross-border population movements. The project identified a sizeable ‘mobile’ population with hospital records in more than one state. Research studies that focus on a single jurisdiction will under-enumerate the extent of hospital usage by individuals in the population. It is important that researchers understand and are aware of the impact of this missing hospital activity on their studies. The project highlights the need for an efficient and accurate data linkage system to support national research across Australia.
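Once records are linked into persons, the cross-border figure quoted above is a simple aggregation: count the individuals whose records span more than one state. A minimal sketch, with hypothetical column names person_id and state:

```python
import pandas as pd

def cross_border_rate(records: pd.DataFrame) -> float:
    """Proportion of linked individuals with hospital or death records
    registered in more than one state."""
    states_per_person = records.groupby("person_id")["state"].nunique()
    return (states_per_person > 1).mean()
```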
Quality and complexity measures for data linkage and deduplication
Summary. Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when linking or deduplicating very large data sets. Different measures have been used to characterise the quality and complexity of data linkage algorithms, and several new metrics have been proposed. An overview of the issues involved in measuring data linkage and deduplication quality and complexity is presented in this chapter. It is shown that measures in the space of record pair comparisons can produce deceptive quality results. Various measures are discussed and recommendations are given on how to assess data linkage and deduplication quality and complexity. Key words: data or record linkage, data integration and matching, deduplication, data mining pre-processing, quality and complexity measures
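The central caution, that measures computed over the full space of record pair comparisons can be deceptive, follows from the size of that space: with n_a × n_b candidate pairs, true negatives dominate, so accuracy looks excellent even for a poor linkage. A small illustrative calculation (not taken from the chapter):

```python
def pair_space_measures(tp: int, fp: int, fn: int, n_a: int, n_b: int) -> dict:
    """Quality measures over the full comparison space of two files of size n_a and n_b."""
    total_pairs = n_a * n_b
    tn = total_pairs - tp - fp - fn           # true negatives dwarf everything else
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total_pairs,  # inflated by the huge tn count
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }

# Example: 10,000 x 10,000 records, 8,000 true matches found, 2,000 missed, 1,000 false matches.
print(pair_space_measures(tp=8000, fp=1000, fn=2000, n_a=10_000, n_b=10_000))
# accuracy comes out near 0.99997 while the F-measure is only about 0.84.
```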
How good is probabilistic record linkage to reconstruct reproductive histories? Results from the Aberdeen children of the 1950s study
BACKGROUND: Probabilistic record linkage is widely used in epidemiology, but studies of its validity are rare. Our aim was to validate its use to identify births to a cohort of women drawn from a large cohort of people born in Scotland in the early 1950s. METHODS: The Children of the 1950s cohort includes 5868 females born in Aberdeen in 1950–56 who were in primary schools in the city in 1962. In 2001 a postal questionnaire was sent to the cohort members resident in the UK requesting information on offspring. Probabilistic record linkage (based on surname, maiden name, initials, date of birth and postcode) was used to link the females in the cohort to birth records held by the Scottish Maternity Record System (SMR 2). RESULTS: We attempted to mail a total of 5540 women; 3752 (68%) returned a completed questionnaire, and of these 86% reported having had at least one birth. Linkage to SMR 2 was attempted for 5634 women, and one or more maternity records were found for 3743. There were 2604 women who reported at least one birth in the questionnaire and who were linked to one or more SMR 2 records. When judged against the questionnaire information, the linkage correctly identified 4930 births and missed 601 others; the missed births mostly occurred outside Scotland (147) or prior to full coverage by SMR 2 (454). There were 134 births incorrectly linked to SMR 2. CONCLUSION: Probabilistic record linkage of routine maternity records to a population-based cohort, using name, date of birth and place of residence, can have high specificity, and as such may be reliably used in epidemiological research.
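From the counts reported above, with the questionnaire treated as the reference standard, the headline figures follow directly; a sketch of the arithmetic:

```python
# Counts reported in the abstract: births correctly linked, missed, and incorrectly linked.
true_links, missed, false_links = 4930, 601, 134

sensitivity = true_links / (true_links + missed)       # ~0.89
ppv = true_links / (true_links + false_links)          # ~0.97 (positive predictive value)
print(f"sensitivity = {sensitivity:.3f}, PPV = {ppv:.3f}")
```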
A retrospective population-based study of childhood hospital admissions with record linkage to a birth defects registry
Background: Using population-based linked records of births, deaths, birth defects and hospital admissions for children born 1980–1999 enables profiles of hospital morbidity to be created for each child. Methods: This is an analysis of a state-based registry of birth defects linked to population-based hospital admission data. Transfers and readmissions within one day could be taken into account and treated as one episode of care for the purposes of analyses (N = 485,446 children; 742,845 non-birth admissions). Results: Children born in Western Australia from 1980–1999 with a major birth defect comprised 4.6% of live births but 12.0% of non-birth hospital admissions from 1980–2000. On average, the children with a major birth defect remained in hospital longer than the children in the comparison group for the same diagnosis. The mean and median lengths of stay (LOS) for admissions before the age of 5 years have decreased for all children since 1980. However, the mean number of admissions per child admitted has remained constant at around 3.8 admissions for children with a major birth defect and 2.2 admissions for all other children. Conclusion: To gain a true picture of the burden of hospital-based morbidity in childhood, admission records need to be linked for each child. We have been able to do this at a population level using birth defect cases ascertained by a birth defects registry. Our results showed a greater mean LOS and mean number of admissions per child admitted than previous studies. The results suggest there may be an opportunity for the children with a major birth defect to be monitored and seen earlier in the primary care setting for common childhood illnesses to avoid hospitalisation or reduce the LOS.
Smc5/6 coordinates formation and resolution of joint molecules with chromosome morphology to ensure meiotic divisions
During meiosis, Structural Maintenance of Chromosomes (SMC) complexes underpin two of its fundamental features: homologous recombination and chromosome segregation. While the meiotic functions of the cohesin and condensin complexes have been delineated, the role of the third SMC complex, Smc5/6, remains enigmatic. Here we identify specific, essential meiotic functions for the Smc5/6 complex in homologous recombination and the regulation of cohesin. We show that Smc5/6 is enriched at centromeres and cohesin-association sites, where it regulates sister-chromatid cohesion and the timely removal of cohesin from chromosomal arms, respectively. Smc5/6 also localizes to recombination hotspots, where it promotes the normal formation and resolution of a subset of joint-molecule intermediates. In this regard, Smc5/6 functions independently of the major crossover pathway defined by the MutLγ complex. Furthermore, we show that Smc5/6 is required for stable chromosomal localization of the XPF-family endonuclease Mus81-Mms4/Eme1. Our data suggest that the Smc5/6 complex is required for specific recombination and chromosomal processes throughout meiosis and that in its absence, attempts at cell division with unresolved joint molecules and residual cohesin lead to severe recombination-induced meiotic catastrophe.