10 research outputs found

    A Spark-based workflow for probabilistic record linkage of healthcare data *

    ABSTRACT Several areas, such as science, economics, finance, business intelligence, health, and others are exploring big data as a way to produce new information, make better decisions, and advance their related technologies and systems. Specifically in health, big data represents a challenging problem due to the poor quality of data in some circumstances and the need to retrieve, aggregate, and process a huge amount of data from disparate databases. In this work, we focused on the Brazilian Public Health System and on large databases from the Ministry of Health and the Ministry of Social Development and Hunger Alleviation. We present our Spark-based approach to data processing and probabilistic record linkage of such databases in order to produce very accurate data marts. These data marts are used by statisticians and epidemiologists to assess the effectiveness of conditional cash transfer programs for poor families with respect to the occurrence of some diseases (tuberculosis, leprosy, and AIDS). The case study we conducted as a proof of concept shows good performance with accurate results. For comparison, we also discuss an OpenMP-based implementation.

    Examining the quality of record linkage process using nationwide Brazilian administrative databases to build a large birth cohort.

    BACKGROUND: Research using linked routine population-based data collected for non-research purposes has increased in recent years because they are a rich and detailed source of information. The objective of this study is to present an approach to prepare and link data from administrative sources in a middle-income country, to estimate its quality and to identify potential sources of bias by comparing linked and non-linked individuals. METHODS: We linked two administrative datasets from Brazil with data covering the period 2001 to 2015, using maternal attributes (name, age, date of birth, and municipality of residence): the live birth information system and the 100 Million Brazilian Cohort (created using administrative records from over 114 million individuals whose families applied for social assistance via the Unified Register for Social Programmes), using CIDACS-RL, a linkage tool developed in house. We then estimated the proportion of highly probable links and examined the characteristics of missed matches to identify any potential source of bias. RESULTS: A total of 27,699,891 live births were submitted to linkage with maternal information recorded in the baseline of the 100 Million Brazilian Cohort dataset; of those, 16,447,414 (59.4%) children were found registered in the 100 Million Brazilian Cohort dataset. The proportion of highly probable links ranged from 39.3% in 2001 to 82.1% in 2014. A substantial improvement in the linkage was observed after the introduction of the maternal date of birth attribute in 2011. Our analyses indicated a slightly higher proportion of missing data among missed matches and a higher proportion of people living in an urban area and self-declared as Caucasian among linked pairs when compared with non-linked sets. 
DISCUSSION: We demonstrated that CIDACS-RL is capable of performing high quality linkage even with a limited number of common attributes, using indexation as a blocking strategy in large routine databases from a middle-income country. However, residual records occurred more among people under worse living conditions. The results presented in this study reinforce the need to evaluate linkage quality and, when necessary, to take linkage error into account in the analyses of any generated dataset.

    CIDACS-RL: a novel indexing search and scoring-based record linkage system for huge datasets with high accuracy and scalability

    Background: Record linkage is the process of identifying and combining records about the same individual from two or more different datasets. While there are many open source and commercial data linkage tools, the volume and complexity of currently available datasets for linkage pose a huge challenge; hence, designing an efficient linkage tool with reasonable accuracy and scalability is required. Methods: We developed CIDACS-RL (Centre for Data and Knowledge Integration for Health – Record Linkage), a novel iterative deterministic record linkage algorithm based on a combination of indexing search and scoring algorithms (provided by Apache Lucene). We described how the algorithm works and compared its performance with four open source linkage tools (AtyImo, Febrl, FRIL and RecLink) in terms of sensitivity and positive predictive value using a gold standard dataset. We also evaluated its accuracy and scalability using a case study, and its scalability and execution time using a simulated cohort in serial (single core) and multi-core (eight cores) computation settings. Results: Overall, the CIDACS-RL algorithm had superior performance: positive predictive value (99.93% versus AtyImo 99.30%, RecLink 99.5%, Febrl 98.86%, and FRIL 96.17%) and sensitivity (99.87% versus AtyImo 98.91%, RecLink 73.75%, Febrl 90.58%, and FRIL 74.66%). In the case study, using a ROC curve to choose the most appropriate cut-off value (0.896), the obtained metrics were: sensitivity = 92.5% (95% CI 92.07–92.99), specificity = 93.5% (95% CI 93.08–93.8) and area under the curve (AUC) = 97% (95% CI 96.97–97.35). The multi-core computation was about four times faster (150 seconds) than the serial setting (550 seconds) when using a dataset of 20 million records. Conclusion: The CIDACS-RL algorithm is an innovative linkage tool for huge datasets, with higher accuracy, improved scalability, and substantially shorter execution time compared to other existing linkage tools. 
In addition, CIDACS-RL can be deployed on standard computers without the need for high-speed processors and distributed infrastructures
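The accuracy measures reported in this abstract (sensitivity, specificity, positive predictive value, and a score cut-off chosen from a ROC curve) can be illustrated with a minimal sketch. It assumes each candidate pair has a similarity score and a gold-standard label. The function names and the use of Youden's J statistic to pick the cut-off are assumptions made for illustration; the abstract does not state which criterion the authors applied.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    sensitivity: float
    specificity: float
    ppv: float


def evaluate(scores, truths, cutoff):
    """Classify pairs as links when score >= cutoff and compare with the gold standard."""
    pairs = list(zip(scores, truths))
    tp = sum(1 for s, t in pairs if s >= cutoff and t)
    fp = sum(1 for s, t in pairs if s >= cutoff and not t)
    fn = sum(1 for s, t in pairs if s < cutoff and t)
    tn = sum(1 for s, t in pairs if s < cutoff and not t)

    def ratio(num, den):
        return num / den if den else float("nan")

    return Metrics(
        sensitivity=ratio(tp, tp + fn),
        specificity=ratio(tn, tn + fp),
        ppv=ratio(tp, tp + fp),
    )


def best_cutoff(scores, truths):
    """Pick the observed score maximising Youden's J (an assumed criterion).

    Requires at least one true match and one true non-match in `truths`.
    """
    def youden(cutoff):
        m = evaluate(scores, truths, cutoff)
        return m.sensitivity + m.specificity - 1.0

    return max(sorted(set(scores)), key=youden)
```

With a labelled sample, `best_cutoff(scores, truths)` returns one of the observed scores, and `evaluate(scores, truths, cutoff)` reports the three measures at that cut-off.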

    National data linkage assessment of live births and deaths in Mexico: Estimating under-five mortality rate ratios for vulnerable newborns and trends from 2008 to 2019

    BACKGROUND: Linked datasets that enable longitudinal assessments are scarce in low and middle-income countries. OBJECTIVES: We aimed to assess the linkage of administrative databases of live births and under-five child deaths to explore mortality and trends for preterm, small (SGA) and large for gestational age (LGA) newborns in Mexico. METHODS: We linked individual-level datasets collected by National statistics from 2008 to 2019. Linkage was performed based on agreement on birthday, sex, and residential address. We used the Centre for Data and Knowledge Integration for Health software to identify the best candidate pairs based on similarity. Accuracy was assessed by calculating the area under the receiver operating characteristic curve. We evaluated completeness by comparing the number of linked records with reported deaths. We described the percentage of linked records by baseline characteristics to identify potential bias. Using the linked dataset, we calculated mortality rate ratios (RR) in neonates, infants, and children under five according to gestational age, birthweight, and size. RESULTS: For the period 2008-2019, a total of 24,955,172 live births and 321,165 under-five deaths were available for linkage. We excluded 1,539,046 records (6.2%) with missing or implausible values. We successfully linked 231,765 deaths (72.2%; range 57.1% in 2009 to 84.3% in 2011). The rate of neonatal mortality was higher for preterm compared with term (RR 3.83, 95% confidence interval [CI] 3.78, 3.88) and for SGA compared with appropriate for gestational age (AGA) (RR 1.22, 95% CI 1.19, 1.24). Births at <28 weeks had the highest mortality (RR 35.92, 95% CI 34.97, 36.88). LGA had no additional risk vs AGA among children under five (RR 0.92, 95% CI 0.90, 0.93). CONCLUSIONS: We demonstrated the utility of linked data to understand neonatal vulnerability and child mortality. We created a linked dataset that would be a valuable resource for future population-based research.
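As a rough illustration of how a ratio with a 95% confidence interval such as those reported above can be obtained from linked counts, the sketch below computes a ratio of cumulative mortality between two groups with a Wald interval on the log scale. This is an assumption made for illustration: the abstract does not describe the estimator the authors used, the function name `risk_ratio_ci` is hypothetical, and any counts passed to it would be invented inputs, not the study's data.

```python
import math


def risk_ratio_ci(deaths_exposed, n_exposed, deaths_unexposed, n_unexposed, z=1.96):
    """Ratio of cumulative mortality between two groups with a log-scale Wald CI.

    All four counts must be positive. z=1.96 gives an approximate 95% interval.
    """
    risk_exposed = deaths_exposed / n_exposed
    risk_unexposed = deaths_unexposed / n_unexposed
    rr = risk_exposed / risk_unexposed

    # Standard error of log(RR) for two independent binomial proportions.
    se_log_rr = math.sqrt(
        1 / deaths_exposed - 1 / n_exposed
        + 1 / deaths_unexposed - 1 / n_unexposed
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper
```

For person-time denominators, the usual Poisson form of the standard error, `sqrt(1/deaths_exposed + 1/deaths_unexposed)`, would replace the binomial one.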

    Cohort profile: the 100 million Brazilian cohort

    The creation of The 100 Million Brazilian Cohort was motivated by the availability of high quality but dispersed social and health databases in Brazil and the need to integrate data and evaluate the impact of policies aiming to improve the social determinants of health (e.g. social protection policies) on health outcomes, overall and in subgroups of interest in a dynamic cohort. • The baseline of The 100 Million Brazilian Cohort comprises 131 697 800 low-income individuals in 35 358 415 families from 2011 to 2018. The Cohort population is mostly composed of children and young adults, with a higher proportion of females than the general Brazilian population, who identify themselves as Brown and live in the urban area of the country. • Exposure to social protection and the follow-up of individuals are obtained through: (i) deterministic linkage using the Social Identification Number (NIS) to link the Cohort baseline to social protection programmes and to periodically renewed socioeconomic information in Cadastro Único datasets; and/or (ii) non-deterministic linkage using the CIDACS-RL non-deterministic linkage tool, to link the Cohort baseline to administrative health care datasets such as mortality (Mortality Information System, SIM), disease notification (Information System for Notifiable Diseases, SINAN), birth information (Live Birth Information System, SINASC) and nutrition status (Food and Nutrition Surveillance System, SISVAN). • So far, studies have used The 100 Million Brazilian Cohort to investigate the socioeconomic and demographic determinants of leprosy, leprosy treatment outcomes and low birthweight and to evaluate the impact of the Bolsa Familia Programme (BFP) on leprosy and child mortality. Other studies are now being conducted that are of utmost relevance to the health inequalities of Brazil and many low- and middle-income countries, and many research opportunities are being opened up with the linkage of a range of health outcomes.

    Administrative Data Linkage in Brazil: Potentials for Health Technology Assessment.

    Health technology assessment (HTA) is the systematic evaluation of the properties and impacts of health technologies and interventions. In this article, we presented a discussion of HTA and its evolution in Brazil, as well as a description of secondary data sources available in Brazil with potential applications to generate evidence for HTA and policy decisions. Furthermore, we highlighted record linkage, ongoing record linkage initiatives in Brazil, and the main linkage tools developed and/or used in Brazilian data. Finally, we discussed the challenges and opportunities of using secondary data for research in the Brazilian context. In conclusion, we emphasized the availability of high quality data and an open, modern attitude toward the use of data for research and policy. This is supported by a rigorous but enabling legal framework that will allow the conduct of large-scale observational studies to evaluate clinical, economic, and social impacts of health technologies and social policies.

    Design and evaluation of probabilistic record linkage methods supporting the Brazilian 100-million cohort initiative.

    ABSTRACT Background and aims A Brazil-UK cooperation was set up in mid-2013 aiming to build a huge cohort comprising individuals registered in Cadastro Único (CADU), a socioeconomic database used in social programmes of the Brazilian government. Epidemiologists and statisticians wish to assess the impact of Bolsa Família (PBF), a conditional cash transfer programme, on the incidence of several diseases (tuberculosis, leprosy, HIV etc). The cohort must contain all individuals who received at least one payment from PBF between 2007 and 2012, which results in 100 million records according to our preliminary analysis. These individuals must be probabilistically linked with databases from the Unified Health System (SUS), such as hospitalization (SIH), notifiable diseases (SINAN), mortality (SIM), and live births (SINASC), to produce data marts (domain-specific data) for the proposed studies. Within this cooperation, our first goal was to design and evaluate probabilistic methods to routinely link the cohort, PBF, and SUS outcomes. Approach We implemented two probabilistic linkage methods: a fully probabilistic one, based on the Dice similarity (Sorensen index) of Bloom filters; and a hybrid approach, based on rules for deterministic and probabilistic matching. We performed linkages involving CADU (2011 extraction) and SUS outcomes (SIH, SINAN, and SIM) with samples from 3 states (Sergipe, Santa Catarina and Bahia) of increasing size (from 1,447,512 to 12,036,010). Results Using a Dice between 0.90 and 0.92, our methods retrieved more than 95% of true positive pairs amongst the linked pairs. For Sergipe, we obtained as : , , , respectively for SIH, SINAN, and SIM. For Bahia: , , . Another linkage between CADU (1,447,512 records) and SINAN (624 records), for tuberculosis in Sergipe, returned 397 (full probabilistic) and 311 (hybrid) linked pairs, of which 306 and 300 were true positives. 
Another execution considering CADU (1,988,599 records) and SINAN (2,094 records), for tuberculosis in Santa Catarina, returned 791 (full probabilistic) and 500 (hybrid) linked pairs, with 667 and 472 true positives. Linking CADU (1,685,697 records) and SIM, for mortality of children under 4, returned 18 linked pairs, all of them true positives, for a Dice between 0.90 and 0.92 and with 100% sensitivity, specificity, and positive predictive value. Conclusion Due to the absence of gold standards, we used samples of increasing size and manual review when adequate. Our results are quite accurate, although obtained with a single extraction of CADU. We are starting to run linkages with the entire cohort.
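The fully probabilistic method named in this abstract compares Bloom filters using the Dice (Sorensen) coefficient. The sketch below shows one common way to do that: split a field into character bigrams, hash each bigram into bit positions, and compare the resulting position sets. The filter size, number of hash functions, hashing scheme, and function names are assumptions for illustration and are not taken from the abstract; only the 0.90 threshold reflects the range it reports.

```python
import hashlib


def bigrams(value: str) -> set[str]:
    """Split a normalised string into padded character bigrams."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value: str, size: int = 128, num_hashes: int = 2) -> set[int]:
    """Return the set of bit positions set to 1 for this value (assumed parameters)."""
    positions = set()
    for gram in bigrams(value):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            positions.add(int.from_bytes(digest[:8], "big") % size)
    return positions


def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient: twice the shared set bits over the total set bits."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def is_link(value_a: str, value_b: str, threshold: float = 0.90) -> bool:
    """Treat a pair as linked when the Dice similarity reaches the threshold."""
    return dice(bloom_encode(value_a), bloom_encode(value_b)) >= threshold
```

Because only bit positions are compared, two parties can exchange encoded fields instead of raw names, which is the usual motivation for Bloom filters in linkage.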

    On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort

    Issue date: 2018. Funding: CNPq, FINEP, FAPESB, Bill and Melinda Gates Foundation (OPP1161996), and The Royal Society (NF160879); also supported by the National Institute for Health Research (RP-PG-040710314), Wellcome Trust (086091/Z/08/Z), and the Farr Institute of Health Informatics Research. Affiliations: Federal University of Bahia, Institute of Mathematics and Statistics, Computer Science Department, Salvador, BA, Brazil; Federal University of Bahia, Institute of Mathematics and Statistics, Department of Statistics, Salvador, BA, Brazil; Fundação Oswaldo Cruz, Instituto Gonçalo Moniz, Centro de Integração de Dados e Conhecimento para a Saúde, Salvador, BA, Brasil / Universidade de São Paulo, São Paulo, SP, Brasil; University College London, Institute of Health Informatics, London, WC, UK.
    Data linkage refers to the process of identifying and linking records that refer to the same entity across multiple heterogeneous data sources. This method has been widely utilized across scientific domains, including public health where records from clinical, administrative, and other surveillance databases are aggregated and used for research, decision making, and assessment of public policies. When a common set of unique identifiers does not exist across sources, probabilistic linkage approaches are used to link records using a combination of attributes. These methods require a careful choice of comparison attributes as well as similarity metrics and cutoff values to decide if a given pair of records matches or not and for assessing the accuracy of the results. In large, complex datasets, linking and assessing accuracy can be challenging due to the volume and complexity of the data, the absence of a gold standard, and the challenges associated with manually reviewing a very large number of record matches. In this paper, we present AtyImo, a hybrid probabilistic linkage tool optimized for high accuracy and scalability in massive data sets. We describe the implementation details around anonymization, blocking, deterministic and probabilistic linkage, and accuracy assessment. We present results from linking a large population-based cohort of 114 million individuals in Brazil to public health and administrative databases for research. In controlled and real scenarios, we observed high accuracy of results: 93%-97% true matches. In terms of scalability, we present AtyImo's ability to link the entire cohort in less than nine days using Spark and scaling up to 20 million records in less than 12 s over heterogeneous (CPU+GPU) architectures.