12 research outputs found
Examining the quality of record linkage process using nationwide Brazilian administrative databases to build a large birth cohort.
BACKGROUND: Research using linked routine population-based data collected for non-research purposes has increased in recent years because such data are a rich and detailed source of information. The objective of this study is to present an approach to prepare and link data from administrative sources in a middle-income country, to estimate linkage quality and to identify potential sources of bias by comparing linked and non-linked individuals. METHODS: We linked two Brazilian administrative datasets covering the period 2001 to 2015, using maternal attributes (name, age, date of birth, and municipality of residence): the live birth information system and the 100 Million Brazilian Cohort (created using administrative records from over 114 million individuals whose families applied for social assistance via the Unified Register for Social Programmes). The linkage used CIDACS-RL, a linkage tool developed in house. We then estimated the proportion of highly probable links and examined the characteristics of missed matches to identify any potential source of bias. RESULTS: A total of 27,699,891 live births were submitted to linkage with maternal information recorded in the baseline of the 100 Million Brazilian Cohort dataset; of those, 16,447,414 (59.4%) children were found registered in the 100 Million Brazilian Cohort dataset. The proportion of highly probable links ranged from 39.3% in 2001 to 82.1% in 2014. A substantial improvement in the linkage was observed after the introduction of the maternal date of birth attribute in 2011. Our analyses indicated a slightly higher proportion of missing data among missed matches and a higher proportion of people living in an urban area and self-declared as Caucasian among linked pairs when compared with non-linked sets.
DISCUSSION: We demonstrated that CIDACS-RL is capable of performing high-quality linkage even with a limited number of common attributes, using indexation as a blocking strategy in large routine databases from a middle-income country. However, residual records occurred more among people under worse living conditions. The results presented in this study reinforce the need to evaluate linkage quality and, when necessary, to take linkage error into account in the analyses of any generated dataset.
CIDACS-RL: a novel indexing search and scoring-based record linkage system for huge datasets with high accuracy and scalability
Background: Record linkage is the process of identifying and combining records about the same individual from two or more different datasets. While there are many open source and commercial data linkage tools, the volume and complexity of currently available datasets for linkage pose a huge challenge; hence, designing an efficient linkage tool with reasonable accuracy and scalability is required. Methods: We developed CIDACS-RL (Centre for Data and Knowledge Integration for Health – Record Linkage), a novel iterative deterministic record linkage algorithm based on a combination of indexing search and scoring algorithms (provided by Apache Lucene). We described how the algorithm works and compared its performance with four open source linkage tools (AtyImo, Febrl, FRIL and RecLink) in terms of sensitivity and positive predictive value using a gold standard dataset. We also evaluated its accuracy and scalability using a case study, and its scalability and execution time using a simulated cohort in serial (single core) and multi-core (eight cores) computation settings. Results: Overall, the CIDACS-RL algorithm had a superior performance: positive predictive value (99.93% versus AtyImo 99.30%, RecLink 99.5%, Febrl 98.86%, and FRIL 96.17%) and sensitivity (99.87% versus AtyImo 98.91%, RecLink 73.75%, Febrl 90.58%, and FRIL 74.66%). In the case study, using a ROC curve to choose the most appropriate cut-off value (0.896), the obtained metrics were: sensitivity = 92.5% (95% CI 92.07–92.99), specificity = 93.5% (95% CI 93.08–93.8) and area under the curve (AUC) = 97% (95% CI 96.97–97.35). The multi-core computation was about four times faster (150 seconds) than the serial setting (550 seconds) when using a dataset of 20 million records. Conclusion: The CIDACS-RL algorithm is an innovative linkage tool for huge datasets, with higher accuracy, improved scalability, and substantially shorter execution time compared to other existing linkage tools.
In addition, CIDACS-RL can be deployed on standard computers without the need for high-speed processors and distributed infrastructures.
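The abstract reports a cut-off chosen from a ROC curve but does not state the selection rule. As a minimal sketch, assuming the rule is Youden's J (sensitivity + specificity − 1), the selection could look like the following. The function names and the toy scores and labels are hypothetical illustrations, not CIDACS-RL code or data.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(scores: Sequence[float], labels: Sequence[int],
                            cutoff: float) -> Tuple[float, float]:
    """Treat pairs with score >= cutoff as links and compare with the truth.

    Assumes the labels contain at least one true match and one non-match.
    """
    tp = fp = tn = fn = 0
    for score, is_match in zip(scores, labels):
        predicted_link = score >= cutoff
        if predicted_link and is_match:
            tp += 1
        elif predicted_link:
            fp += 1
        elif is_match:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def best_cutoff(scores: Sequence[float],
                labels: Sequence[int]) -> Tuple[float, float, float]:
    """Return (cutoff, sensitivity, specificity) maximising Youden's J.

    Youden's J is an assumed criterion; the source only says "ROC curve".
    """
    best = None
    for cutoff in sorted(set(scores)):
        sens, spec = sensitivity_specificity(scores, labels, cutoff)
        j = sens + spec - 1.0
        if best is None or j > best[0]:
            best = (j, cutoff, sens, spec)
    return best[1], best[2], best[3]


# Toy, made-up similarity scores with review labels (1 = true match).
toy_scores = [0.99, 0.95, 0.91, 0.90, 0.85, 0.80, 0.60, 0.40]
toy_labels = [1, 1, 1, 1, 0, 1, 0, 0]
print(best_cutoff(toy_scores, toy_labels))
```

Every observed score is tried as a candidate cut-off, which is adequate for a reviewed sample but would need binning for millions of candidate pairs.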
Ethnoracial inequalities and child mortality in Brazil: a nationwide longitudinal study of 19 million newborn babies
BACKGROUND: Racism is a social determinant of health inequities. In Brazil, racial injustices lead to poor outcomes in maternal and child health for Black and Indigenous populations, including greater risks of pregnancy-related complications; decreased access to antenatal, delivery, and postnatal care; and higher childhood mortality rates. In this study, we aimed to estimate inequalities in childhood mortality rates by maternal race and skin colour in a cohort of more than 19 million newborns in Brazil. METHODS: We did a nationwide population-based, retrospective cohort study using linked data on all births and deaths in Brazil between Jan 1, 2012, and Dec 31, 2018. The data consisted of livebirths followed up to age 5 years, death, or Dec 31, 2018. Data for livebirths were extracted from the National Information System for livebirths, SINASC, and for deaths from the Mortality Information System, SIM. The final sample consisted of complete data for all cases regarding maternal race and skin colour, and no inconsistencies were present between date of birth and death after linkage. We fitted Cox proportional hazard regression models to calculate the crude and adjusted hazard ratios (HRs) and 95% CIs for the association between maternal race and skin colour and all-cause and cause-specific younger than age 5 mortality rates, by age subgroups. We calculated the trend of HRs (and 95% CI) by time of observation (calendar year) to indicate trends in inequalities. FINDINGS: From the 20 526 714 livebirths registered in SINASC between Jan 1, 2012, and Dec 31, 2018, 238 436 were linked to death records identified from SIM. After linkage, 1 010 871 records were excluded due to missing data on maternal race or skin colour or inconsistent date of death. 19 515 843 livebirths were classified by mother's race, of which 224 213 died. 
Compared with children of White mothers, mortality risk for children younger than age 5 years was higher among children of Indigenous (HR 1·98 [95% CI 1·92-2·06]), Black (HR 1·39 [1·36-1·41]), and Brown or Mixed race (HR 1·19 [1·18-1·20]) mothers. The highest hazard ratios were observed during the post-neonatal period (Indigenous, HR 2·78 [95% CI 2·64-2·95]; Black, HR 1·54 [1·48-1·59]; and Brown or Mixed race, HR 1·25 [1·23-1·27]) and between the ages of 1 year and 4 years (Indigenous, HR 3·82 [95% CI 3·52-4·15]; Black, HR 1·51 [1·42-1·60]; and Brown or Mixed race, HR 1·30 [1·26-1·35]). Children of Indigenous (HR 16·39 [95% CI 12·88-20·85]), Black (HR 2·34 [1·78-3·06]), and Brown or Mixed race mothers (HR 2·05 [1·71-2·45]) had a higher risk of death from malnutrition than did children of White mothers. Similar patterns were observed for death from diarrhoea (Indigenous, HR 14·28 [95% CI 12·25-16·65]; Black, HR 1·72 [1·44-2·05]; and Brown or Mixed race mothers, HR 1·78 [1·61-1·98]) and influenza and pneumonia (Indigenous, HR 6·49 [95% CI 5·78-7·27]; Black, HR 1·78 [1·62-1·96]; and Brown or Mixed race mothers, HR 1·60 [1·51-1·69]). INTERPRETATION: Substantial ethnoracial inequalities were observed in child mortality in Brazil, especially among the Indigenous and Black populations. These findings demonstrate the importance of regular racial inequality assessments and monitoring. We suggest implementing policies to promote ethnoracial equity to reduce the impact of racism on child health. FUNDING: MCTI/CNPq/MS/SCTIE/Decit/Bill & Melinda Gates Foundation's Grandes Desafios Brasil, Desenvolvimento Saudável para Todas as Crianças, and Wellcome Trust core support grant awarded to CIDACS-Center for Data and Knowledge Integration for Health.
Conditional cash transfer program and child mortality: A cross-sectional analysis nested within the 100 Million Brazilian Cohort.
BACKGROUND: Brazil has made great progress in reducing child mortality over the past decades, and part of this achievement has been credited to the Bolsa Família program (BFP). We examined the association between being a BFP beneficiary and child mortality (1-4 years of age), also examining how this association differs by maternal race/skin color, gestational age at birth (term versus preterm), municipality income level, and index of quality of BFP management. METHODS AND FINDINGS: This is a cross-sectional analysis nested within the 100 Million Brazilian Cohort, a population-based cohort primarily built from Brazil's Unified Registry for Social Programs (Cadastro Único). We analyzed data from 6,309,366 children under 5 years of age whose families enrolled between 2006 and 2015. Through deterministic linkage with the BFP payroll datasets, and similarity linkage with the Brazilian Mortality Information System, 4,858,253 children were identified as beneficiaries (77%) and 1,451,113 (23%) were not. Our analysis consisted of a combination of kernel matching and weighted logistic regressions. After kernel matching, 5,308,989 (84.1%) children were included in the final weighted logistic analysis, with 4,107,920 (77.4%) of those being beneficiaries and 1,201,069 (22.6%) not, with a total of 14,897 linked deaths. Overall, BFP participation was associated with a reduction in child mortality (weighted odds ratio [OR] = 0.83; 95% CI: 0.79 to 0.88; p < 0.001). This association was stronger for preterm children (weighted OR = 0.78; 95% CI: 0.68 to 0.90; p < 0.001), children of Black mothers (weighted OR = 0.74; 95% CI: 0.57 to 0.97; p < 0.001), children living in municipalities in the lowest income quintile (first quintile of municipal income: weighted OR = 0.72; 95% CI: 0.62 to 0.82; p < 0.001), and municipalities with better index of BFP management (5th quintile of the Decentralized Management Index: weighted OR = 0.76; 95% CI: 0.66 to 0.88; p < 0.001).
The main limitation of our methodology is that our propensity score approach does not account for possible unmeasured confounders. Furthermore, sensitivity analysis showed that loss of nameless death records before linkage may have resulted in overestimation of the associations between BFP participation and mortality, with loss of statistical significance in municipalities with greater losses of data and change in the direction of the association in municipalities with no losses. CONCLUSIONS: In this study, we observed a significant association between BFP participation and child mortality in children aged 1-4 years and found that this association was stronger for children living in municipalities in the lowest quintile of wealth, in municipalities with better index of program management, and also in preterm children and children of Black mothers. These findings reinforce the evidence that programs like BFP, already proven effective in poverty reduction, have a great potential to improve child health and survival. Subgroup analysis revealed heterogeneous results, useful for policy improvement and better targeting of BFP
Cohort profile: the 100 million Brazilian cohort
• The creation of The 100 Million Brazilian Cohort was motivated by the availability of high-quality but dispersed social and health databases in Brazil and the need to integrate data and evaluate the impact of policies aiming to improve the social determinants of health (e.g. social protection policies) on health outcomes, overall and in subgroups of interest in a dynamic cohort.
• The baseline of The 100 Million Brazilian Cohort comprises 131 697 800 low-income individuals in 35 358 415 families from 2011 to 2018. The Cohort population is mostly composed of children and young adults, with a higher proportion of females than the general Brazilian population, who identify themselves as Brown and live in the urban area of the country.
• Exposure to social protection and the follow-up of individuals are obtained through: (i) deterministic linkage using the Social Identification Number (NIS) to link the Cohort baseline to social protection programmes and to periodically renewed socioeconomic information in Cadastro Único datasets; and/or (ii) non-deterministic linkage using the CIDACS-RL non-deterministic linkage tool, to link the Cohort baseline to administrative health care datasets such as mortality (Mortality Information System, SIM), disease notification (Information System for Notifiable Diseases, SINAN), birth information (Live Birth Information System, SINASC) and nutrition status (Food and Nutrition Surveillance System, SISVAN).
• So far, studies have used The 100 Million Brazilian Cohort to investigate the socioeconomic and demographic determinants of leprosy, leprosy treatment outcomes and low birthweight and to evaluate the impact of the Bolsa Familia Programme (BFP) on leprosy and child mortality. Other studies are now being conducted that are of utmost relevance to the health inequalities of Brazil and many low- and middle-income countries, and many research opportunities are being opened up with the linkage of a range of health outcomes.
Cohort Profile: Centro de Integração de Dados e Conhecimentos para Saúde (CIDACS) Birth Cohort.
No Abstract available. Declaration CIDAC
Design and evaluation of probabilistic record linkage methods supporting the Brazilian 100-million cohort initiative.
ABSTRACT
Background and aims
A Brazil-UK cooperation was set up in mid-2013 to build a huge cohort comprising individuals registered in Cadastro Único (CADU), a socioeconomic database used in social programmes of the Brazilian government. Epidemiologists and statisticians wish to assess the impact of Bolsa Família (PBF), a conditional cash transfer programme, on the incidence of several diseases (tuberculosis, leprosy, HIV etc). The cohort must contain all individuals who received at least one payment from PBF between 2007 and 2012, which results in 100 million records according to our preliminary analysis. These individuals must be probabilistically linked with databases from the Unified Health System (SUS), such as hospitalization (SIH), notifiable diseases (SINAN), mortality (SIM), and live births (SINASC), to produce data marts (domain-specific data) for the proposed studies. Within this cooperation, our first goal was to design and evaluate probabilistic methods to routinely link the cohort, PBF, and SUS outcomes.
Approach
We implemented two probabilistic linkage methods: a fully probabilistic approach, based on the Dice similarity (Sørensen index) of Bloom filters; and a hybrid approach, based on rules for deterministic and probabilistic matching. We performed linkages involving CADU (2011 extraction) and SUS outcomes (SIH, SINAN, and SIM) with samples of increasing size (from 1,447,512 to 12,036,010 records) from 3 states (Sergipe, Santa Catarina and Bahia).
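The Dice comparison of Bloom filters can be sketched as follows. This is a minimal illustration and not the authors' implementation: the bigram tokenisation, the filter size, the number of hash functions and the SHA-256 hashing are all assumptions, since the abstract does not give these parameters, and the example names are made up.

```python
import hashlib
from typing import Set


def bigrams(text: str) -> Set[str]:
    """Split a padded, lower-cased string into character bigrams (assumed tokenisation)."""
    padded = f"_{text.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_filter(text: str, size: int = 1024, num_hashes: int = 2) -> Set[int]:
    """Return the set of bit positions set to 1 for `text`.

    `size` and `num_hashes` are illustrative values, not taken from the source.
    """
    bits: Set[int] = set()
    for gram in bigrams(text):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(a: Set[int], b: Set[int]) -> float:
    """Dice (Sørensen) coefficient: 2 * |A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical names, for illustration only.
same = dice(bloom_filter("maria silva"), bloom_filter("maria silva"))
similar = dice(bloom_filter("maria silva"), bloom_filter("maria sliva"))
different = dice(bloom_filter("maria silva"), bloom_filter("joao pereira"))
print(same, similar, different)
```

A pair would then be accepted as a link when its Dice value reaches the chosen threshold. Comparing filters instead of raw strings means the original attribute values do not have to be exchanged.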
Results
Using a Dice coefficient between 0.90 and 0.92, our methods retrieved more than 95% of true positive pairs amongst the linked pairs. For Sergipe, we obtained as : , , , respectively for SIH, SINAN, and SIM. For Bahia: , , . Another linkage between CADU (1,447,512 records) and SINAN (624 records), for tuberculosis in Sergipe, returned 397 (full probabilistic) and 311 (hybrid) linked pairs, of which 306 and 300 were true positives. Another execution considering CADU (1,988,599 records) and SINAN (2,094 records), for tuberculosis in Santa Catarina, returned 791 (full probabilistic) and 500 (hybrid) linked pairs, with 667 and 472 true positives. Linking CADU (1,685,697 records) and SIM, for mortality of children under 4, returned 18 linked pairs, all of them true positives, for a Dice coefficient between 0.90 and 0.92 and with 100% sensitivity, specificity, and positive predictive value.
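The share of true positives among linked pairs (positive predictive value) can be recomputed from the counts reported for the two tuberculosis linkages. The sketch below uses only the figures stated above; the function and dictionary names are hypothetical.

```python
from typing import Dict, Tuple


def ppv(true_positives: int, linked_pairs: int) -> float:
    """Positive predictive value: true positives divided by all linked pairs."""
    if linked_pairs <= 0:
        raise ValueError("linked_pairs must be positive")
    return true_positives / linked_pairs


# (true positives, linked pairs) as reported for the CADU-SINAN tuberculosis linkages.
reported: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("Sergipe", "full probabilistic"): (306, 397),
    ("Sergipe", "hybrid"): (300, 311),
    ("Santa Catarina", "full probabilistic"): (667, 791),
    ("Santa Catarina", "hybrid"): (472, 500),
}

for (state, method), (tp, linked) in reported.items():
    print(f"{state:15s} {method:20s} PPV = {ppv(tp, linked):.1%}")
```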
Conclusion
Due to the absence of gold standards, we used samples of increasing size and manual review when adequate. Our results are quite accurate, although obtained with a single extraction of CADU. We are starting to run linkages with the entire cohort.
Assessing the accuracy of probabilistic record linkage of social and health databases in the 100 million Brazilian cohort
ABSTRACT
Background and aims
The Brazilian government has several social protection programmes that select their beneficiaries based on socioeconomic information kept in the Cadastro Único (CADU) database. The CADU will be used to build a population-based cohort of approximately 100 million individuals. Among the social programmes is the Bolsa Família (PBF), a conditional cash transfer programme that provides extra income to poor families. These two databases must be deterministically linked to identify individuals who received payments from PBF between 2004 and 2012. The cohort will be used in epidemiological studies aiming to assess the impact of PBF on the occurrence and severity of several diseases and health problems (tuberculosis, leprosy, HIV, child health etc). This cohort must be probabilistically linked with databases from the Unified Health System (SUS), such as hospitalization, notifiable diseases, mortality, and live births, in order to produce data marts (domain-specific data) for the proposed studies. Our goals comprise the validation of probabilistic record linkage methods to support this cohort setup.
Approach
This paper emphasizes the accuracy assessment of our methods based on the linkage of SIH (hospitalization), SINAN (notifications), and SIM (mortality) records to the 2011 extraction of CADU. We focused on hospitalization and notification of tuberculosis, as well as all-cause mortality in children under 4, for a small sample with 30,029 records (CADU). Due to the absence of gold standards, we used two approaches to assess accuracy: a clerical review and an automatic (tool-based) search. In the first case, we used different cut-off points on the similarity index to calculate sensitivity and specificity, and a ROC curve to separate matched and non-matched pairs. The second approach retrieves from CADU all matched and non-matched pairs for a given individual, serving as a gold standard for validation.
Results
We retrieved 22 linked pairs, of which 18 are true positives for infant mortality (SIM database). From SINAN, our results were 434 linked pairs with 166 true positives, and with SIH, 121 linked pairs with 34 true positives. The sensitivity of manual scan for SIM (child mortality) ranges from 44% (specificity of 100%) to 95% (specificity of 94%), with similarity indices between 0.80 and 0.97, respectively. For automatic search, we obtained a sensitivity of 69.2% and specificity of 91.8%.
Conclusion
Our results show the need for continuous improvement in our linkage routines and in how we consistently evaluate their accuracy in the absence of adequate gold standards.
On the Accuracy and Scalability of Probabilistic Data Linkage Over the Brazilian 114 Million Cohort
Previous issue date: 2018
CNPq, FINEP, FAPESB, Bill and Melinda Gates Foundation (OPP1161996), and The Royal Society (NF160879) and also supported by the National Institute for Health Research (RP-PG-040710314), Wellcome Trust (086091/Z/08/Z), and the Farr Institute of Health Informatics Research.
Federal University of Bahia. Institute of Mathematics and Statistics. Computer Science Department. Salvador, BA, Brazil.
Federal University of Bahia. Institute of Mathematics and Statistics. Department of Statistics. Salvador, BA, Brazil.
Fundação Oswaldo Cruz. Instituto Gonçalo Moniz. Centro de Integração de Dados e Conhecimento para a Saúde. Salvador, BA, Brazil / Universidade de São Paulo. São Paulo, SP, Brazil.
University College London. Institute of Health Informatics. London, WC, UK.
Data linkage refers to the process of identifying and linking records that refer to the same entity across multiple heterogeneous data sources. This method has been widely utilized across scientific domains, including public health where records from clinical, administrative, and other surveillance databases are aggregated and used for research, decision making, and assessment of public policies.
When a common set of unique identifiers does not exist across sources, probabilistic linkage approaches are used to link records using a combination of attributes. These methods require a careful choice of comparison attributes as well as similarity metrics and cutoff values to decide if a given pair of records matches or not and for assessing the accuracy of the results. In large, complex datasets, linking and assessing accuracy can be challenging due to the volume and complexity of the data, the absence of a gold standard, and the challenges associated with manually reviewing a very large number of record matches. In this paper, we present AtyImo, a hybrid probabilistic linkage tool optimized for high accuracy and scalability in massive data sets. We describe the implementation details around anonymization, blocking, deterministic and probabilistic linkage, and accuracy assessment. We present results from linking a large population-based cohort of 114 million individuals in Brazil to public health and administrative databases for research. In controlled and real scenarios, we observed high accuracy of results: 93%-97% true matches. In terms of scalability, we present AtyImo's ability to link the entire cohort in less than nine days using Spark and scaling up to 20 million records in less than 12 s over heterogeneous (CPU+GPU) architectures.
Ethno-racial inequalities on adverse birth and neonatal outcomes: a nationwide, retrospective cohort study of 21 million Brazilian newborns
Summary: Background: Ethno-racial inequalities are critical determinants of health outcomes. We quantified ethno-racial inequalities in adverse birth outcomes and early neonatal mortality in Brazil. Methods: We conducted a cohort study in Brazil using administrative linked data between 2012 and 2019. We estimated the attributable fractions for the entire population (PAF) and for specific groups (AF), as the proportion of each adverse outcome that would have been avoided if all women had the same baseline conditions as White women, both unadjusted and adjusted for socioeconomic and maternal risk factors. AF was also calculated by year and by comparing women from each maternal race/skin colour group in different groups of mothers’ schooling, with White women with 8 or more years of education as the reference group. Findings: 21,261,936 newborns were studied. If all women experienced the same rate as White women, 1.7% of preterm births, 7.2% of low birth weight (LBW), 10.8% of small for gestational age (SGA) and 11.8% of early neonatal deaths would have been prevented. Percentages preventable were higher among Indigenous (22.2% of preterm births, 17.9% of LBW, 20.5% of SGA and 19.6% of early neonatal deaths) and Black women (6% of preterm births, 21.4% of LBW, 22.8% of SGA births and 20.1% of early neonatal deaths). AF was higher in groups with fewer years of education among Indigenous, Black and Parda women for all outcomes. AF increased over time, especially among Indigenous populations. Interpretation: A considerable portion of adverse birth outcomes and neonatal deaths could be avoided if ethno-racial inequalities were non-existent in Brazil. Acting on the causes of these inequalities must be central in maternal and child health policies. Funding: Bill & Melinda Gates Foundation and Wellcome Trust.
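The verbal definition of the attributable fractions can be written compactly for the unadjusted case. This is a sketch under an assumption: the abstract does not state the estimator, so crude risks are used here, and the adjusted versions would need a fitted model. The symbols are introduced for illustration only: $R$ is the risk of the outcome among all births, $R_g$ the risk among births to women in group $g$, and $R_W$ the risk among births to White women.

```latex
\mathrm{PAF} = \frac{R - R_W}{R},
\qquad
\mathrm{AF}_g = \frac{R_g - R_W}{R_g}
```

Under this reading, a PAF of 11.8% for early neonatal deaths means that the overall risk would fall by 11.8% in relative terms if every group had the risk observed among White women.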