
    Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study.

    Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The "true" imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001-2010) with complete data on all covariates. Variables were artificially made "missing at random," and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.
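    The analyses above were presumably run with MICE implementations in R; as a rough, hedged illustration of the comparison, the Python sketch below uses scikit-learn's IterativeImputer (a MICE-style chained-equations imputer) with a parametric Bayesian ridge model versus a random forest, and scores a simpler proxy (error on the imputed values) rather than the bias and efficiency of hazard ratio estimates evaluated in the study. All data and settings are invented.

```python
# Rough Python analogue of the comparison above, using scikit-learn's
# IterativeImputer. Data, estimator settings, and the evaluation metric
# are illustrative only, not the study's.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
n = 2000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.5 * x1 ** 2 + x2 + rng.normal(scale=0.5, size=n)  # nonlinear in x1
X = np.column_stack([x1, x2, x3])

# Make x3 "missing at random" for roughly 30% of rows.
mask = rng.rand(n) < 0.3
X_miss = X.copy()
X_miss[mask, 2] = np.nan

imputers = {
    # "Parametric MICE": linear Bayesian ridge conditional models.
    "parametric": IterativeImputer(estimator=BayesianRidge(),
                                   sample_posterior=True, max_iter=10,
                                   random_state=0),
    # "Random forest MICE": tree-based conditional models that can capture
    # the nonlinearity without it being specified.
    "random forest": IterativeImputer(estimator=RandomForestRegressor(
                                          n_estimators=100, random_state=0),
                                      max_iter=10, random_state=0),
}

for name, imputer in imputers.items():
    X_imp = imputer.fit_transform(X_miss)
    rmse = np.sqrt(np.mean((X_imp[mask, 2] - X[mask, 2]) ** 2))
    print(f"{name}: RMSE on imputed x3 = {rmse:.3f}")
```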

    Understanding Views Around the Creation of a Consented, Donated Databank of Clinical Free Text to Develop and Train Natural Language Processing Models for Research: Focus Group Interviews With Stakeholders

    BACKGROUND: Information stored within electronic health records is often recorded as unstructured text. Special computerized natural language processing (NLP) tools are needed to process this text; however, complex governance arrangements make such data in the National Health Service hard to access, and therefore, it is difficult to use for research in improving NLP methods. The creation of a donated databank of clinical free text could provide an important opportunity for researchers to develop NLP methods and tools and may circumvent delays in accessing the data needed to train the models. However, to date, there has been little or no engagement with stakeholders on the acceptability and design considerations of establishing a free-text databank for this purpose. OBJECTIVE: This study aimed to ascertain stakeholder views around the creation of a consented, donated databank of clinical free text to help create, train, and evaluate NLP for clinical research and to inform the potential next steps for adopting a partner-led approach to establish a national, funded databank of free text for use by the research community. METHODS: Web-based in-depth focus group interviews were conducted with 4 stakeholder groups (patients and members of the public, clinicians, information governance leads and research ethics members, and NLP researchers). RESULTS: All stakeholder groups were strongly in favor of the databank and saw great value in creating an environment where NLP tools can be tested and trained to improve their accuracy. Participants highlighted a range of complex issues for consideration as the databank is developed, including communicating the intended purpose, the approach to access and safeguarding the data, who should have access, and how to fund the databank. Participants recommended that a small-scale, gradual approach be adopted to start to gather donations and encouraged further engagement with stakeholders to develop a road map and set of standards for the databank. CONCLUSIONS: These findings provide a clear mandate to begin developing the databank and a framework for stakeholder expectations, which we would aim to meet with the databank delivery.

    Nitrate and nitrite contamination in drinking water and cancer risk: A systematic review with meta-analysis

    BACKGROUND: Pollution of water sources, largely from wide-scale agricultural fertilizer use, has resulted in nitrate and nitrite contamination of drinking water. The effects on human health of raised nitrate and nitrite levels in drinking water are currently unclear. OBJECTIVES: We conducted a systematic review of peer-reviewed literature on the association of nitrate and nitrite in drinking water with human health, with a specific focus on cancer. METHODS: We searched eight databases from 1 January 1990 until 28 February 2021. Meta-analyses were conducted when studies had the same exposure metric and outcome. RESULTS: Of 9835 studies identified in the literature search, we found 111 studies reporting health outcomes, 60 of which reported cancer outcomes (38 case-control studies; 12 cohort studies; 10 other study designs). Most studies were set in the USA (24), Europe (20) and Taiwan (14), with only 3 studies from low and middle-income countries. Nitrate exposure in water (59 studies) was more commonly investigated than nitrite exposure (4 studies). Colorectal (15 studies) and gastric (13 studies) cancers were the most reported. In meta-analyses (4 studies) we identified a positive association of nitrate exposure with gastric cancer, OR = 1.91 (95%CI = 1.09-3.33) per 10 mg/L increment in nitrate ion. We found no association of nitrate exposure with colorectal cancer (10 studies; OR = 1.02 [95%CI = 0.96-1.08]) or cancers at any other site. CONCLUSIONS: We identified an association of nitrate in drinking water with gastric cancer but with no other cancer site. There is currently a paucity of robust studies from settings with high levels of nitrate pollution in drinking water. Research into this area will be valuable to ascertain the true health burden of nitrate contamination of water and the need for public policies to protect human health.
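    As a hedged illustration of the pooling step behind a figure like the OR of 1.91 per 10 mg/L, the sketch below applies inverse-variance weighting with DerSimonian-Laird random effects to invented study-level odds ratios; it is not the review's actual analysis, method choice, or data.

```python
# Minimal sketch of inverse-variance pooling of odds ratios with
# DerSimonian-Laird random effects. Study-level ORs and CIs are made up
# for illustration; they are NOT the studies pooled in the review above.
import numpy as np

# (OR, lower 95% CI, upper 95% CI) per 10 mg/L nitrate, hypothetical values.
studies = [(1.5, 0.9, 2.5), (2.2, 1.1, 4.4), (1.8, 0.8, 4.0), (2.5, 1.2, 5.2)]

log_or = np.array([np.log(or_) for or_, lo, hi in studies])
# SE recovered from the CI width on the log scale: (ln(hi) - ln(lo)) / (2 * 1.96)
se = np.array([(np.log(hi) - np.log(lo)) / (2 * 1.96) for or_, lo, hi in studies])

w_fixed = 1 / se ** 2
pooled_fixed = np.sum(w_fixed * log_or) / np.sum(w_fixed)

# DerSimonian-Laird estimate of the between-study variance tau^2.
q = np.sum(w_fixed * (log_or - pooled_fixed) ** 2)
df = len(studies) - 1
c = np.sum(w_fixed) - np.sum(w_fixed ** 2) / np.sum(w_fixed)
tau2 = max(0.0, (q - df) / c)

# Random-effects pooled estimate and 95% CI, back-transformed to the OR scale.
w_rand = 1 / (se ** 2 + tau2)
pooled = np.sum(w_rand * log_or) / np.sum(w_rand)
pooled_se = np.sqrt(1 / np.sum(w_rand))
ci = np.exp([pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se])
print(f"Pooled OR = {np.exp(pooled):.2f} (95% CI {ci[0]:.2f}-{ci[1]:.2f})")
```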

    Translating and evaluating historic phenotyping algorithms using SNOMED CT

    OBJECTIVE: Patient phenotype definitions based on terminologies are required for the computational use of electronic health records. Within UK primary care research databases, such definitions have typically been represented as flat lists of Read terms, but the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), a widely employed international reference terminology, enables the use of relationships between concepts, which could facilitate the phenotyping process. We implemented SNOMED CT-based phenotyping approaches and investigated their performance in the CPRD Aurum primary care database. MATERIALS AND METHODS: We developed SNOMED CT phenotype definitions for 3 exemplar diseases: diabetes mellitus, asthma, and heart failure, using 3 methods: "primary" (primary concept and its descendants), "extended" (primary concept, descendants, and additional relations), and "value set" (based on text searches of term descriptions). We also derived SNOMED CT codelists in a semiautomated manner for 276 disease phenotypes used in a study of health across the lifecourse. Cohorts selected using each codelist were compared to "gold standard" manually curated Read codelists in a sample of 500,000 patients from CPRD Aurum. RESULTS: SNOMED CT codelists selected a similar set of patients to Read, with F1 scores exceeding 0.93, and age and sex distributions were similar. The "value set" and "extended" codelists had slightly greater recall but lower precision than "primary" codelists. We were able to represent 257 of the 276 phenotypes by a single concept hierarchy, and for 135 phenotypes, the F1 score was greater than 0.9. CONCLUSIONS: SNOMED CT provides an efficient way to define disease phenotypes, resulting in similar patient populations to manually curated codelists.
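    A hedged sketch of the "primary" approach (a concept plus its descendants) and of an F1 comparison against a gold-standard codelist is shown below. The miniature hierarchy, patient records, and code values are illustrative inventions; a real pipeline would traverse SNOMED CT "is a" relationships in CPRD Aurum or a terminology server.

```python
# Illustrative sketch: expand a primary SNOMED CT concept to its descendants,
# select patients, and compare against a manually curated gold standard.
children = {  # hypothetical fragment of an "is a" hierarchy
    "73211009": ["44054006", "46635009"],   # diabetes mellitus -> type 2, type 1
    "44054006": ["237599002"],              # type 2 -> insulin-treated type 2
}

def descendants(concept, graph):
    """Return the concept and all of its transitive descendants."""
    out, stack = set(), [concept]
    while stack:
        c = stack.pop()
        if c not in out:
            out.add(c)
            stack.extend(graph.get(c, []))
    return out

codelist = descendants("73211009", children)

# patient_id -> set of SNOMED CT concepts recorded in primary care (invented).
records = {1: {"44054006"}, 2: {"22298006"}, 3: {"237599002"}, 4: {"195967001"}}
selected = {pid for pid, codes in records.items() if codes & codelist}
gold = {1, 3}  # patients selected by the manually curated Read codelist

tp = len(selected & gold)
precision = tp / len(selected) if selected else 0.0
recall = tp / len(gold) if gold else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```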

    Machine learning models in electronic health records can outperform conventional survival models for predicting patient mortality in coronary artery disease

    Prognostic modelling is important in clinical practice and epidemiology for patient management and research. Electronic health records (EHR) provide large quantities of data for such models, but conventional epidemiological approaches require significant researcher time to implement. Expert selection of variables, fine-tuning of variable transformations and interactions, and imputing missing values are time-consuming and could bias subsequent analysis, particularly given that missingness in EHR is both high, and may carry meaning. Using a cohort of 80,000 patients from the CALIBER programme, we compared traditional modelling and machine-learning approaches in EHR. First, we used Cox models and random survival forests with and without imputation on 27 expert-selected, preprocessed variables to predict all-cause mortality. We then used Cox models, random forests and elastic net regression on an extended dataset with 586 variables to build prognostic models and identify novel prognostic factors without prior expert input. We observed that data-driven models used on an extended dataset can outperform conventional models for prognosis, without data preprocessing or imputing missing values. An elastic net Cox regression based on 586 unimputed variables, with continuous values discretised, achieved a C-index of 0.801 (bootstrapped 95% CI 0.799 to 0.802), compared to 0.793 (0.791 to 0.794) for a traditional Cox model comprising 27 expert-selected variables with imputation for missing values. We also found that data-driven models allow identification of novel prognostic variables; that the absence of values for particular variables carries meaning, and can have significant implications for prognosis; and that variables often have a nonlinear association with mortality, which discretised Cox models and random forests can elucidate. This demonstrates that machine-learning approaches applied to raw EHR data can be used to build models for use in research and clinical practice, and identify novel predictive variables and their effects to inform future research.
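    A minimal sketch of the modelling idea, assuming the lifelines library as a stand-in for the study's own implementation: continuous predictors are discretised with an explicit "missing" category (so absence of a value can carry prognostic information) and an elastic-net penalised Cox model is fitted without imputation. Data, variable names, and penalty values are invented.

```python
# Sketch: discretise continuous predictors, keep missingness as its own
# category, and fit an elastic-net penalised Cox model without imputation.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 1000
age = rng.normal(65, 10, n)
sbp = rng.normal(135, 20, n)
sbp[rng.random(n) < 0.3] = np.nan            # 30% of blood pressures missing
time = rng.exponential(scale=np.exp(3 - 0.02 * (age - 65)), size=n)
event = (rng.random(n) < 0.7).astype(int)

df = pd.DataFrame({"age": age, "sbp": sbp, "T": time, "E": event})

# Discretise into quintiles; missing values become their own category.
for col in ["age", "sbp"]:
    binned = pd.qcut(df[col], 5, labels=False)
    df[col + "_bin"] = (binned.fillna(-1).astype(int).astype(str)
                        .replace("-1", "missing"))

X = pd.get_dummies(df[["age_bin", "sbp_bin"]], drop_first=True).astype(float)
X[["T", "E"]] = df[["T", "E"]]

cph = CoxPHFitter(penalizer=0.05, l1_ratio=0.5)   # elastic-net penalty
cph.fit(X, duration_col="T", event_col="E")
print(f"C-index: {cph.concordance_index_:.3f}")
```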

    Long term health care use and costs in patients with stable coronary artery disease: a population based cohort using linked electronic health records (CALIBER)

    Aims: To examine long term health care utilisation and costs of patients with stable coronary artery disease (SCAD). Methods and results: Linked cohort study of 94,966 patients with SCAD in England, 1st January 2001 to 31st March 2010, identified from primary care, secondary care, disease and death registries. Resource use and costs, and cost predictors by time and 5-year cardiovascular (CVD) risk profile were estimated using generalised linear models. Coronary heart disease hospitalisations were 20.5% in the first year and 66% in the year following a non-fatal (myocardial infarction, ischaemic or haemorrhagic stroke) event. Mean health care costs were £3,133 per patient in the first year and £10,377 in the year following a non-fatal event. First year predictors of cost included sex (mean cost £549 lower in females); SCAD diagnosis (NSTEMI cost £656 more than stable angina); and co-morbidities (heart failure cost £657 more per patient). Compared with lower risk patients (5-year CVD risk 3.5%), those of higher risk (5-year CVD risk 44.2%) had higher 5-year costs (£23,393 vs. £9,335) and lower lifetime costs (£43,020 vs. £116,888). Conclusion: Patients with SCAD incur substantial health care utilisation and costs, which vary and may be predicted by 5-year CVD risk profile. Higher risk patients have higher initial but lower lifetime costs than lower risk patients as a result of shorter life expectancy. Improved cardiovascular survivorship among an ageing CVD population is likely to require stratified care in anticipation of the burgeoning demand.
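    Below is a hedged sketch of cost modelling with a generalised linear model, assuming a Gamma family with log link (a common specification for skewed cost data; the paper's exact family, link, and covariates are not stated here). Covariate names, coefficients, and the simulated costs are illustrative only.

```python
# Sketch of a GLM for annual health care costs (Gamma family, log link),
# with simulated data standing in for the linked EHR cohort described above.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5000
df = pd.DataFrame({
    "female": rng.integers(0, 2, n),
    "nstemi": rng.integers(0, 2, n),          # vs. stable angina at diagnosis
    "heart_failure": rng.integers(0, 2, n),
})
lin = 8.0 - 0.15 * df["female"] + 0.2 * df["nstemi"] + 0.2 * df["heart_failure"]
df["cost"] = rng.gamma(shape=2.0, scale=np.exp(lin) / 2.0)

model = smf.glm("cost ~ female + nstemi + heart_failure", data=df,
                family=sm.families.Gamma(link=sm.families.links.Log()))
result = model.fit()
# Coefficients are on the log scale; exp(coef) gives multiplicative cost ratios.
print(np.exp(result.params))
```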

    Use of Coronary Computed Tomographic Angiography to guide management of patients with coronary disease

    Background: In a prospective, multicenter, randomized controlled trial, 4,146 patients were randomized to receive standard care or standard care plus coronary computed tomography angiography (CCTA). Objectives: The purpose of this study was to explore the consequences of CCTA-assisted diagnosis on invasive coronary angiography, preventive treatments, and clinical outcomes. Methods: In post hoc analyses, we assessed changes in invasive coronary angiography, preventive treatments, and clinical outcomes using national electronic health records. Results: Despite similar overall rates (409 vs. 401; p = 0.451), invasive angiography was less likely to demonstrate normal coronary arteries (20 vs. 56; hazard ratio [HR]: 0.39 [95% confidence interval (CI): 0.23 to 0.68]; p < 0.001) but more likely to show obstructive coronary artery disease (283 vs. 230; HR: 1.29 [95% CI: 1.08 to 1.55]; p = 0.005) in those allocated to CCTA. More preventive therapies (283 vs. 74; HR: 4.03 [95% CI: 3.12 to 5.20]; p < 0.001) were initiated after CCTA, with each drug commencing at a median of 48 to 52 days after clinic attendance. From the median time for preventive therapy initiation (50 days), fatal and nonfatal myocardial infarction was halved in patients allocated to CCTA compared with those assigned to standard care (17 vs. 34; HR: 0.50 [95% CI: 0.28 to 0.88]; p = 0.020). Cumulative 6-month costs were slightly higher with CCTA: difference $462 (95% CI: $303 to $621). Conclusions: In patients with suspected angina due to coronary heart disease, CCTA leads to more appropriate use of invasive angiography and alterations in preventive therapies that were associated with a halving of fatal and non-fatal myocardial infarction. (Scottish COmputed Tomography of the HEART Trial [SCOT-HEART]; NCT01149590).
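    The 50-day landmark comparison reported above can be illustrated, on invented data, with a simple Cox model on a treatment-group indicator (here via lifelines); this is not the trial's analysis code, and the simulated event rates bear no relation to SCOT-HEART.

```python
# Sketch of a landmark analysis: count follow-up from day 50 (the landmark),
# keep only patients still event-free at that point, and estimate the HR for
# the randomised-group indicator with a Cox model. All data are simulated.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(2)
n = 4146
group = rng.integers(0, 2, n)                   # 1 = CCTA arm, 0 = standard care
time = rng.exponential(scale=np.where(group == 1, 4000, 2000), size=n)
event = (time < 1800).astype(int)               # administrative censoring ~5 years
time = np.minimum(time, 1800)

df = pd.DataFrame({"group": group, "time": time, "event": event})

landmark = 50
df = df[df["time"] > landmark].copy()           # event-free at the landmark
df["time"] = df["time"] - landmark              # time measured from the landmark

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
print(cph.hazard_ratios_)                       # HR for CCTA vs. standard care
```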

    UK phenomics platform for developing and validating electronic health record phenotypes: CALIBER

    Objective: Electronic health records (EHRs) are a rich source of information on human diseases, but the information is variably structured, fragmented, curated using different coding systems, and collected for purposes other than medical research. We describe an approach for developing, validating, and sharing reproducible phenotypes from national structured EHR in the United Kingdom with applications for translational research. Materials and Methods: We implemented a rule-based phenotyping framework, with up to 6 approaches of validation. We applied our framework to a sample of 15 million individuals in a national EHR data source (population-based primary care, all ages) linked to hospitalization and death records in England. Data comprised continuous measurements (for example, blood pressure), medication information, and coded diagnoses, symptoms, procedures, and referrals, recorded using 5 controlled clinical terminologies: (1) Read (primary care, subset of SNOMED-CT [Systematized Nomenclature of Medicine Clinical Terms]), (2) International Classification of Diseases–Ninth Revision and Tenth Revision (secondary care diagnoses and cause of mortality), (3) Office of Population Censuses and Surveys Classification of Surgical Operations and Procedures, Fourth Revision (hospital surgical procedures), and (4) dm+d prescription codes. Results: Using the CALIBER phenotyping framework, we created algorithms for 51 diseases, syndromes, biomarkers, and lifestyle risk factors and provide up to 6 validation approaches. The EHR phenotypes are curated in the open-access CALIBER Portal (https://www.caliberresearch.org/portal) and have been used by 40 national and international research groups in 60 peer-reviewed publications. Conclusions: We describe a UK EHR phenomics approach within the CALIBER EHR data platform with initial evidence of validity and use, as an important step toward international use of UK EHR data for health research.
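    A hedged sketch of what a rule-based EHR phenotype of this kind can look like in code: qualifying Read and ICD-10 codelists are applied to linked primary care and hospital records, and onset is taken as the earliest qualifying date. Codelists, records, and the onset rule are invented for illustration; the curated definitions themselves live in the CALIBER Portal linked above.

```python
# Illustrative rule-based phenotype across linked sources: a patient has the
# phenotype if a qualifying code appears in primary care (Read) or hospital
# (ICD-10) records; onset is the earliest qualifying record.
from datetime import date

read_codelist = {"G30..", "G301."}          # hypothetical Read codes for MI
icd10_codelist = {"I21", "I22"}             # hypothetical ICD-10 codes for MI

primary_care = [                            # (patient_id, read_code, date)
    (1, "G30..", date(2005, 3, 1)),
    (2, "H33..", date(2006, 7, 9)),
]
hospital = [                                # (patient_id, icd10_code, date)
    (1, "I21", date(2005, 2, 20)),
    (3, "I22", date(2008, 1, 5)),
]

onset = {}
for pid, code, d in primary_care:
    if code in read_codelist:
        onset[pid] = min(onset.get(pid, d), d)
for pid, code, d in hospital:
    if any(code.startswith(c) for c in icd10_codelist):
        onset[pid] = min(onset.get(pid, d), d)

for pid, d in sorted(onset.items()):
    print(f"patient {pid}: phenotype onset {d.isoformat()}")
```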