157 research outputs found

    Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study.

    Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The "true" imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001-2010) with complete data on all covariates. Variables were artificially made "missing at random," and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.
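The chained-equations loop this study builds on can be sketched on toy simulated data. This is a minimal single-variable illustration, not the authors' implementation; in the random-forest variant, the parametric (linear) fit below would be swapped for a forest (e.g. scikit-learn's RandomForestRegressor).

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y depends linearly on x; ~30% of y is missing at random.
n = 500
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)
missing = rng.random(n) < 0.3
y_obs = np.where(missing, np.nan, y)

# One "chained equation": start from a mean fill, then repeatedly refit a
# parametric model on the observed rows and redraw the missing values from it.
# Proper imputation draws with residual noise rather than the bare prediction;
# with several incomplete variables, MICE would cycle over each in turn.
y_imp = np.where(missing, np.nanmean(y_obs), y_obs)
for _ in range(10):
    slope, intercept = np.polyfit(x[~missing], y_obs[~missing], 1)
    resid_sd = np.std(y_obs[~missing] - (slope * x[~missing] + intercept))
    y_imp[missing] = (slope * x[missing] + intercept
                      + rng.normal(scale=resid_sd, size=int(missing.sum())))

# In this linear setup, imputations should track the unobserved truth closely.
corr = float(np.corrcoef(y_imp[missing], y[missing])[0, 1])
```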

    Digital technology and patient and public involvement (PPI) in routine care and clinical research: A pilot study

    BACKGROUND: Patient and public involvement (PPI) has a growing impact on the design of clinical care and research studies. There remains underreporting of formal PPI events, including views related to using digital tools. This study aimed to assess the feasibility of hosting a hybrid PPI event to gather views on the use of digital tools in clinical care and research. METHODS: A PPI focus day was held following local procedures and published recommendations related to advertisement, communication and delivery. Two exemplar projects were used as the basis for discussions, and qualitative and quantitative data were collected. RESULTS: 32 individuals expressed interest in the PPI day and 9 were selected to attend. 3 participated in person and 6 via an online video-calling platform. Selected written and verbal feedback was collected on two digitally themed projects and on the event itself. The overall quality and interactivity of the event were rated 4/5 by those who attended in person, and 4.5/5 and 4.8/5, respectively, by those who attended remotely. CONCLUSIONS: A hybrid PPI event is feasible and offers a flexible format to capture the views of patients. The overall enthusiasm for digital tools amongst patients in routine care and clinical research is high, though further work and standardised, systematic reporting of PPI events is required.

    Understanding Views Around the Creation of a Consented, Donated Databank of Clinical Free Text to Develop and Train Natural Language Processing Models for Research: Focus Group Interviews With Stakeholders

    BACKGROUND: Information stored within electronic health records is often recorded as unstructured text. Special computerized natural language processing (NLP) tools are needed to process this text; however, complex governance arrangements make such data in the National Health Service hard to access, and therefore, it is difficult to use for research in improving NLP methods. The creation of a donated databank of clinical free text could provide an important opportunity for researchers to develop NLP methods and tools and may circumvent delays in accessing the data needed to train the models. However, to date, there has been little or no engagement with stakeholders on the acceptability and design considerations of establishing a free-text databank for this purpose. OBJECTIVE: This study aimed to ascertain stakeholder views around the creation of a consented, donated databank of clinical free text to help create, train, and evaluate NLP for clinical research and to inform the potential next steps for adopting a partner-led approach to establish a national, funded databank of free text for use by the research community. METHODS: Web-based in-depth focus group interviews were conducted with 4 stakeholder groups (patients and members of the public, clinicians, information governance leads and research ethics members, and NLP researchers). RESULTS: All stakeholder groups were strongly in favor of the databank and saw great value in creating an environment where NLP tools can be tested and trained to improve their accuracy. Participants highlighted a range of complex issues for consideration as the databank is developed, including communicating the intended purpose, the approach to access and safeguarding the data, who should have access, and how to fund the databank. 
Participants recommended that a small-scale, gradual approach be adopted to start to gather donations and encouraged further engagement with stakeholders to develop a road map and set of standards for the databank. CONCLUSIONS: These findings provide a clear mandate to begin developing the databank and a framework for stakeholder expectations, which we would aim to meet with the databank delivery.

    Nitrate and nitrite contamination in drinking water and cancer risk: A systematic review with meta-analysis

    BACKGROUND: Pollution of water sources, largely from wide-scale agricultural fertilizer use, has resulted in nitrate and nitrite contamination of drinking water. The effects on human health of raised nitrate and nitrite levels in drinking water are currently unclear. OBJECTIVES: We conducted a systematic review of peer-reviewed literature on the association of nitrate and nitrite in drinking water with human health, with a specific focus on cancer. METHODS: We searched eight databases from 1 January 1990 until 28 February 2021. Meta-analyses were conducted when studies had the same exposure metric and outcome. RESULTS: Of 9835 studies identified in the literature search, we found 111 studies reporting health outcomes, 60 of which reported cancer outcomes (38 case-control studies; 12 cohort studies; 10 other study designs). Most studies were set in the USA (24), Europe (20) and Taiwan (14), with only 3 studies from low- and middle-income countries. Nitrate exposure in water (59 studies) was more commonly investigated than nitrite exposure (4 studies). Colorectal (15 studies) and gastric (13 studies) cancers were the most reported. In meta-analyses (4 studies) we identified a positive association of nitrate exposure with gastric cancer, OR = 1.91 (95%CI = 1.09-3.33) per 10 mg/L increment in nitrate ion. We found no association of nitrate exposure with colorectal cancer (10 studies; OR = 1.02 [95%CI = 0.96-1.08]) or cancers at any other site. CONCLUSIONS: We identified an association of nitrate in drinking water with gastric cancer but with no other cancer site. There is currently a paucity of robust studies from settings with high levels of nitrate pollution in drinking water. Research into this area will be valuable to ascertain the true health burden of nitrate contamination of water and the need for public policies to protect human health.
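The pooled ORs quoted above come from inverse-variance meta-analysis; a minimal fixed-effect sketch follows. The study-level numbers are made up for illustration and are not the review's data.

```python
import math

def pooled_or(ors, cis):
    """Fixed-effect inverse-variance pooling of odds ratios.

    `cis` are (lower, upper) 95% CI bounds; the standard error of each
    log OR is recovered as (ln(upper) - ln(lower)) / (2 * 1.96).
    """
    weights, log_ors = [], []
    for or_, (lo, hi) in zip(ors, cis):
        se = (math.log(hi) - math.log(lo)) / (2 * 1.96)
        weights.append(1.0 / se ** 2)
        log_ors.append(math.log(or_))
    pooled_log = sum(w * l for w, l in zip(weights, log_ors)) / sum(weights)
    se_pooled = math.sqrt(1.0 / sum(weights))
    return (math.exp(pooled_log),
            math.exp(pooled_log - 1.96 * se_pooled),
            math.exp(pooled_log + 1.96 * se_pooled))

# Illustrative (hypothetical) study-level estimates:
est, lo, hi = pooled_or([1.8, 2.1, 1.5],
                        [(1.1, 2.9), (1.2, 3.7), (0.9, 2.5)])
```

A random-effects model (e.g. DerSimonian-Laird) would additionally inflate each study's variance by a between-study heterogeneity term.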

    Translating and evaluating historic phenotyping algorithms using SNOMED CT

    OBJECTIVE: Patient phenotype definitions based on terminologies are required for the computational use of electronic health records. Within UK primary care research databases, such definitions have typically been represented as flat lists of Read terms, but Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) (a widely employed international reference terminology) enables the use of relationships between concepts, which could facilitate the phenotyping process. We implemented SNOMED CT-based phenotyping approaches and investigated their performance in the CPRD Aurum primary care database. MATERIALS AND METHODS: We developed SNOMED CT phenotype definitions for 3 exemplar diseases: diabetes mellitus, asthma, and heart failure, using 3 methods: "primary" (primary concept and its descendants), "extended" (primary concept, descendants, and additional relations), and "value set" (based on text searches of term descriptions). We also derived SNOMED CT codelists in a semiautomated manner for 276 disease phenotypes used in a study of health across the lifecourse. Cohorts selected using each codelist were compared to "gold standard" manually curated Read codelists in a sample of 500,000 patients from CPRD Aurum. RESULTS: SNOMED CT codelists selected a similar set of patients to Read, with F1 scores exceeding 0.93, and age and sex distributions were similar. The "value set" and "extended" codelists had slightly greater recall but lower precision than "primary" codelists. We were able to represent 257 of the 276 phenotypes by a single concept hierarchy, and for 135 phenotypes, the F1 score was greater than 0.9. CONCLUSIONS: SNOMED CT provides an efficient way to define disease phenotypes, resulting in similar patient populations to manually curated codelists.
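The "primary" strategy (a concept plus all its descendants) is a transitive-closure walk over the terminology's |is a| hierarchy. A minimal sketch, using made-up concept names in place of real SNOMED CT identifiers:

```python
def descendants(concept, is_a_children):
    """Collect a concept and all of its transitive descendants.

    `is_a_children` maps each concept to its direct children via |is a|
    relationships; the returned set is a codelist in the sense of the
    "primary" method (primary concept and its descendants).
    """
    result, stack = set(), [concept]
    while stack:
        c = stack.pop()
        if c not in result:
            result.add(c)
            stack.extend(is_a_children.get(c, []))
    return result

# Toy hierarchy; real SNOMED CT uses numeric concept IDs:
hierarchy = {
    "DiabetesMellitus": ["Type1DM", "Type2DM"],
    "Type2DM": ["Type2DMWithRetinopathy"],
}
codelist = descendants("DiabetesMellitus", hierarchy)
```

The "extended" method would additionally follow non-hierarchical relations (e.g. associated findings), and "value set" would instead filter concepts by text matches on their descriptions.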

    Subtypes of atrial fibrillation with concomitant valvular heart disease derived from electronic health records: phenotypes, population prevalence, trends and prognosis

    AIMS: To evaluate population-based electronic health record (EHR) definitions of atrial fibrillation (AF) and valvular heart disease (VHD) subtypes, time trends in prevalence and prognosis. METHODS AND RESULTS: A total of 76,019 individuals with AF were identified in England in 1998-2010 in the CALIBER resource, linking primary and secondary care EHR. An algorithm was created, implemented, and refined to identify 18 VHD subtypes using 406 diagnosis, procedure, and prescription codes. Cox models were used to investigate associations with a composite endpoint of incident stroke (ischaemic, haemorrhagic, and unspecified), systemic embolism (SSE), and all-cause mortality. Among individuals with AF, the prevalence of AF with concomitant VHD increased from 11.4% (527/4,613) in 1998 to 17.6% (7,014/39,868) in 2010 and also in individuals aged over 65 years. Those with mechanical valves, mitral stenosis (MS), or aortic stenosis had the highest risk of clinical events compared to AF patients with no VHD, in relative [hazard ratio (95% confidence interval): 1.13 (1.02-1.24), 1.20 (1.05-1.36), and 1.27 (1.19-1.37), respectively] and absolute (excess risk: 2.04, 4.20, and 6.37 per 100 person-years, respectively) terms. Of the 95.2% of individuals with an indication for warfarin (men and women with CHA2DS2-VASc ≥1 and ≥2, respectively), only 21.8% had a prescription 90 days prior to the study. CONCLUSION: Prevalence of VHD among individuals with AF increased from 1998 to 2010. Atrial fibrillation associated with aortic stenosis, MS, or mechanical valves (compared to AF without VHD) was associated with an excess absolute risk of stroke, SSE, and mortality, but anticoagulation was underused in the pre-direct oral anticoagulant (DOAC) era, highlighting the need for urgent clarity regarding DOACs in AF and concomitant VHD.
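The absolute excess risks above are differences in crude event rates per 100 person-years. A minimal sketch of that calculation, with illustrative counts that are not the paper's data:

```python
def excess_risk_per_100py(events_exposed, py_exposed, events_ref, py_ref):
    """Absolute excess risk: the difference between two crude event rates
    (events divided by person-years of follow-up), expressed per 100
    person-years, as in the 2.04-6.37 figures reported above."""
    rate_exposed = events_exposed / py_exposed
    rate_ref = events_ref / py_ref
    return (rate_exposed - rate_ref) * 100

# Hypothetical counts: AF with a VHD subtype vs AF with no VHD.
excess = excess_risk_per_100py(events_exposed=300, py_exposed=2000,
                               events_ref=1700, py_ref=20000)
```

Reporting both relative (hazard ratio) and absolute terms matters because a modest hazard ratio on a high baseline rate can still translate into a large absolute excess.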

    Machine learning models in electronic health records can outperform conventional survival models for predicting patient mortality in coronary artery disease

    Prognostic modelling is important in clinical practice and epidemiology for patient management and research. Electronic health records (EHR) provide large quantities of data for such models, but conventional epidemiological approaches require significant researcher time to implement. Expert selection of variables, fine-tuning of variable transformations and interactions, and imputing missing values are time-consuming and could bias subsequent analysis, particularly given that missingness in EHR is both high, and may carry meaning. Using a cohort of 80,000 patients from the CALIBER programme, we compared traditional modelling and machine-learning approaches in EHR. First, we used Cox models and random survival forests with and without imputation on 27 expert-selected, preprocessed variables to predict all-cause mortality. We then used Cox models, random forests and elastic net regression on an extended dataset with 586 variables to build prognostic models and identify novel prognostic factors without prior expert input. We observed that data-driven models used on an extended dataset can outperform conventional models for prognosis, without data preprocessing or imputing missing values. An elastic net Cox regression with 586 unimputed variables (continuous values discretised) achieved a C-index of 0.801 (bootstrapped 95% CI 0.799 to 0.802), compared to 0.793 (0.791 to 0.794) for a traditional Cox model comprising 27 expert-selected variables with imputation for missing values. We also found that data-driven models allow identification of novel prognostic variables; that the absence of values for particular variables carries meaning, and can have significant implications for prognosis; and that variables often have a nonlinear association with mortality, which discretised Cox models and random forests can elucidate. This demonstrates that machine-learning approaches applied to raw EHR data can be used to build models for use in research and clinical practice, and identify novel predictive variables and their effects to inform future research.
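The C-index used to compare the models above can be computed directly. A minimal sketch of Harrell's concordance on toy data (the quadratic pairwise loop is fine for illustration; survival libraries use faster algorithms):

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index: among comparable pairs, the fraction where the
    higher predicted risk corresponds to the earlier observed event.

    A pair (i, j) is comparable when subject i had an event (events[i] == 1)
    strictly before time j; censored subjects (events == 0) contribute only
    as the later member of a pair. Tied risk scores count as 0.5.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Toy data: a perfectly ranked model (higher risk score, earlier death).
c = concordance_index(times=[2, 5, 7, 9], events=[1, 1, 0, 1],
                      risk_scores=[0.9, 0.6, 0.5, 0.2])
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why the 0.801 vs 0.793 difference above, though small, is meaningful at this scale.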

    Long term health care use and costs in patients with stable coronary artery disease: a population-based cohort using linked electronic health records (CALIBER)

    Aims To examine long term health care utilisation and costs of patients with stable coronary artery disease (SCAD). Methods and results Linked cohort study of 94,966 patients with SCAD in England, 1st January 2001 to 31st March 2010, identified from primary care, secondary care, disease and death registries. Resource use and costs, and cost predictors by time and 5-year cardiovascular (CVD) risk profile were estimated using generalised linear models. Coronary heart disease hospitalisations were 20.5% in the first year and 66% in the year following a non-fatal (myocardial infarction, ischaemic or haemorrhagic stroke) event. Mean health care costs were £3,133 per patient in the first year and £10,377 in the year following a non-fatal event. First year predictors of cost included sex (mean cost £549 lower in females); SCAD diagnosis (NSTEMI cost £656 more than stable angina); and co-morbidities (heart failure cost £657 more per patient). Compared with lower risk patients (5-year CVD risk 3.5%), those of higher risk (5-year CVD risk 44.2%) had higher 5-year costs (£23,393 vs. £9,335) and lower lifetime costs (£43,020 vs. £116,888). Conclusion Patients with SCAD incur substantial health care utilisation and costs, which vary and may be predicted by 5-year CVD risk profile. Higher risk patients have higher initial but lower lifetime costs than lower risk patients as a result of shorter life expectancy. Improved cardiovascular survivorship among an ageing CVD population is likely to require stratified care in anticipation of the burgeoning demand.
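The seemingly paradoxical finding that higher-risk patients have higher 5-year but lower lifetime costs follows directly from survival-weighting annual costs. A toy sketch with illustrative figures (not the paper's estimates; the paper used generalised linear models on observed costs):

```python
def expected_lifetime_cost(annual_cost, annual_survival_prob, horizon_years=40):
    """Undiscounted expected cost over a horizon: annual cost accrues only
    while the patient survives, so a higher-risk (lower-survival) patient
    can accumulate less in total despite higher yearly spending."""
    total, alive = 0.0, 1.0
    for _ in range(horizon_years):
        total += annual_cost * alive
        alive *= annual_survival_prob
    return total

# Hypothetical annual costs and survival probabilities:
five_low = expected_lifetime_cost(1900, 0.99, horizon_years=5)
five_high = expected_lifetime_cost(4700, 0.88, horizon_years=5)
life_low = expected_lifetime_cost(1900, 0.99)
life_high = expected_lifetime_cost(4700, 0.88)
```

With these inputs the high-risk patient costs more over 5 years but less over a lifetime, reproducing the qualitative pattern in the abstract; a real analysis would also discount future costs.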

    Reproducible disease phenotyping at scale: Example of coronary artery disease in UK Biobank

    IMPORTANCE: A lack of internationally agreed standards for combining available data sources at scale risks inconsistent disease phenotyping, limiting research reproducibility. OBJECTIVE: To develop and then evaluate whether a rules-based algorithm can identify coronary artery disease (CAD) sub-phenotypes using electronic health records (EHR) and questionnaire data from UK Biobank (UKB). DESIGN: Case-control and cohort study. SETTING: Prospective cohort study of 502K individuals aged 40-69 years recruited between 2006 and 2010 into the UK Biobank, with linked hospitalization and mortality data and genotyping. PARTICIPANTS: We included all individuals for phenotyping into 6 predefined CAD phenotypes using hospital admission and procedure codes, mortality records and baseline survey data. Of these, 408,470 unrelated individuals of European descent had a polygenic risk score (PRS) for CAD estimated. EXPOSURE: CAD phenotypes. MAIN OUTCOMES AND MEASURES: Association with baseline risk factors, mortality (n = 14,419 over 7.8 years median follow-up), and a PRS for CAD. RESULTS: The algorithm classified individuals with CAD into prevalent MI (n = 4,900), incident MI (n = 4,621), prevalent CAD without MI (n = 10,910), incident CAD without MI (n = 8,668), prevalent self-reported MI (n = 2,754), and prevalent self-reported CAD without MI (n = 5,623), yielding 37,476 individuals with any type of CAD. Risk factors were similar across the six CAD phenotypes, except for fewer men in the self-reported CAD without MI group (46.7% vs 70.1% for the overall group). In age- and sex-adjusted survival analyses, mortality was highest following incident MI (HR 6.66, 95% CI 6.07-7.31) and lowest for prevalent self-reported CAD without MI at baseline (HR 1.31, 95% CI 1.15-1.50) compared to disease-free controls.
There were similar graded associations across the six phenotypes per SD increase in PRS, with the strongest association for prevalent MI (OR 1.50, 95% CI 1.46-1.55) and the weakest for prevalent self-reported CAD without MI (OR 1.08, 95% CI 1.05-1.12). The algorithm is available in the open HDR UK Phenotype Library (https://portal.caliberresearch.org/). CONCLUSIONS: An algorithmic, EHR-based approach distinguished six phenotypes of CAD with distinct survival and PRS associations, supporting the adoption of open approaches to help standardize CAD phenotyping, with wider potential value for reproducible research in other conditions.
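The "per SD increase in PRS" reporting above is a rescaling: when a logistic model is fitted on the raw score, the coefficient can be converted to an odds ratio per standard-deviation increment. A sketch with hypothetical numbers (`or_per_sd` is not from the paper):

```python
import math

def or_per_sd(beta_per_unit, score_sd):
    """Rescale a logistic regression coefficient estimated on the raw
    polygenic-score scale to an odds ratio per one-standard-deviation
    increase in the score: OR_per_SD = exp(beta * SD)."""
    return math.exp(beta_per_unit * score_sd)

# Hypothetical coefficient (per raw-score unit) and score SD:
or_sd = or_per_sd(beta_per_unit=0.8, score_sd=0.5)
```

Equivalently, standardising the score to mean 0 and SD 1 before fitting yields the per-SD OR directly as exp of the fitted coefficient.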
