5,847 research outputs found

    Natural Language Processing of Clinical Notes on Chronic Diseases: Systematic Review

    Novel approaches that complement and go beyond evidence-based medicine are required in the domain of chronic diseases, given the growing incidence of such conditions in the worldwide population. A promising avenue is the secondary use of electronic health records (EHRs), where patient data are analyzed to conduct clinical and translational research. Methods based on machine learning to process EHRs are resulting in improved understanding of patient clinical trajectories and chronic disease risk prediction, creating a unique opportunity to derive previously unknown clinical insights. However, a wealth of clinical histories remains locked behind clinical narratives in free-form text. Consequently, unlocking the full potential of EHR data is contingent on the development of natural language processing (NLP) methods to automatically transform clinical text into structured clinical data that can guide clinical decisions and potentially delay or prevent disease onset.
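
    A minimal illustration of the kind of transformation this review surveys: turning free-text clinical narrative into structured data. A real clinical NLP pipeline would use far more than a regular expression; the note text and pattern below are invented for illustration only.

```python
import re

# Invented free-text clinical note (not from any real dataset)
note = "Patient with T2DM, poorly controlled. HbA1c 8.2% on 2023-04-01. BP 142/88."

# Extract a structured (test, value, unit) triple from the narrative
match = re.search(r"HbA1c\s+(\d+(?:\.\d+)?)\s*%", note)
structured = {"test": "HbA1c", "value": float(match.group(1)), "unit": "%"}
print(structured)  # {'test': 'HbA1c', 'value': 8.2, 'unit': '%'}
```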

    Exploring the Danish Diseasome


    Machine Learning for Diabetes and Mortality Risk Prediction From Electronic Health Records

    Data science can provide invaluable tools to better exploit healthcare data to improve patient outcomes and increase cost-effectiveness. Today, electronic health record (EHR) systems provide a fascinating array of data that data science applications can use to revolutionise the healthcare industry. Utilising EHR data to improve the early diagnosis of a variety of medical conditions/events is a rapidly developing area that, if successful, can help to improve healthcare services across the board. Specifically, as Type-2 Diabetes Mellitus (T2DM) represents one of the most serious threats to health across the globe, analysing the huge volumes of data provided by EHR systems to investigate approaches for the early and accurate prediction of T2DM onset, and of medical events such as in-hospital mortality, are two of the most important challenges data science currently faces. The present thesis addresses these challenges by examining the research gaps in the existing literature, pinpointing uninvestigated areas, and proposing novel machine learning models suited to the difficulties inherent in EHR data. To achieve these aims, the present thesis firstly introduces a unique and large EHR dataset collected from Saudi Arabia. Then we investigate the use of state-of-the-art machine learning predictive models that exploit this dataset for diabetes diagnosis and the early identification of patients with pre-diabetes by predicting the blood levels of one of the main indicators of diabetes and pre-diabetes: elevated Glycated Haemoglobin (HbA1c) levels. A novel collaborative denoising autoencoder (Col-DAE) framework is adopted to predict diabetic (high) HbA1c levels. We also employ several machine learning approaches (random forest, logistic regression, support vector machine, and multilayer perceptron) for the identification of patients with pre-diabetes (elevated HbA1c levels).
The models employed demonstrate that a patient's risk of diabetes/pre-diabetes can be reliably predicted from EHR records. We then extend this work by adopting recent explainability methods to investigate the outcomes of the predictive models employed. This work also investigates the effect of using longitudinal data and more of the features available in EHR systems on the performance and feature ranking of the employed machine learning models for predicting elevated HbA1c levels in non-diabetic patients. This work demonstrates that longitudinal data and additional EHR features can improve the performance of the machine learning models and can affect the relative order of importance of the features. Secondly, we develop a machine learning model for the early and accurate prediction of all in-hospital mortality events for such patients utilising EHR data. This work investigates a novel application of the stacked denoising autoencoder (SDA) to predict in-hospital patient mortality risk. In doing so, we demonstrate how our approach uniquely overcomes the issues associated with imbalanced datasets to which existing solutions are subject. The proposed model, using clinical patient data on a variety of health conditions and without intensive feature engineering, is demonstrated to achieve robust and promising results using EHR patient data recorded during the first 24 hours after admission.
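
    A hedged sketch of the classifier comparison described above: predicting an elevated-HbA1c label from tabular EHR-style features with the four model families named in the abstract. The feature set and synthetic data are invented for illustration; this is not the Saudi Arabian dataset or the thesis code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
# Illustrative numeric features (e.g. age, BMI, fasting glucose, systolic BP)
X = rng.normal(size=(n, 4))
# Synthetic "elevated HbA1c" label loosely tied to the features
y = (X @ np.array([0.8, 0.6, 1.2, 0.3]) + rng.normal(scale=1.0, size=n)) > 0.5

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True, random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
}
aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {aucs[name]:.3f}")
```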

    Impact of Terminology Mapping on Population Health Cohorts (IMPaCt)

    Background and Objectives: The population health care delivery model uses phenotype algorithms in the electronic health record (EHR) system to identify patient cohorts targeted for clinical interventions such as laboratory tests and procedures. The standard terminology used to identify disease cohorts may contribute to significant variation in error rates for patient inclusion or exclusion. The United States requires EHR systems to support two diagnosis terminologies, the International Classification of Diseases (ICD) and the Systematized Nomenclature of Medicine (SNOMED). Terminology mapping enables the retrieval of diagnosis data using either terminology. There are no standards of practice by which to evaluate and report the operational characteristics of ICD and SNOMED value sets used to select patient groups for population health interventions. Establishing a best practice for terminology selection is a step forward in ensuring that the right patients receive the right intervention at the right time. The research question is, "How do the diagnosis retrieval terminology (ICD vs SNOMED) and terminology map maintenance impact population health cohorts?" Aims 1 and 2 explore this question, and Aim 3 informs practice and policy for population health programs.
    Methods. Aim 1: Quantify the impact of terminology choice (ICD vs SNOMED). ICD and SNOMED phenotype algorithms for diabetes, chronic kidney disease (CKD), and heart failure were developed using matched sets of codes from the Value Set Authority Center. The performance of the diagnosis-only phenotypes was compared to a published reference standard that included diagnosis codes, laboratory results, procedures, and medications. Aim 2: Measure the impact of terminology maintenance on SNOMED cohorts. For each disease state, the performance of a single SNOMED algorithm before and after terminology updates was evaluated against a reference standard to identify and quantify cohort changes introduced by terminology maintenance. Aim 3: Recommend methods for improving population health interventions. The socio-technical model for studying health information technology was used to inform best practice for the use of population health interventions.
    Results. Aim 1: ICD-10 value sets had better sensitivity than SNOMED for diabetes (.829 vs .662) and CKD (.242 vs .225) (N=201,713). Aim 2: Following terminology maintenance, the SNOMED algorithm for diabetes increased in sensitivity from .662 to .683. Aim 3: Based on observed social and technical challenges to population health programs, including and in addition to the development and measurement of phenotypes, a practical method was proposed for population health intervention development and reporting.
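
    A minimal sketch (not the dissertation's actual code) of the sensitivity comparison described above: two diagnosis-only value sets are scored against a reference-standard cohort. The patient IDs and cohort memberships are invented to show the mechanics of the metric.

```python
# Reference standard: patients confirmed to have the disease
# (in the study, via diagnosis codes, labs, procedures, and medications)
reference = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

# Patients retrieved by each diagnosis-only phenotype (invented)
icd_cohort = {1, 2, 3, 4, 5, 6, 7, 8, 11}   # ICD value set
snomed_cohort = {1, 2, 3, 4, 5, 6, 12, 13}  # SNOMED value set

def sensitivity(retrieved, reference):
    """True positives divided by all reference-standard cases."""
    return len(retrieved & reference) / len(reference)

print(f"ICD sensitivity:    {sensitivity(icd_cohort, reference):.2f}")    # 0.80
print(f"SNOMED sensitivity: {sensitivity(snomed_cohort, reference):.2f}")  # 0.60
```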

    Identifying and mitigating biases in EHR laboratory tests

    Electronic health record (EHR) data show promise for deriving new ways of modeling human disease states. Although EHR researchers often use numerical values of laboratory tests as features in disease models, a great deal of information is contained in the context within which a laboratory test is taken. For example, the same numerical value of a creatinine test has different interpretations for a chronic kidney disease patient and a patient with acute kidney injury. We study whether EHR research studies are subject to biased results and interpretations if laboratory measurements taken in different contexts are not explicitly separated. We show that the context of a laboratory test measurement can often be captured by the way the test is measured through time. We perform three tasks to study the properties of these temporal measurement patterns. In the first task, we confirm that laboratory test measurement patterns provide additional information to the stand-alone numerical value. The second task identifies three measurement pattern motifs across a set of 70 laboratory tests performed for over 14,000 patients. Of these, one motif exhibits properties that can lead to biased research results. In the third task, we demonstrate the potential for biased results on a specific example. We conduct an association study of lipase test values to acute pancreatitis. We observe a diluted signal when using only a lipase value threshold, whereas the full association is recovered when properly accounting for lipase measurements in different contexts (leveraging the lipase measurement patterns to separate the contexts). Aggregating EHR data without separating distinct laboratory test measurement patterns can intermix patients with different diseases, leading to the confounding of signals in large-scale EHR analyses. This paper presents a methodology for leveraging measurement frequency to identify and reduce laboratory test biases.
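
    A toy illustration (not the paper's data or method) of how pooling laboratory measurements from different contexts can dilute an association. The counts below are invented purely to show the mechanism: a strong lipase-pancreatitis signal in a diagnostic-work-up context is weakened when mixed with routine-panel measurements.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk in the exposed group divided by risk in the unexposed group."""
    return (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)

# Context A: lipase ordered as a diagnostic work-up (strong association)
# elevated lipase -> pancreatitis in 40/100; normal lipase -> 5/100
rr_diagnostic = relative_risk(40, 100, 5, 100)  # 8.0

# Context B: lipase measured as part of a routine panel (weak association)
rr_routine = relative_risk(3, 100, 2, 100)      # 1.5

# Pooled, ignoring context: the strong signal is diluted
rr_pooled = relative_risk(40 + 3, 200, 5 + 2, 200)
print(rr_diagnostic, rr_routine, round(rr_pooled, 2))
```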

    Exploiting electronic health records for research on atrial fibrillation: risk factors, subtypes, and outcomes

    BACKGROUND: Electronic health records (EHRs), collected on large populations in routine clinical care, may hold novel insights into the heart rhythm disorder atrial fibrillation (AF). AIM: To exploit EHRs to investigate, validate and extend evidence for AF risk factors, subtypes, and outcomes. METHODS: The CALIBER dataset (1997–2010) linking primary care, secondary care, and mortality records for a representative subset of the UK population was used (i) to model associations between cardiovascular disease (CVD) risk factors and incident AF, including AF with (AF+) and AF without (AF–) intercurrent CVD, (ii) to create EHR definitions for eight AF subtypes (structural, focal, polygenic, postoperative, valvular, monogenic, respiratory and AF in athletes) and (iii) to investigate stroke outcomes by CHA2DS2-VASc, sex, and warfarin use. RESULTS: Among 1,949,052 individuals, 50,097 developed incident AF: 12,652 (25.3%) with AF+ and 37,445 (74.7%) with AF–. Smoking (HR [95%CI] for AF+ vs. AF–: 1.66 [1.56,1.77] vs. 1.21 [1.16,1.25]), hypertension (2.19 [2.11,2.27] vs. 1.65 [1.62,1.69]), and diabetes (2.03 [1.94,2.12] vs. 1.45 [1.41,1.49]) showed consistent direct associations with AF+ and AF–, while heavy drinking (1.17 [0.81,1.67] vs. 1.99 [1.68,2.34]) and total cholesterol levels (0.99 [0.96,1.02] vs. 0.85 [0.84,0.87]) showed inconsistent associations with AF+ and AF–. EHR definitions for AF subtypes were created by combining 2813 diagnosis, medication, and procedure codes. There were 12,751 individuals with AF and valvular heart disease. Prosthetic replacements, mitral stenosis and aortic stenosis showed higher HR [95%CI] for stroke, thromboembolism and mortality (1.13 [1.02,1.24], 1.20 [1.05,1.36], and 1.27 [1.19,1.37] respectively). The net-clinical benefit (NCB [95%CI] per 100 person-years) of warfarin was shown from CHA2DS2-VASc≥2 in men (0.5 [0.1,0.9]) and CHA2DS2-VASc≥3 in women (1.5 [1.1,1.9]). 
    CONCLUSION: AF is a heterogeneous condition associated with diverse disease mechanisms. EHRs can help refine understanding of risk factors, subtypes, and outcomes with relevance for clinical practice.
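
    A hedged sketch of a net-clinical-benefit (NCB) calculation of the kind reported above. In the common formulation (e.g. Singer et al.), strokes prevented by warfarin are weighed against the intracranial haemorrhages (ICH) it causes, with ICH typically weighted around 1.5; the event rates below are invented per 100 person-years and do not come from CALIBER.

```python
def net_clinical_benefit(stroke_off, stroke_on, ich_off, ich_on, weight=1.5):
    """(strokes prevented) - weight * (excess ICH), per 100 person-years."""
    return (stroke_off - stroke_on) - weight * (ich_on - ich_off)

# Invented illustrative rates per 100 person-years for one risk stratum
ncb = net_clinical_benefit(stroke_off=2.0, stroke_on=1.2, ich_off=0.2, ich_on=0.4)
print(round(ncb, 2))  # 0.5 -> warfarin shows net benefit in this stratum
```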

    Patient safety in English general practice: the role of routinely collected data in detecting adverse events

    The use of routinely collected, or administrative, data for measuring and monitoring patient safety in primary care is a relatively new phenomenon. With increasing availability of data from different sources and care settings, their application for adverse event surveillance needs evaluation. In this thesis, I demonstrated that data routinely collected from primary care and secondary care can be applied for internal monitoring of adverse events at the general practice level in England, but these data currently have limited use for safety benchmarking in primary care. To support this statement, multiple approaches were adopted. In the first part of the thesis, the nature and scope of patient safety issues in general practice were defined by evidence from a literature review and informal consultations with general practitioners (GPs). Secondly, using these two methods, measures of adverse events based on routinely collected healthcare data were identified. Thirdly, clinical consensus guided the selection of three candidate patient safety indicators for investigation; the safety issues explored in this thesis were recorded incidents with designated adverse event diagnostic codes, and complications associated with two common diseases: emergency admissions for diabetic hyperglycaemic emergencies (diabetic ketoacidosis (DKA) and hyperglycaemic hyperosmolar state (HHS)) and cancer. In the second part of the thesis, the contributions of routinely collected data to new knowledge about potentially preventable adverse events in England were considered. Data from a primary care trust (NHS Brent), national primary care data (from the General Practice Research Database, GPRD) and secondary care data (Hospital Episode Statistics, HES) were used to explore the epidemiology of, and patient characteristics associated with, coded adverse events and emergency admissions for diabetic hyperglycaemic emergencies and cancer.
Low rates of adverse events were found, with variation by individual patient factors. Finally, recommendations were made on extending the uses of routinely collected data for patient safety monitoring in general practice.