Electronic Health Record-Derived Phenotyping Models to Improve Genomic Research in Stroke
Stroke is a highly heterogeneous and complex disease that is a leading cause of death in the United States. The landscape of risk factors for stroke is vast, and its large genetic burden has yet to be fully discovered. We hypothesize that the small number of stroke variants recovered so far is due to 1) the vast phenotypic heterogeneity of stroke and 2) binary labeling of stroke genome-wide association study (GWAS) participants as cases or controls. Specifically, genome-wide association studies accumulate hundreds of thousands to millions of participants to acquire adequate signal for variant discovery. This requires time-consuming manual curation of cases and controls often involving large-scale collaborations. Genetic biobanks connected to electronic health records (EHR) can facilitate these studies by using data routinely captured during clinical care like billing diagnosis codes. These data, however, do not define adjudicated cases and controls, with many patients falling somewhere in between. There is an opportunity to use machine learning to add nuance to these definitions. We hypothesize that an expanded definition of disease by incorporating correlated diseases and risk factors from EHR data will improve GWAS power. We also hypothesize that granularly subtyping stroke using unsupervised learning methods can provide insight into stroke etiology and heterogeneity. In Chapter 1, we described the motivation for building upon current phenotyping methods for subtyping and genome-wide association studies to improve GWAS power. In Chapter 2, using patients from Columbia-New York Presbyterian (NYP) Hospital, we built and evaluated machine learning models to identify patients with acute ischemic stroke based on 75 different case-control and classifier combinations. 
In Chapter 3, we compared two data-driven, unsupervised methods, non-negative matrix factorization (NMF) and Hierarchical Poisson Factorization, to subtype stroke patients and determined whether any of the subtypes correlate with stroke severity. In Chapter 4, we estimated the heritability of acute ischemic stroke by treating the patient probabilities assigned by the machine learning phenotyping models for acute ischemic stroke in Chapter 2 as a quantitative trait and mapping the probabilities to Columbia-NYP EHR-generated pedigrees. We also applied our machine learning phenotyping method, which we call QTPhenProxy, to venous thromboembolism in Columbia eMERGE Consortium patients and ran a genome-wide association study using the model probabilities as a quantitative trait. Finally, we applied QTPhenProxy to subjects in the UK Biobank for stroke and 14 other diseases and ran genome-wide association studies for each disease. We found that our machine-learned models performed well in identifying acute ischemic stroke patients in the Columbia-NYP EHR and in the UK Biobank. We also found some NMF-derived subtypes that were significantly correlated with stroke severity. We were underpowered in the eMERGE venous thromboembolism cohort GWAS and did not recover any known or new variants. Finally, we found that QTPhenProxy improved the power of GWAS of stroke and several subtypes in the UK Biobank, recovered known variants, and discovered a new variant that replicates in a previous stroke GWAS. Our results for QTPhenProxy demonstrate the promise of incorporating large but messy sets of data, such as the electronic health record, to improve signal in genome-wide association studies.
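The idea of using a model probability as a quantitative trait can be illustrated with a minimal sketch. It is not the QTPhenProxy implementation: the function name `dosage_regression`, the toy dosages, and the toy probabilities are all assumptions made for illustration. It fits an ordinary least-squares line of the trait on the allele dosage (0, 1, or 2 copies) for a single variant, which is the basic per-variant test behind a quantitative-trait association scan. A real analysis would also adjust for covariates such as age, sex, and ancestry, and would compute a significance test for every variant.

```python
from typing import List, Tuple


def dosage_regression(dosages: List[int], trait: List[float]) -> Tuple[float, float]:
    """Ordinary least squares of a quantitative trait on allele dosage.

    Returns (slope, intercept). The slope is the estimated change in the
    trait per additional copy of the allele.
    """
    n = len(dosages)
    if n != len(trait) or n < 2:
        raise ValueError("dosages and trait must have equal length >= 2")
    mean_x = sum(dosages) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("variant is monomorphic in this sample")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, trait))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy data (hypothetical): allele dosage per patient and the disease
# probability assigned to that patient by a phenotyping classifier.
toy_dosages = [0, 0, 1, 1, 2, 2]
toy_probabilities = [0.1, 0.1, 0.2, 0.2, 0.3, 0.3]
slope, intercept = dosage_regression(toy_dosages, toy_probabilities)
```

With these toy values the fitted slope is about 0.1 per allele copy. Using a continuous probability keeps patients who fall between clear cases and clear controls in the analysis instead of forcing them into a binary label.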
Automatic classification of registered clinical trials towards the Global Burden of Diseases taxonomy of diseases and injuries
Includes details on the implementation of MetaMap and IntraMap, prioritization rules, the test set of clinical trials and the classification of the external test set according to the 171 GBD categories. Dataset S1: Expert-based enrichment database for the classification according to the 28 GBD categories. Manual classification of 503 UMLS concepts that could not be mapped to any of the 28 GBD categories. Dataset S2: Expert-based enrichment database for the classification according to the 171 GBD categories. Manual classification of 655 UMLS concepts that could not be mapped to any of the 171 GBD categories, among which 108 could be projected to candidate GBD categories. Table S1: Excluded residual GBD categories for the grouping of the GBD cause list in 171 GBD categories. A grouping of 193 GBD categories was defined during the GBD 2010 study to inform policy makers about the main health problems per country. From these 193 GBD categories, we excluded the 22 residual categories listed in the Table. We developed a classifier for the remaining 171 GBD categories. Among these residual categories, the unique excluded categories in the grouping of 28 GBD categories were “Other infectious diseases” and “Other endocrine, nutritional, blood, and immune disorders”. Table S2: Per-category evaluation of performance of the classifier for the 171 GBD categories plus the “No GBD” category. Number of trials per GBD category from the test set of 2,763 clinical trials. Sensitivities, specificities (in %) and likelihood ratios for each of the 171 GBD categories plus the “No GBD” category for the classifier using the Word Sense Disambiguation server, the expert-based enrichment database and the priority to the health condition field. Table S3: Performance of the 8 versions of the classifier for the 171 GBD categories. Exact-matching and weighted averaged sensitivities and specificities for 8 versions of the classifier for the 171 GBD categories. 
Exact-matching corresponds to the proportion (in %) of trials for which the automatic GBD classification is correct. Exact-matching was estimated over all trials (N = 2,763), trials concerning a unique GBD category (N = 2,092), trials concerning 2 or more GBD categories (N = 187), and trials not relevant for the GBD (N = 484). The weighted averaged sensitivity and specificity correspond to the weighted average across GBD categories of the sensitivities and specificities for each GBD category plus the “No GBD” category (in %). The 8 versions correspond to the combinations of using or not using the Word Sense Disambiguation server during the text annotation, the expert-based enrichment database, and the priority to the health condition field as a prioritization rule. Table S4: Per-category evaluation of the performance of the baseline for the 28 GBD categories plus the “No GBD” category. Number of trials per GBD category from the test set of 2,763 clinical trials. Sensitivities and specificities (in %) of the 28 GBD categories plus the “No GBD” category for the classification of clinical trial records towards GBD categories without using the UMLS knowledge source, based only on the recognition in free text of the names of the diseases defining each GBD category. For the baseline, a clinical trial record was classified with a GBD category if at least one of the 291 disease names from the GBD cause list defining that GBD category appeared verbatim in the condition field, the public or scientific titles, separately, or in at least one of these three text fields. (DOCX 84 kb)
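The evaluation measures described above can be sketched as follows. This is a hedged illustration, not the authors' code: the function names and toy category labels are hypothetical, and weighting each category by its number of trials is an assumption, since the text does not state which weights were used.

```python
from typing import List, Set, Tuple

Labels = List[Set[str]]  # one set of categories per trial


def exact_match_rate(truth: Labels, pred: Labels) -> float:
    """Percentage of trials whose predicted category set equals the true set."""
    hits = sum(1 for t, p in zip(truth, pred) if t == p)
    return 100.0 * hits / len(truth)


def per_category_rates(truth: Labels, pred: Labels, category: str) -> Tuple[float, float, int]:
    """Return (sensitivity, specificity, number of true-positive trials + missed trials)."""
    tp = fn = tn = fp = 0
    for t, p in zip(truth, pred):
        if category in t:
            if category in p:
                tp += 1
            else:
                fn += 1
        elif category in p:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity, tp + fn


def weighted_average_rates(truth: Labels, pred: Labels, categories: List[str]) -> Tuple[float, float]:
    """Sensitivity and specificity (in %) averaged across categories.

    Assumption: each category is weighted by its number of trials.
    """
    total = 0
    sens_sum = 0.0
    spec_sum = 0.0
    for c in categories:
        sens, spec, n = per_category_rates(truth, pred, c)
        sens_sum += n * sens
        spec_sum += n * spec
        total += n
    return 100.0 * sens_sum / total, 100.0 * spec_sum / total


# Toy example with hypothetical labels; "No GBD" marks trials not relevant to the GBD.
toy_truth = [{"Stroke"}, {"Stroke", "Diabetes"}, {"No GBD"}, {"Diabetes"}]
toy_pred = [{"Stroke"}, {"Stroke"}, {"No GBD"}, {"Diabetes"}]
toy_exact = exact_match_rate(toy_truth, toy_pred)
toy_sens, toy_spec = weighted_average_rates(
    toy_truth, toy_pred, ["Stroke", "Diabetes", "No GBD"]
)
```

In the toy example, three of four trials match exactly (75%). The second trial misses "Diabetes", which lowers the weighted sensitivity to 80% while specificity stays at 100%.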
Comparative analysis, applications, and interpretation of electronic health record-based stroke phenotyping methods
Background
Accurate identification of acute ischemic stroke (AIS) patient cohorts is essential for a wide range of clinical investigations. Automated phenotyping methods that leverage electronic health records (EHRs) represent a fundamentally new approach to cohort identification, one that avoids the current laborious and ungeneralizable generation of phenotyping algorithms. We systematically compared and evaluated the ability of machine learning algorithms and case-control combinations to phenotype acute ischemic stroke patients using data from an EHR.
Materials and methods
Using structured patient data from the EHR at a tertiary-care hospital system, we built and evaluated machine learning models to identify patients with AIS based on 75 different case-control and classifier combinations. We then estimated the prevalence of AIS patients across the EHR. Finally, we externally validated the ability of the models to detect AIS patients without AIS diagnosis codes using the UK Biobank.
Results
Across all models, we found that the mean AUROC for detecting AIS was 0.963 ± 0.0520 and the average precision score was 0.790 ± 0.196 with minimal feature processing. Classifiers trained with cases with AIS diagnosis codes and controls with no cerebrovascular disease codes had the best average F1 score (0.832 ± 0.0383). In the external validation, we found that the top probabilities from a model-predicted AIS cohort were significantly enriched for AIS patients without AIS diagnosis codes (60–150 fold over expected).
Conclusions
Our findings support machine learning algorithms as a generalizable way to accurately identify AIS patients without using process-intensive manual feature curation. When a set of AIS patients is unavailable, diagnosis codes may be used to train classifier models.
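Two of the metrics reported above, AUROC and the F1 score, can be computed from labels and model probabilities as in the sketch below. It is a plain-Python illustration on made-up values, not the study's evaluation code. The function names, the 0.5 decision threshold, and the toy data are assumptions.

```python
from typing import List


def auroc(labels: List[int], scores: List[float]) -> float:
    """Probability that a random case outscores a random control (ties count 0.5)."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def f1_score(labels: List[int], scores: List[float], threshold: float = 0.5) -> float:
    """Harmonic mean of precision and recall at an assumed decision threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): 1 = AIS case, 0 = control, with classifier probabilities.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)
toy_f1 = f1_score(toy_labels, toy_scores)
```

On the toy data, cases outscore controls in 8 of 9 pairs, so the AUROC is 8/9. At the 0.5 threshold, precision and recall are both 2/3, which gives an F1 of 2/3. AUROC does not depend on a threshold, whereas F1 does, so the two can rank the same models differently.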
PIKS: A Technique to Identify Actionable Trends for Policy-Makers Through Open Healthcare Data
With calls for increasing transparency, governments are releasing greater
amounts of data in multiple domains including finance, education and
healthcare. The efficient exploratory analysis of healthcare data constitutes a
significant challenge. Key concerns in public health include the quick
identification and analysis of trends, and the detection of outliers. This
allows policies to be rapidly adapted to changing circumstances. We present an
efficient outlier detection technique, termed PIKS (Pruned iterative-k means
searchlight), which combines an iterative k-means algorithm with a pruned
searchlight based scan. We apply this technique to identify outliers in two
publicly available healthcare datasets from the New York Statewide Planning and
Research Cooperative System, and California's Office of Statewide Health
Planning and Development. We provide a comparison of our technique with three
other existing outlier detection techniques, consisting of auto-encoders,
isolation forests and feature bagging. We identified outliers in conditions
including suicide rates, immunity disorders, social admissions,
cardiomyopathies, and pregnancy in the third trimester. We demonstrate that the
PIKS technique produces results consistent with other techniques such as the
auto-encoder. However, the auto-encoder needs to be trained, which requires
several parameters to be tuned. In comparison, the PIKS technique has far fewer
parameters to tune. This makes it advantageous for fast, "out-of-the-box" data
exploration. The PIKS technique is scalable and can readily ingest new
datasets. Hence, it can provide valuable, up-to-date insights to citizens,
patients and policy-makers. We have made our code open source, and with the
availability of open data, other researchers can easily reproduce and extend
our work. This will help promote a deeper understanding of healthcare policies
and public health issues.
Exploiting electronic health records for research on atrial fibrillation: risk factors, subtypes, and outcomes
BACKGROUND: Electronic health records (EHRs), collected on large populations in routine clinical care, may hold novel insights into the heart rhythm disorder atrial fibrillation (AF). AIM: To exploit EHRs to investigate, validate and extend evidence for AF risk factors, subtypes, and outcomes. METHODS: The CALIBER dataset (1997–2010) linking primary care, secondary care, and mortality records for a representative subset of the UK population was used (i) to model associations between cardiovascular disease (CVD) risk factors and incident AF, including AF with (AF+) and AF without (AF–) intercurrent CVD, (ii) to create EHR definitions for eight AF subtypes (structural, focal, polygenic, postoperative, valvular, monogenic, respiratory and AF in athletes) and (iii) to investigate stroke outcomes by CHA2DS2-VASc, sex, and warfarin use. RESULTS: Among 1,949,052 individuals, 50,097 developed incident AF: 12,652 (25.3%) with AF+ and 37,445 (74.7%) with AF–. Smoking (HR [95%CI] for AF+ vs. AF–: 1.66 [1.56,1.77] vs. 1.21 [1.16,1.25]), hypertension (2.19 [2.11,2.27] vs. 1.65 [1.62,1.69]), and diabetes (2.03 [1.94,2.12] vs. 1.45 [1.41,1.49]) showed consistent direct associations with AF+ and AF–, while heavy drinking (1.17 [0.81,1.67] vs. 1.99 [1.68,2.34]) and total cholesterol levels (0.99 [0.96,1.02] vs. 0.85 [0.84,0.87]) showed inconsistent associations with AF+ and AF–. EHR definitions for AF subtypes were created by combining 2813 diagnosis, medication, and procedure codes. There were 12,751 individuals with AF and valvular heart disease. Prosthetic replacements, mitral stenosis and aortic stenosis showed higher HR [95%CI] for stroke, thromboembolism and mortality (1.13 [1.02,1.24], 1.20 [1.05,1.36], and 1.27 [1.19,1.37] respectively). The net-clinical benefit (NCB [95%CI] per 100 person-years) of warfarin was shown from CHA2DS2-VASc≥2 in men (0.5 [0.1,0.9]) and CHA2DS2-VASc≥3 in women (1.5 [1.1,1.9]). 
CONCLUSION: AF is a heterogeneous condition associated with diverse disease mechanisms. EHRs can help refine understanding of risk factors, subtypes, and outcomes with relevance for clinical practice.
Medical Informatics
Information technology has been revolutionizing the everyday life of the common man, while medical science has been making rapid strides in understanding disease mechanisms, developing diagnostic techniques and effecting successful treatment regimens, even for those cases which would have been classified as having a poor prognosis a decade earlier. The confluence of information technology and biomedicine has brought into its ambit additional dimensions of computerized databases for patient conditions, revolutionizing the way health care and patient information is recorded, processed, interpreted and utilized for improving the quality of life. This book consists of seven chapters dealing with the three primary issues of medical information acquisition from a patient's and health care professional's perspective, translational approaches from a researcher's point of view, and finally the application potential as required by clinicians and physicians. The book covers modern issues in Information Technology, Bioinformatics Methods and Clinical Applications. The chapters describe the basic process of acquisition of information in a health system, recent technological developments in biomedicine and the realistic evaluation of medical informatics.
Investigating penetrance of rare genetic variants using population cohorts
The same genetic variant found in different individuals can cause a spectrum of phenotypes, with some individuals showing no signs of any clinical illness, and some displaying severe illness. Variants that cause this can be said to show incomplete penetrance, where the related genotype either causes clinical disease or not, or they can be said to display variable expressivity, in which the clinical symptoms can vary across a spectrum. Incomplete penetrance and variable expressivity are both thought to be influenced by a large number of factors, including genetic modifiers, epigenetics, and environmental factors.
Many thousands of genetic variants have been identified as causal of monogenic disorders, mostly determined through small clinical studies, and thus the penetrance and expressivity of these variants may be overestimated when compared to their effect in the general population. With the wealth of population cohort data currently available, the penetrance and expressivity of such genetic variants can be investigated across a much wider contingent, potentially helping to reclassify variants that were previously thought to be completely penetrant.
This thesis aims to investigate the penetrance and expressivity of rare genetic variants in large population cohorts, and to potentially identify any genetic modifiers that could also affect the phenotypic effect of these variants, including the presence of other rare variants, and the aggregation of small effect common variants. We show that putatively damaging variants in a large number of genes are present at a higher rate than previously expected in healthy population cohorts. Furthermore, we show that as an aggregate, individuals who carry one of these variants have sub-clinical phenotypes related to the traits seen in clinical disease cases with variants in similar genes. We also show that the penetrance and expressivity of these rare variants can be modified by the presence of other rare variants in similar genes, and through common genetic variants, aggregated as polygenic scores. We then investigate methods of identifying rare non-coding variants that could be potential genetic modifiers.
Predicting a diagnosis of ankylosing spondylitis using primary care health records–A machine learning approach
Ankylosing spondylitis (AS) is the second most common cause of inflammatory arthritis. However, a successful diagnosis can take a decade to confirm from symptom onset (via x-rays). The aim of this study was to use machine learning methods to develop a profile of the characteristics of people who are likely to be given a diagnosis of AS in future. The Secure Anonymised Information Linkage databank was used. Patients with ankylosing spondylitis were identified using their routine data and matched with controls who had no record of a diagnosis of ankylosing spondylitis or axial spondyloarthritis. Data was analysed separately for men and women. The model was developed using feature/variable selection and principal component analysis to develop decision trees. The decision tree with the highest average F value was selected and validated with a test dataset. The model for men indicated that lower back pain, uveitis, and NSAID use under age 20 are associated with AS development. The model for women showed an older age of symptom presentation compared to men with back pain and multiple pain relief medications. The models showed good prediction (positive predictive value 70%-80%) in test data but in the general population where prevalence is very low (0.09% of the population in this dataset) the positive predictive value would be very low (0.33%-0.25%). Machine learning can be used to help profile and understand the characteristics of people who will develop AS, and in test datasets with artificially high prevalence, will perform well. However, when applied to a general population with low prevalence rates, such as that in primary care, the positive predictive value for even the best model would be 1.4%. Multiple models may be needed to narrow down the population over time to improve the predictive value and therefore reduce the time to diagnosis of ankylosing spondylitis.
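The dependence of positive predictive value (PPV) on prevalence follows from Bayes' rule: PPV = sensitivity × prevalence / (sensitivity × prevalence + (1 − specificity) × (1 − prevalence)). The sketch below applies that formula. The sensitivity and specificity of 0.8 are hypothetical values chosen for illustration, not figures reported by the study. Only the 0.09% prevalence is taken from the abstract.

```python
def positive_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV from Bayes' rule: true positives divided by all positive predictions."""
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    if true_positives + false_positives == 0.0:
        raise ValueError("no positive predictions are possible with these inputs")
    return true_positives / (true_positives + false_positives)


# Hypothetical classifier: sensitivity and specificity are assumed, not from the study.
ASSUMED_SENSITIVITY = 0.8
ASSUMED_SPECIFICITY = 0.8

# Balanced test set (artificially high prevalence) versus the general population
# prevalence of 0.09% quoted in the abstract.
ppv_balanced = positive_predictive_value(ASSUMED_SENSITIVITY, ASSUMED_SPECIFICITY, 0.5)
ppv_population = positive_predictive_value(ASSUMED_SENSITIVITY, ASSUMED_SPECIFICITY, 0.0009)
```

With the same assumed classifier, the PPV is about 80% in a balanced test set but falls well below 1% at a prevalence of 0.09%, because false positives from the very large unaffected group outnumber the true positives.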