
    Electronic Health Record-Derived Phenotyping Models to Improve Genomic Research in Stroke

    Stroke is a highly heterogeneous and complex disease that is a leading cause of death in the United States. The landscape of risk factors for stroke is vast, and its large genetic burden has yet to be fully discovered. We hypothesize that the small number of stroke variants recovered so far is due to 1) the vast phenotypic heterogeneity of stroke and 2) binary labeling of stroke genome-wide association study (GWAS) participants as cases or controls. Specifically, genome-wide association studies accumulate hundreds of thousands to millions of participants to acquire adequate signal for variant discovery. This requires time-consuming manual curation of cases and controls often involving large-scale collaborations. Genetic biobanks connected to electronic health records (EHR) can facilitate these studies by using data routinely captured during clinical care like billing diagnosis codes. These data, however, do not define adjudicated cases and controls, with many patients falling somewhere in between. There is an opportunity to use machine learning to add nuance to these definitions. We hypothesize that an expanded definition of disease by incorporating correlated diseases and risk factors from EHR data will improve GWAS power. We also hypothesize that granularly subtyping stroke using unsupervised learning methods can provide insight into stroke etiology and heterogeneity. In Chapter 1, we described the motivation for building upon current phenotyping methods for subtyping and genome-wide association studies to improve GWAS power. In Chapter 2, using patients from Columbia-New York Presbyterian (NYP) Hospital, we built and evaluated machine learning models to identify patients with acute ischemic stroke based on 75 different case-control and classifier combinations. 
In Chapter 3, we compared two data-driven, unsupervised methods, non-negative matrix factorization (NMF) and Hierarchical Poisson Factorization, to subtype stroke patients and determined whether any of the subtypes correlated with stroke severity. In Chapter 4, we estimated the heritability of acute ischemic stroke by treating the patient probabilities assigned by the machine learning phenotyping models for acute ischemic stroke in Chapter 2 as a quantitative trait and mapping the probabilities to Columbia-NYP EHR-generated pedigrees. We also applied our machine learning phenotyping method, which we call QTPhenProxy, to venous thromboembolism in Columbia eMERGE Consortium patients and ran a genome-wide association study using the model probabilities as a quantitative trait. Finally, we applied QTPhenProxy to subjects in the UK Biobank for stroke and 14 other diseases and ran genome-wide association studies for each disease. We found that our machine-learned models performed well in identifying acute ischemic stroke patients in the Columbia-NYP EHR and in the UK Biobank. We also found some NMF-derived subtypes that were significantly correlated with stroke severity. We were underpowered in the eMERGE venous thromboembolism cohort GWAS and did not recover any known or new variants. Finally, we found that QTPhenProxy improved the power of GWAS of stroke and several subtypes in the UK Biobank, recovered known variants, and discovered a new variant that replicates in a previous stroke GWAS. Our results for QTPhenProxy demonstrate the promise of incorporating large but messy sets of data, such as the electronic health record, to improve signal in genome-wide association studies.
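The idea of using a classifier's per-patient probability as a quantitative trait can be illustrated with a minimal single-variant regression. This is only a sketch under stated assumptions, not the QTPhenProxy implementation: the function names, the simulated cohort, the allele frequency, and the effect size are all hypothetical, and a real GWAS would also adjust for covariates such as ancestry principal components.

```python
import math
import random


def single_variant_association(dosages, trait):
    """Least-squares regression of a quantitative trait on allele dosage.

    Returns (beta, standard_error, z_score) for the dosage coefficient.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, trait))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    # Residual sum of squares gives the residual variance estimate.
    rss = sum((y - intercept - beta * x) ** 2 for x, y in zip(dosages, trait))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta, se, beta / se


def simulate_cohort(n=2000, allele_freq=0.3, effect=0.05, seed=0):
    """Hypothetical cohort: dosages in {0, 1, 2} and a probability-like trait."""
    rng = random.Random(seed)
    dosages, trait = [], []
    for _ in range(n):
        # Two independent allele draws give the dosage (0, 1, or 2).
        g = (rng.random() < allele_freq) + (rng.random() < allele_freq)
        # Stand-in for a classifier probability, clipped to [0, 1].
        p = 0.2 + effect * g + rng.gauss(0.0, 0.1)
        dosages.append(g)
        trait.append(min(1.0, max(0.0, p)))
    return dosages, trait


if __name__ == "__main__":
    beta, se, z = single_variant_association(*simulate_cohort())
    print(f"beta={beta:.4f} se={se:.4f} z={z:.2f}")
```

A continuous trait lets every patient contribute graded information to the regression, which is the intuition behind replacing a binary case/control label with model probabilities.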

    Genetic overlap between diagnostic subtypes of ischemic stroke

    Background and Purpose: Despite moderate heritability, the phenotypic heterogeneity of ischemic stroke has hampered gene discovery, motivating analyses of diagnostic subtypes with reduced sample sizes. We assessed evidence for a shared genetic basis among the 3 major subtypes: large artery atherosclerosis (LAA), cardioembolism, and small vessel disease (SVD), to inform potential cross-subtype analyses. Methods: Analyses used genome-wide summary data for 12 389 ischemic stroke cases (including 2167 LAA, 2405 cardioembolism, and 1854 SVD) and 62 004 controls from the Metastroke consortium. For 4561 cases and 7094 controls, individual-level genotype data were also available. Genetic correlations between subtypes were estimated using linear mixed models and polygenic profile scores. Meta-analysis of a combined LAA-SVD phenotype (4021 cases and 51 976 controls) was performed to identify shared risk alleles. Results: High genetic correlation was identified between LAA and SVD using linear mixed models (rg=0.96, SE=0.47, P=9×10^-4) and profile scores (rg=0.72; 95% confid

    Efficient Replication of Over 180 Genetic Associations with Self-Reported Medical Data

    While the cost and speed of generating genomic data have come down dramatically in recent years, the slow pace of collecting medical data for large cohorts continues to hamper genetic research. Here we evaluate a novel online framework for amassing large amounts of medical information in a recontactable cohort by assessing our ability to replicate genetic associations using these data. Using web-based questionnaires, we gathered self-reported data on 50 medical phenotypes from a generally unselected cohort of over 20,000 genotyped individuals. Of a list of genetic associations curated by NHGRI, we successfully replicated about 75% of the associations that we expected to (based on the number of cases in our cohort and reported odds ratios, and excluding a set of associations with contradictory published evidence). Altogether we replicated over 180 previously reported associations, including many for type 2 diabetes, prostate cancer, cholesterol levels, and multiple sclerosis. We found significant variation across categories of conditions in the percentage of expected associations that we were able to replicate, which may reflect systematic inflation of the effects in some initial reports, or differences across diseases in the likelihood of misdiagnosis or misreport. We also demonstrated that we could improve replication success by taking advantage of our recontactable cohort, offering more in-depth questions to refine self-reported diagnoses. Our data suggest that online collection of self-reported data in a recontactable cohort may be a viable method for both broad and deep phenotyping in large populations.

    A genome-wide association study of outcome from traumatic brain injury

    Background: Factors such as age, pre-injury health, and injury severity account for less than 35% of outcome variability in traumatic brain injury (TBI). While some residual outcome variability may be attributable to genetic factors, published candidate gene association studies have often been underpowered and subject to publication bias. Methods: We performed the first genome- and transcriptome-wide association studies (GWAS, TWAS) of genetic effects on outcome in TBI. The study population consisted of 5268 patients from prospective European and US studies, who attended hospital within 24 h of TBI, and satisfied local protocols for computed tomography. Findings: The estimated heritability of TBI outcome was 0.26. GWAS revealed no genetic variants with genome-wide significance (p < 5 × 10^-8), but identified 83 variants in 13 independent loci which met a lower pre-specified sub-genomic statistical threshold (p < 10^-5). Similarly, none of the genes tested in TWAS met tissue-wide significance. An exploratory analysis of 75 published candidate variants associated with 28 genes revealed one replicable variant (rs1800450 in the MBL2 gene) which retained significance after correction for multiple comparison (p = 5.24 × 10^-4). Interpretation: While multiple novel loci reached less stringent thresholds, none achieved genome-wide significance. The overall heritability estimate, however, is consistent with the hypothesis that common genetic variation substantially contributes to inter-individual variability in TBI outcome. The meta-analytic approach to the GWAS and the availability of summary data allow for continuous extension with additional cohorts as data become available. Copyright (C) 2022 Published by Elsevier B.V. Peer reviewed
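The two thresholds quoted in this abstract (genome-wide significance at p < 5 × 10^-8 and a sub-genomic threshold at p < 10^-5) can be applied as a simple tiering rule. The sketch below is illustrative only: the function name and tier labels are assumptions, and it does not reproduce the study's analysis pipeline.

```python
GENOME_WIDE = 5e-8   # conventional genome-wide significance threshold
SUGGESTIVE = 1e-5    # lower, pre-specified sub-genomic threshold


def classify_p_value(p, genome_wide=GENOME_WIDE, suggestive=SUGGESTIVE):
    """Assign a GWAS p-value to a significance tier (strict inequalities)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p < genome_wide:
        return "genome-wide significant"
    if p < suggestive:
        return "suggestive"
    return "not significant"


if __name__ == "__main__":
    # Hypothetical p-values, chosen only to exercise each tier.
    for p in (3e-9, 2e-6, 1e-3):
        print(p, classify_p_value(p))
```

Note that a candidate-variant analysis, such as the 75-variant analysis described above, is judged against its own multiple-comparison correction rather than these genome-wide tiers.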

    Explaining additional genetic variation in complex traits

    Genome-wide association studies (GWAS) have provided valuable insights into the genetic basis of complex traits, discovering >6000 variants associated with >500 quantitative traits and common complex diseases in humans. The associations identified so far represent only a fraction of those that influence phenotype, because there are likely to be many variants across the entire frequency spectrum, each of which influences multiple traits, with only a small average contribution to the phenotypic variance. This presents a considerable challenge to further dissection of the remaining unexplained genetic variance within populations, which limits our ability to predict disease risk, identify new drug targets, improve and maintain food sources, and understand natural diversity. This challenge will be met within the current framework through larger sample size, better phenotyping, including recording of nongenetic risk factors, focused study designs, and an integration of multiple sources of phenotypic and genetic information. The current evidence supports the application of quantitative genetic approaches, and we argue that one should retain simpler theories until simplicity can be traded for greater explanatory power

    THE ROLE OF DNA METHYLATION IN WHITE MATTER HYPERINTENSITY BURDEN: AN INTEGRATIVE APPROACH

    Cerebral white matter hyperintensities (WMH) on MRI are an indicator of cerebral small vessel disease, a major risk factor for vascular dementia and stroke. DNA methylation may contribute to the molecular underpinnings of WMH, which are highly heritable. We performed a meta-analysis of 11 epigenome-wide association studies in 6,019 middle-aged to elderly subjects, who were free of dementia and stroke and were of African (AA) or European (EA) descent. In each study, association between WMH volume and each CpG was tested within ancestry using a linear mixed model, adjusted for age, sex, total intracranial volume, white blood cell count, technical covariates, BMI, smoking and blood pressure (BP). To detect differentially methylated regions (DMRs), we also calculated region-based p values accounting for spatial correlations among CpGs. No individual CpG reached epigenome-wide significance, but suggestive novel associations were identified with cg17577122 (CLDN5, P=2.39E-7), cg24202936 (LOC441601, P=3.78E-7), cg03116124 (TRIM67, P=6.55E-7), cg04245766 (BMP4, P=3.78E-7) and cg06809326 (CCDC144NL, P=6.14E-7). Gene enrichment analyses implicated pathways involved in regulation of cell development and differentiation, especially of endothelial cells. We identified 11 DMRs (PSidak<0.05) and two were mapped to BP-related genes (HIVEP3, TCEA2). The most significant DMRs were mapped to PRMT1, a protein arginine methyltransferase involved in glioblastomagenesis (P=7.9E-12), and mapped to CCDC144NL-AS1, an antisense transcript of CCDC144NL (P=1.6E-11). Genes mapping to DMRs were enriched in biological processes related to lipoprotein metabolism and transport. Bi-directional Mendelian randomization analysis showed that DNA methylation level at cg06809326 influenced WMH burden (OR [95% CI] = 1.7[1.2-2.5], P=0.001) but not the reverse (P=0.89). 
Additionally, increased methylation at cg06809326 was associated with lower expression of CCDC144NL (P=3.3E-2), and two-step Mendelian randomization analysis supported its mediating role in the association of cg06809326 and WMH burden. CCDC144NL is known to be associated with diabetic retinopathy, and coiled-coil proteins in general promote integrin-dependent cell adhesion. An integrin-related pathway was further supported by integrative genetic analyses. In conclusion, we identified novel epigenetic loci associated with WMH burden, and further supported the role of cg06809326 in WMH etiology, implicating integrin-mediated pathology.

    Genetic approaches to studying complex human disease

    Common, complex diseases such as cardiovascular disease (CVD) represent an intricate interaction between environmental and genetic factors and now account for the leading causes of mortality in western society. By investigating the genetic component of complex disease etiology, we have gained a better understanding of the biological pathways underlying complex disease and the heterogeneity of complex disease risk. However, the development of high throughput genomic technologies and large well-phenotyped multi-ethnic cohorts has opened the door towards more in-depth and trans-disciplinary approaches to studying the genetics of complex disease pathogenesis. Accordingly, we sought to investigate select complex traits and diseases using both established and novel genomic technologies, including candidate gene resequencing, high-throughput targeted microarray genotyping and candidate variant genotyping. We demonstrate that a private and common variant, p.G116S, within the low-density lipoprotein receptor (LDLR) gene among Inuit descendants has a large effect on plasma cholesterol; that variation in cardio-metabolic and Alzheimer disease (AD) loci is not associated with susceptibility to the pre-dementia phenotype known as “cognitive impairment, no dementia”; and that established type 2 diabetes (T2D) variants are not associated with T2D susceptibility among select aboriginal Canadian and Greenland cohorts. Together, these studies represent a selection of established and novel genomic strategies for the investigation of complex disease genetics which are likely to remain fundamental in the continued investigation of complex disease pathogenesis

    Genetics of Diabetes Subtypes. Characterization of novel cluster-based diabetes subtypes.

    BACKGROUND: Type 2 diabetes (T2D) has been reproducibly clustered into five subtypes based on six clinical variables: age at diabetes onset, body mass index (BMI), glutamic acid decarboxylase autoantibodies (GADA), glycated hemoglobin (HbA1c), and insulin secretion and resistance estimated as HOMA2B and HOMA2IR derived from fasting glucose and C-peptide. These subtypes have different disease progression and risk of complications. The newly defined subtypes are called Severe Autoimmune Diabetes (SAID), Severe Insulin Deficient Diabetes (SIDD), Severe Insulin-Resistant Diabetes (SIRD), Mild Obesity-related Diabetes (MOD), and Mild Age-Related Diabetes (MARD). AIM: The main aim of the thesis was to characterize the subtypes using genetics and biomarkers to investigate potential etiological differences, identify subtype-specific genetic associations and determine the underlying mechanisms of kidney complications in the subtypes. METHODS: The project included individuals with diabetes (cases) from the Swedish cohort All New Diabetics In Scania (ANDIS, n=10927) and the Finnish cohort Diabetes Registry Vasa (DIREVA, n=4754) as well as diabetes-free individuals (controls) from the Swedish Malmö Diet and Cancer cohort (MDC, n=2744) and the Finnish Botnia cohort (n=1683). Clusters defined in Ahlqvist et al., 2018, were used for all analyses. The numbers of individuals in the subtypes were as follows: SAID (n=452, n=327), SIDD (n=1193, n=394), SIRD (n=1130, n=453), MOD (n=1374, n=596) and MARD (n=2861, n=1178), in ANDIS and DIREVA respectively. In Paper I and III, genome-wide association studies (GWAS) and genetic risk score (GRS) analyses were performed to compare underlying genetic drivers in the Swedish cohorts and replicated in the Finnish cohorts. In Paper III, the primary phenotype was estimated glomerular filtration rate (eGFR) reflecting chronic kidney disease. 
In Paper II, epidemiological and genetic analysis was performed using clustering, Cox regression models and GRS to compare GADA negative individuals with diabetes of Iraqi (n=286) and Swedish origin (n=10641) with respect to new diabetes subclassification and complications. In Paper IV, the proteomic profiles of the subtypes were studied using 1161 biomarkers measured on Olink panels. Machine learning algorithms were applied to prioritize biomarkers, followed by Mendelian randomization. RESULTS: In Paper I, the HLA rs9273368 variant was significantly associated with SAID (OR=2.89, P=6.5×10^-40), and the TCF7L2 rs7903146 variant was significantly associated with SIDD (OR=1.56, P=8.6×10^-15), MOD (OR=1.40, P=3.1×10^-10) and MARD (OR=1.42, P=6.1×10^-16). The rs10824307 variant near the LRMDA gene was uniquely associated with MOD (OR=1.35, P=1.3×10^-09). GRS for fasting insulin showed a unique association with SIRD (OR=1.855, P=5.91×10^-09). GRSs for BMI were associated with SIDD, SIRD and MOD but not MARD (OR=1.046, P=0.099). Paper II concluded that individuals with diabetes from Iraq present with a more insulin-deficient subtype than native Swedes. They have a higher risk of coronary events but a lower risk of CKD. In Paper III, in ANDIS, eGFR was strongly associated with the A allele of rs77924615 in the well-established PDILT-UMOD locus (beta=0.126, p=6.61×10^-13) in all T2D, MARD, and SIDD but not in MOD or SIRD (p>0.05). In the SIRD subtype, eGFR was associated with the C allele of rs3770382 in the CTNNA2 gene at near genome-wide significance (beta=-0.219, p=5.5×10^-08), but was not associated in any of the other subtypes. In DIREVA, the PDILT-UMOD locus replicated in T2D, MARD, and SIDD, and was also associated in SIRD (beta=0.24, p=0.001) but not in MOD (beta=0.076, p=0.109). The CTNNA2 locus did not replicate in DIREVA. In Paper IV, the diabetes subtypes were shown to have different proteomic profiles and a list of prioritized biomarkers was generated for future follow-up. 
CONCLUSION: The newly defined subtypes are partially distinct with genetically different backgrounds, and SIRD is suggested to have a more beta-cell-independent pathogenesis. There is some suggestive support for different genetic backgrounds of DKD in diabetes subtypes. Biomarkers could be valuable for better discrimination of subtypes and cross-cohort comparisons in larger datasets. The diabetes subclassification approach paves the way for individualized patient management and the development of new therapeutic targets.
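A genetic risk score of the kind mentioned in this abstract is commonly computed as a weighted sum of risk-allele dosages, with each weight set to the log odds ratio of the variant. The sketch below assumes that common formulation; the thesis's exact GRS construction is not given here, and the variant identifiers, odds ratios, and dosages are hypothetical placeholders.

```python
import math


def genetic_risk_score(dosages, log_odds_weights):
    """Weighted sum of risk-allele dosages (0, 1, or 2 per variant).

    Both arguments are dicts keyed by variant identifier and must cover
    the same set of variants.
    """
    if dosages.keys() != log_odds_weights.keys():
        raise ValueError("dosages and weights must cover the same variants")
    return sum(log_odds_weights[v] * dosages[v] for v in log_odds_weights)


if __name__ == "__main__":
    # Placeholder odds ratios, not values taken from any study.
    weights = {"variant_a": math.log(1.5), "variant_b": math.log(1.2)}
    person = {"variant_a": 2, "variant_b": 1}
    print(genetic_risk_score(person, weights))
```

Scores computed this way can then be compared across subtypes, for example as the predictor in a logistic regression of subtype membership.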