58 research outputs found

    Development and Validation of ML-DQA -- a Machine Learning Data Quality Assurance Framework for Healthcare

    The approaches by which the machine learning and clinical research communities utilize real world data (RWD), including data captured in the electronic health record (EHR), vary dramatically. While clinical researchers cautiously use RWD for clinical investigations, ML for healthcare teams consume public datasets with minimal scrutiny to develop new algorithms. This study bridges this gap by developing and validating ML-DQA, a data quality assurance framework grounded in RWD best practices. The ML-DQA framework is applied to five ML projects across two geographies, different medical conditions, and different cohorts. A total of 2,999 quality checks and 24 quality reports were generated on RWD gathered on 247,536 patients across the five projects. Five generalizable practices emerge: all projects used a similar method to group redundant data element representations; all projects used automated utilities to build diagnosis and medication data elements; all projects used a common library of rules-based transformations; all projects used a unified approach to assign data quality checks to data elements; and all projects used a similar approach to clinical adjudication. An average of 5.8 individuals, including clinicians, data scientists, and trainees, were involved in implementing ML-DQA for each project and an average of 23.4 data elements per project were either transformed or removed in response to ML-DQA. This study demonstrates the important role of ML-DQA in healthcare projects and provides teams a framework to conduct these essential activities. Comment: Presented at the 2022 Machine Learning in Health Care Conference.

    Epigenome-wide association study of serum urate reveals insights into urate co-regulation and the SLC2A9 locus

    Elevated serum urate levels, a complex trait and major risk factor for incident gout, are correlated with cardiometabolic traits via incompletely understood mechanisms. DNA methylation in whole blood captures genetic and environmental influences and is assessed in transethnic meta-analysis of epigenome-wide association studies (EWAS) of serum urate (discovery, n = 12,474, replication, n = 5522). The 100 replicated, epigenome-wide significant (p < 1.1E–7) CpGs explain 11.6% of the serum urate variance. At SLC2A9, the serum urate locus with the largest effect in genome-wide association studies (GWAS), five CpGs are associated with SLC2A9 gene expression. Four CpGs at SLC2A9 have significant causal effects on serum urate levels and/or gout, and two of these partly mediate the effects of urate-associated GWAS variants. In other genes, including SLC7A11 and PHGDH, 17 urate-associated CpGs are associated with conditions defining metabolic syndrome, suggesting that these CpGs may represent a blood DNA methylation signature of cardiometabolic risk factors. This study demonstrates that EWAS can provide new insights into GWAS loci and the correlation of serum urate with other complex traits.

    Within-sibship genome-wide association analyses decrease bias in estimates of direct genetic effects

    Estimates from genome-wide association studies (GWAS) of unrelated individuals capture effects of inherited variation (direct effects), demography (population stratification, assortative mating) and relatives (indirect genetic effects). Family-based GWAS designs can control for demographic and indirect genetic effects, but large-scale family datasets have been lacking. We combined data from 178,086 siblings from 19 cohorts to generate population (between-family) and within-sibship (within-family) GWAS estimates for 25 phenotypes. Within-sibship GWAS estimates were smaller than population estimates for height, educational attainment, age at first birth, number of children, cognitive ability, depressive symptoms and smoking. Some differences were observed in downstream SNP heritability, genetic correlations and Mendelian randomization analyses. For example, the within-sibship genetic correlation between educational attainment and body mass index attenuated towards zero. In contrast, analyses of most molecular phenotypes (for example, low-density lipoprotein-cholesterol) were generally consistent. We also found within-sibship evidence of polygenic adaptation on taller height. Here, we illustrate the importance of family-based GWAS data for phenotypes influenced by demographic and indirect genetic effects.

    Meta-analyses identify DNA methylation associated with kidney function and damage

    Chronic kidney disease is a major public health burden. Elevated urinary albumin-to-creatinine ratio is a measure of kidney damage, and is used to diagnose and stage chronic kidney disease. To extend the knowledge on regulatory mechanisms related to kidney function and disease, we conducted a blood-based epigenome-wide association study for estimated glomerular filtration rate (n = 33,605) and urinary albumin-to-creatinine ratio (n = 15,068) and detected 69 and seven CpG sites where DNA methylation was associated with the respective trait. The majority of these findings showed directionally consistent associations with the respective clinical outcomes chronic kidney disease and moderately increased albuminuria. Associations of DNA methylation with kidney function, such as CpGs at JAZF1, PELI1 and CHD2, were validated in kidney tissue. Methylation at PHRF1, LDB2, CSRNP1 and IRF5 indicated causal effects on kidney function. Enrichment analyses revealed pathways related to hemostasis and blood cell migration for estimated glomerular filtration rate, and immune cell activation and response for urinary albumin-to-creatinine ratio-associated CpGs.

    Genome-wide association studies identify 137 genetic loci for DNA methylation biomarkers of aging

    Background: Biological aging estimators derived from DNA methylation data are heritable and correlate with morbidity and mortality. Consequently, identification of genetic and environmental contributors to the variation in these measures in populations has become a major goal in the field. Results: Leveraging DNA methylation and SNP data from more than 40,000 individuals, we identify 137 genome-wide significant loci, of which 113 are novel, from genome-wide association study (GWAS) meta-analyses of four epigenetic clocks and epigenetic surrogate markers for granulocyte proportions and plasminogen activator inhibitor 1 levels, respectively. We find evidence for shared genetic loci associated with the Horvath clock and expression of transcripts encoding genes linked to lipid metabolism and immune function. Notably, these loci are independent of those reported to regulate DNA methylation levels at constituent clock CpGs. A polygenic score for GrimAge acceleration showed strong associations with adiposity-related traits, educational attainment, parental longevity, and C-reactive protein levels. Conclusion: This study illuminates the genetic architecture underlying epigenetic aging and its shared genetic contributions with lifestyle factors and longevity. Peer reviewed.

    Finishing the euchromatic sequence of the human genome

    The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers ~99% of the euchromatic genome and is accurate to an error rate of ~1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000-25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.

    Genome organization and chromatin analysis identify transcriptional downregulation of insulin-like growth factor signaling as a hallmark of aging in developing B cells.

    BACKGROUND: Aging is characterized by loss of function of the adaptive immune system, but the underlying causes are poorly understood. To assess the molecular effects of aging on B cell development, we profiled gene expression and chromatin features genome-wide, including histone modifications and chromosome conformation, in bone marrow pro-B and pre-B cells from young and aged mice. RESULTS: Our analysis reveals that the expression levels of most genes are generally preserved in B cell precursors isolated from aged compared with young mice. Nonetheless, age-specific expression changes are observed at numerous genes, including microRNA encoding genes. Importantly, these changes are underpinned by multi-layered alterations in chromatin structure, including chromatin accessibility, histone modifications, long-range promoter interactions, and nuclear compartmentalization. Previous work has shown that differentiation is linked to changes in promoter-regulatory element interactions. We find that aging in B cell precursors is accompanied by rewiring of such interactions. We identify transcriptional downregulation of components of the insulin-like growth factor signaling pathway, in particular downregulation of Irs1 and upregulation of Let-7 microRNA expression, as a signature of the aged phenotype. These changes in expression are associated with specific alterations in H3K27me3 occupancy, suggesting that Polycomb-mediated repression plays a role in precursor B cell aging. CONCLUSIONS: Changes in chromatin and 3D genome organization play an important role in shaping the altered gene expression profile of aged precursor B cells. Components of the insulin-like growth factor signaling pathways are key targets of epigenetic regulation in aging in bone marrow B cell precursors.

    Epigenome-wide association study of serum urate reveals insights into urate co-regulation and the SLC2A9 locus

    Serum urate concentration can be studied in large datasets to find genetic and epigenetic loci that may be related to cardiometabolic traits. Here the authors identify and replicate 100 urate-associated CpGs, which provide insights into urate GWAS loci and shared CpGs of urate and cardiometabolic traits. Elevated serum urate levels, a complex trait and major risk factor for incident gout, are correlated with cardiometabolic traits via incompletely understood mechanisms. DNA methylation in whole blood captures genetic and environmental influences and is assessed in transethnic meta-analysis of epigenome-wide association studies (EWAS) of serum urate (discovery, n = 12,474, replication, n = 5522). The 100 replicated, epigenome-wide significant (p < 1.1E-7) CpGs explain 11.6% of the serum urate variance. At SLC2A9, the serum urate locus with the largest effect in genome-wide association studies (GWAS), five CpGs are associated with SLC2A9 gene expression. Four CpGs at SLC2A9 have significant causal effects on serum urate levels and/or gout, and two of these partly mediate the effects of urate-associated GWAS variants. In other genes, including SLC7A11 and PHGDH, 17 urate-associated CpGs are associated with conditions defining metabolic syndrome, suggesting that these CpGs may represent a blood DNA methylation signature of cardiometabolic risk factors. This study demonstrates that EWAS can provide new insights into GWAS loci and the correlation of serum urate with other complex traits.
