Initializing Services in Interactive ML Systems for Diverse Users
This paper studies ML systems that interactively learn from users across
multiple subpopulations with heterogeneous data distributions. The primary
objective is to provide specialized services for different user groups while
also predicting user preferences. Once the users select a service based on how
well the service anticipated their preference, the services subsequently adapt
and refine themselves based on the user data they accumulate, resulting in an
iterative, alternating minimization process between users and services
(learning dynamics). Employing such tailored approaches has two main
challenges: (i) Unknown user preferences: Typically, data on user preferences
are unavailable without interaction, and uniform data collection across a large
and diverse user base can be prohibitively expensive. (ii) Suboptimal local
solutions: The total loss (sum of loss functions across all users and all
services) landscape is not convex even if the individual losses on a single
service are convex, making it likely for the learning dynamics to get stuck in
local minima. The final outcome of the aforementioned learning dynamics is thus
strongly influenced by the initial set of services offered to users, and is not
guaranteed to be close to the globally optimal outcome. In this work, we
propose a randomized algorithm to adaptively select very few users to collect
preference data from, while simultaneously initializing a set of services. We
prove that under mild assumptions on the loss functions, the expected total
loss achieved by the algorithm right after initialization is within a factor of
the globally optimal total loss with complete user preference data, and this
factor scales only logarithmically in the number of services. Our theory is
complemented by experiments on real as well as semi-synthetic datasets.
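The learning dynamics and the role of initialization described in this abstract can be sketched in a few lines. This is only an illustrative instance, not the paper's algorithm, which the abstract does not specify. It assumes a squared-error loss, assumes each user's loss on the currently offered services can be observed, and swaps in a k-means++-style seeding rule. The names `losses`, `seed_services`, `learning_dynamics`, and `total_loss` are hypothetical.

```python
import numpy as np

def losses(users, services):
    """Squared-error loss of every user on every service, shape (n_users, n_services)."""
    return ((users[:, None, :] - services[None, :, :]) ** 2).sum(-1)

def total_loss(users, services):
    """Each user picks their best service; sum the resulting losses."""
    return float(losses(users, services).min(axis=1).sum())

def seed_services(users, k, rng):
    """Assumed seeding rule (k-means++ style): query one user per service,
    sampling users with probability proportional to their current loss."""
    n = users.shape[0]
    centers = [users[rng.integers(n)]]
    for _ in range(k - 1):
        best = losses(users, np.array(centers)).min(axis=1)
        total = best.sum()
        # Fall back to uniform sampling if every user is already served perfectly.
        probs = best / total if total > 0 else np.full(n, 1.0 / n)
        centers.append(users[rng.choice(n, p=probs)])
    return np.array(centers)

def learning_dynamics(users, services, steps=20):
    """Alternate: users choose their best service, then each service
    refits to the users it attracted (the mean, under squared loss)."""
    services = services.copy()
    for _ in range(steps):
        choice = losses(users, services).argmin(axis=1)
        for j in range(len(services)):
            members = users[choice == j]
            if len(members):
                services[j] = members.mean(axis=0)
    return services
```

Under these assumptions the alternating updates never increase the total loss, but where they end up depends on the starting services, which is why the initialization matters.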
Development and Validation of ML-DQA -- a Machine Learning Data Quality Assurance Framework for Healthcare
The approaches by which the machine learning and clinical research
communities utilize real world data (RWD), including data captured in the
electronic health record (EHR), vary dramatically. While clinical researchers
cautiously use RWD for clinical investigations, ML for healthcare teams consume
public datasets with minimal scrutiny to develop new algorithms. This study
bridges this gap by developing and validating ML-DQA, a data quality assurance
framework grounded in RWD best practices. The ML-DQA framework is applied to
five ML projects across two geographies, different medical conditions, and
different cohorts. A total of 2,999 quality checks and 24 quality reports were
generated on RWD gathered on 247,536 patients across the five projects. Five
generalizable practices emerge: all projects used a similar method to group
redundant data element representations; all projects used automated utilities
to build diagnosis and medication data elements; all projects used a common
library of rules-based transformations; all projects used a unified approach to
assign data quality checks to data elements; and all projects used a similar
approach to clinical adjudication. An average of 5.8 individuals, including
clinicians, data scientists, and trainees, were involved in implementing ML-DQA
for each project and an average of 23.4 data elements per project were either
transformed or removed in response to ML-DQA. This study demonstrates the
important role of ML-DQA in healthcare projects and provides teams a framework
to conduct these essential activities.
Comment: Presented at 2022 Machine Learning in Health Care Conference
Epigenome-wide association study of serum urate reveals insights into urate co-regulation and the SLC2A9 locus
Elevated serum urate levels, a complex trait and major risk factor for incident gout, are correlated with cardiometabolic traits via incompletely understood mechanisms. DNA methylation in whole blood captures genetic and environmental influences and is assessed in transethnic meta-analysis of epigenome-wide association studies (EWAS) of serum urate (discovery, n = 12,474, replication, n = 5522). The 100 replicated, epigenome-wide significant (p < 1.1E–7) CpGs explain 11.6% of the serum urate variance. At SLC2A9, the serum urate locus with the largest effect in genome-wide association studies (GWAS), five CpGs are associated with SLC2A9 gene expression. Four CpGs at SLC2A9 have significant causal effects on serum urate levels and/or gout, and two of these partly mediate the effects of urate-associated GWAS variants. In other genes, including SLC7A11 and PHGDH, 17 urate-associated CpGs are associated with conditions defining metabolic syndrome, suggesting that these CpGs may represent a blood DNA methylation signature of cardiometabolic risk factors. This study demonstrates that EWAS can provide new insights into GWAS loci and the correlation of serum urate with other complex traits.
Within-sibship genome-wide association analyses decrease bias in estimates of direct genetic effects
Estimates from genome-wide association studies (GWAS) of unrelated individuals capture effects of inherited variation (direct effects), demography (population stratification, assortative mating) and relatives (indirect genetic effects). Family-based GWAS designs can control for demographic and indirect genetic effects, but large-scale family datasets have been lacking. We combined data from 178,086 siblings from 19 cohorts to generate population (between-family) and within-sibship (within-family) GWAS estimates for 25 phenotypes. Within-sibship GWAS estimates were smaller than population estimates for height, educational attainment, age at first birth, number of children, cognitive ability, depressive symptoms and smoking. Some differences were observed in downstream SNP heritability, genetic correlations and Mendelian randomization analyses. For example, the within-sibship genetic correlation between educational attainment and body mass index attenuated towards zero. In contrast, analyses of most molecular phenotypes (for example, low-density lipoprotein-cholesterol) were generally consistent. We also found within-sibship evidence of polygenic adaptation on taller height. Here, we illustrate the importance of family-based GWAS data for phenotypes influenced by demographic and indirect genetic effects.
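The contrast between population and within-sibship estimates can be illustrated with a small simulation. This is a generic within-family (sibling-deviation) regression sketch, not the model used in the study, which the abstract does not describe. Genotypes are simulated as continuous values for simplicity, and the effect sizes and the names `simulate`, `population_estimate`, and `within_sibship_estimate` are assumptions made for illustration.

```python
import numpy as np

def simulate(n_families, rng, direct=0.2, family_effect=0.5):
    """Two siblings per family. The shared family component m influences both
    the genotype and the phenotype, standing in for demographic and
    indirect genetic effects (an assumed toy model)."""
    m = rng.normal(size=(n_families, 1))
    g = m + rng.normal(size=(n_families, 2))  # sibling genotypes (continuous, simplified)
    y = direct * g + family_effect * m + rng.normal(size=(n_families, 2))
    return g, y

def population_estimate(g, y):
    """Ordinary regression slope of phenotype on genotype, ignoring family structure."""
    x = g.ravel() - g.mean()
    z = y.ravel() - y.mean()
    return float((x * z).sum() / (x * x).sum())

def within_sibship_estimate(g, y):
    """Regression slope using deviations from each family's mean, so anything
    shared by siblings cancels out."""
    g_dev = g - g.mean(axis=1, keepdims=True)
    y_dev = y - y.mean(axis=1, keepdims=True)
    return float((g_dev * y_dev).sum() / (g_dev ** 2).sum())
```

In this toy setup the population slope absorbs the shared family component, while the within-sibship slope recovers the simulated direct effect. That matches the direction of the attenuation the abstract reports for traits such as educational attainment.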
Meta-analyses identify DNA methylation associated with kidney function and damage
Chronic kidney disease is a major public health burden. Elevated urinary albumin-to-creatinine ratio is a measure of kidney damage, and is used to diagnose and stage chronic kidney disease. To extend the knowledge on regulatory mechanisms related to kidney function and disease, we conducted a blood-based epigenome-wide association study for estimated glomerular filtration rate (n = 33,605) and urinary albumin-to-creatinine ratio (n = 15,068) and detected 69 and seven CpG sites where DNA methylation was associated with the respective trait. The majority of these findings showed directionally consistent associations with the respective clinical outcomes chronic kidney disease and moderately increased albuminuria. Associations of DNA methylation with kidney function, such as CpGs at JAZF1, PELI1 and CHD2, were validated in kidney tissue. Methylation at PHRF1, LDB2, CSRNP1 and IRF5 indicated causal effects on kidney function. Enrichment analyses revealed pathways related to hemostasis and blood cell migration for estimated glomerular filtration rate, and immune cell activation and response for urinary albumin-to-creatinine ratio-associated CpGs.
Genome-wide association studies identify 137 genetic loci for DNA methylation biomarkers of aging
BACKGROUND: Biological aging estimators derived from DNA methylation data are heritable and correlate with morbidity and mortality. Consequently, identification of genetic and environmental contributors to the variation in these measures in populations has become a major goal in the field. RESULTS: Leveraging DNA methylation and SNP data from more than 40,000 individuals, we identify 137 genome-wide significant loci, of which 113 are novel, from genome-wide association study (GWAS) meta-analyses of four epigenetic clocks and epigenetic surrogate markers for granulocyte proportions and plasminogen activator inhibitor 1 levels, respectively. We find evidence for shared genetic loci associated with the Horvath clock and expression of transcripts encoding genes linked to lipid metabolism and immune function. Notably, these loci are independent of those reported to regulate DNA methylation levels at constituent clock CpGs. A polygenic score for GrimAge acceleration showed strong associations with adiposity-related traits, educational attainment, parental longevity, and C-reactive protein levels. CONCLUSION: This study illuminates the genetic architecture underlying epigenetic aging and its shared genetic contributions with lifestyle factors and longevity. Peer reviewed
Finishing the euchromatic sequence of the human genome
The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers ∼99% of the euchromatic genome and is accurate to an error rate of ∼1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000-25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.
Genome organization and chromatin analysis identify transcriptional downregulation of insulin-like growth factor signaling as a hallmark of aging in developing B cells.
BACKGROUND: Aging is characterized by loss of function of the adaptive immune system, but the underlying causes are poorly understood. To assess the molecular effects of aging on B cell development, we profiled gene expression and chromatin features genome-wide, including histone modifications and chromosome conformation, in bone marrow pro-B and pre-B cells from young and aged mice. RESULTS: Our analysis reveals that the expression levels of most genes are generally preserved in B cell precursors isolated from aged compared with young mice. Nonetheless, age-specific expression changes are observed at numerous genes, including microRNA-encoding genes. Importantly, these changes are underpinned by multi-layered alterations in chromatin structure, including chromatin accessibility, histone modifications, long-range promoter interactions, and nuclear compartmentalization. Previous work has shown that differentiation is linked to changes in promoter-regulatory element interactions. We find that aging in B cell precursors is accompanied by rewiring of such interactions. We identify transcriptional downregulation of components of the insulin-like growth factor signaling pathway, in particular downregulation of Irs1 and upregulation of Let-7 microRNA expression, as a signature of the aged phenotype. These changes in expression are associated with specific alterations in H3K27me3 occupancy, suggesting that Polycomb-mediated repression plays a role in precursor B cell aging. CONCLUSIONS: Changes in chromatin and 3D genome organization play an important role in shaping the altered gene expression profile of aged precursor B cells. Components of the insulin-like growth factor signaling pathways are key targets of epigenetic regulation in aging in bone marrow B cell precursors.
Epigenome-wide association study of serum urate reveals insights into urate co-regulation and the SLC2A9 locus
Serum urate concentration can be studied in large datasets to find genetic and epigenetic loci that may be related to cardiometabolic traits. Here the authors identify and replicate 100 urate-associated CpGs, which provide insights into urate GWAS loci and shared CpGs of urate and cardiometabolic traits. Elevated serum urate levels, a complex trait and major risk factor for incident gout, are correlated with cardiometabolic traits via incompletely understood mechanisms. DNA methylation in whole blood captures genetic and environmental influences and is assessed in transethnic meta-analysis of epigenome-wide association studies (EWAS) of serum urate (discovery, n = 12,474, replication, n = 5522). The 100 replicated, epigenome-wide significant (p < 1.1E-7) CpGs explain 11.6% of the serum urate variance. At SLC2A9, the serum urate locus with the largest effect in genome-wide association studies (GWAS), five CpGs are associated with SLC2A9 gene expression. Four CpGs at SLC2A9 have significant causal effects on serum urate levels and/or gout, and two of these partly mediate the effects of urate-associated GWAS variants. In other genes, including SLC7A11 and PHGDH, 17 urate-associated CpGs are associated with conditions defining metabolic syndrome, suggesting that these CpGs may represent a blood DNA methylation signature of cardiometabolic risk factors. This study demonstrates that EWAS can provide new insights into GWAS loci and the correlation of serum urate with other complex traits.