4,364 research outputs found

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.

    Biologically informed risk scoring in schizophrenia based on genome-wide omics data

    Extensive efforts in characterizing the biological architecture of schizophrenia have moved psychiatric research closer towards clinical application. As our understanding of psychiatric illness is slowly shifting towards a conceptualization as dimensional constructs that cut across traditional diagnostic boundaries, opportunities for personalized medicine applications that are afforded by the application of advanced data science methods on the increasingly available, large-scale and multimodal data repositories are starting to be more broadly recognized. A particularly intriguing phenomenon is the discrepancy between the high heritability of schizophrenia and the difficulty in identifying predictive genetic signatures, for which polygenic risk scores of common variants that explain approximately 18% of illness-associated variance remain the gold standard. A substantial body of research points towards two lines of investigation that may lead to a significant advance, resolve at least in part the ‘missing heritability’ phenomenon, and potentially provide the basis for more predictive, personalized clinical tools. First, it is paramount to better understand the impact of environmental factors on illness risk and elucidate the biology underlying their impact on altered brain function in schizophrenia. This thesis aims to close a major gap in our understanding of the multivariate, epigenetic landscape associated with schizophrenia, its interaction with polygenic risk and its association with DLPFC-HC connectivity, a well-established and robust neural intermediate phenotype of schizophrenia. As a basis for this, we have developed a novel biologically-informed machine learning framework by incorporating systems-level biological domain knowledge, i.e., gene ontological pathways, entitled ‘BioMM’ using genome-wide DNA methylation data obtained from whole blood samples. 
An epigenetic poly-methylation score termed ‘PMS’ was estimated at the individual level using BioMM, trained and validated using a total of 2230 whole-blood samples and 244 post-mortem brain samples. The pathways contributing most to this PMS were strongly associated with synaptic, neural and immune system-related functions. The identified PMS could be successfully validated in two independent cohorts, demonstrating the robust generalizability of the identified model. Furthermore, the PMS could significantly differentiate patients with schizophrenia from healthy controls when predicted in DLPFC post-mortem brain samples, suggesting that the epigenetic landscape of schizophrenia is to a certain extent shared between the central and peripheral tissues. Importantly, the peripheral PMS was associated with an intermediate neuroimaging phenotype (i.e., DLPFC-HC functional connectivity) in two independent imaging samples under the working memory paradigm. However, we did not find sufficient evidence for a combined genetic and epigenetic effect on brain function by integrating PRS derived from GWAS data, which suggested that DLPFC-HC coupling was predominantly impacted by environmental risk components, rather than polygenic risk of common variants. Furthermore, the epigenetic signature was not associated with GWAS-derived risk scores, implying that the observed epigenetic effect likely did not depend on the underlying genetics, and this was further substantiated by investigation of data from unaffected first-degree relatives of patients with SCZ, BD, MDD and autism. In summary, the characterization of PMS through the systems-level integration of multimodal data elucidates the multivariate impact of epigenetic effects on schizophrenia-relevant brain function and its interdependence with genetic illness risk. 
Second, the limited predictive value of polygenic risk scores and the difficulty in identifying associations with heritable neural differences found in schizophrenia may arise because the functional consequences of genetic risk are modulated by spatio-temporal as well as sex-specific effects. To address this, this thesis identifies sex differences in the spatio-temporal expression trajectories during human development of genes that showed significant prefrontal co-expression with schizophrenia risk genes during the fetal phase and adolescence, consistent with a core developmental hypothesis of schizophrenia. More specifically, it was found that during these two time periods, prefrontal expression was significantly more variable in males compared to females, a finding that could be validated in an independent data source and that was specific for schizophrenia compared to other psychiatric as well as somatic illnesses. Similar to the epigenetic differences described above, the genes underlying the risk-associated gene expression differences were significantly linked to synaptic function. Notably, individual genes with male-specific variability increases were distinct between the fetal phase and adolescence, potentially suggesting different risk-associated mechanisms that converge on the shared synaptic involvement of these genes. These results provide substantial support to the hypothesis that the functional consequences of genetic risk show spatio-temporal specificity. Importantly, the temporal specificity was linked to the fetal phase and adolescence, time periods that are thought to be of predominant importance for the brain-functional consequences of environmental risk exposure. Therefore, the presented results provide the basis for future studies exploring the polygenic risk architecture and its interaction with environmental effects in a multivariate and spatio-temporally stratified manner. 
In summary, the work presented in this thesis describes multivariate, multimodal approaches to characterize the (epi-)genetic basis of schizophrenia, explores its association with a well-established neural intermediate phenotype of the illness and investigates the spatio-temporal specificity of schizophrenia-relevant gene expression effects. This work expands our knowledge of the complex biology underlying schizophrenia and provides the basis for the future development of more predictive biological algorithms that may aid in advancing personalized medicine in psychiatry.

    OmiEmbed: a unified multi-task deep learning framework for multi-omics data

    High-dimensional omics data contains intrinsic biomedical information that is crucial for personalised medicine. Nevertheless, it is challenging to capture this information from genome-wide data due to the large number of molecular features and the small number of available samples, a problem known as 'the curse of dimensionality' in machine learning. To tackle this problem and pave the way for machine learning aided precision medicine, we proposed a unified multi-task deep learning framework named OmiEmbed to capture biomedical information from high-dimensional omics data with the deep embedding and downstream task modules. The deep embedding module learnt an omics embedding that mapped multiple omics data types into a latent space with lower dimensionality. Based on the new representation of multi-omics data, different downstream task modules were trained simultaneously and efficiently with the multi-task strategy to predict the comprehensive phenotype profile of each sample. OmiEmbed supports multiple tasks for omics data including dimensionality reduction, tumour type classification, multi-omics integration, demographic and clinical feature reconstruction, and survival prediction. The framework outperformed other methods on all three types of downstream tasks and achieved better performance with the multi-task strategy compared to training them individually. OmiEmbed is a powerful and unified framework that can be widely adapted to various applications of high-dimensional omics data and has great potential to facilitate more accurate and personalised clinical decision making. Comment: 14 pages, 8 figures, 7 tables

    Systems Analytics and Integration of Big Omics Data

    A "genotype" is essentially an organism's full hereditary information which is obtained from its parents. A "phenotype" is an organism's actual observed physical and behavioral properties. These may include traits such as morphology, size, height, eye color, metabolism, etc. One of the pressing challenges in computational and systems biology is genotype-to-phenotype prediction. This is challenging given the amount of data generated by modern Omics technologies. This "Big Data" is so large and complex that traditional data processing applications are not up to the task. Challenges arise in collection, analysis, mining, sharing, transfer, visualization, archiving, and integration of these data. In this Special Issue, there is a focus on the systems-level analysis of Omics data, recent developments in gene ontology annotation, and advances in biological pathways and network biology. The integration of Omics data with clinical and biomedical data using machine learning is explored. This Special Issue covers new methodologies in the context of gene–environment interactions, tissue-specific gene expression, and how external factors or host genetics impact the microbiome.

    Association Analysis Using Set-Based Approaches in the Post-GWAS Era

    Genotyping arrays have greatly facilitated genetic epidemiological studies into genetic risk factors for numerous complex diseases such as psychiatric disorders. The use of genome-wide association analysis (GWAS) is unequivocally established. More recently, DNA methylation arrays have enabled genome-wide profiling of the methylome, in addition to contemporary genetic epidemiology study design. An example of one such study is the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) Lipidomics Study, which identified methylation markers (CpG markers) and single nucleotide polymorphisms (SNPs), associated with the change in triglyceride levels after drug intervention. Genotyping and methylation arrays assay several hundred thousand markers; however, single-marker association analysis suffers greatly from the burden of multiple testing. Set-based (SNP or CpG set) association approaches offer great flexibility, thus allowing the joint testing of a set of variants. For instance, a polygenic risk score (PRS) is a set-based approach, which, in addition to the strongly associated SNPs identified by large-scale GWAS, recruits SNPs with moderate to weak effects. The genotype information of the SNP set in the PRS is taken from an independent sample (target sample) and is then weighted by individual SNP effects derived from a relevant GWAS performed on a separate sample (discovery sample) into a cumulative score for each individual in the target sample. The resulting score, based on a SNP set or the PRS, is then regressed on the target phenotype. Such a regression model is evaluated by the amount of variance explained (R2) by the PRS in the target phenotype. Another strategy of set-based association analysis is kernel machine regression (KMR): a semi-parametric regression approach, in which the effects of markers within a set (CpG set or SNP set) are modelled via a kernel function and thus evaluated by a single-component variance test. 
A kernel function computes pairwise genomic similarity between the individuals, that is, the inner product of the set of variants under analysis, which may comprise a gene or a biological pathway. For my first article, I performed a simulation study to evaluate the performance of PRS in correlated discovery and target traits by considering various sample sizes of the target sample, namely n=200, 500, and 1000. The PRS for correlated traits can be viewed as a situation of calculating schizophrenia-PRS for psychosocial endophenotypes such as global assessment functioning (GAF) score or positive and negative syndrome scale (PANSS) score. Considering such a situation, I simulated four correlated target traits that had varying degrees of correlation (r2) with the discovery trait, i.e., r2= 1.00, 0.8, 0.6, and 0.4. The results demonstrated that the average R2 estimates by the PRS roughly decreased by the square of the correlation between the target traits. In addition, the range of estimated R2 was most inflated at the smallest target sample size, n=200. Thus, the simulation findings alert researchers conducting clinical studies with endophenotypes to two important factors: first, the sample size of the target trait and, second, the amount of genetic correlation shared between the target and discovery traits. In my second article, I implemented a KMR approach for set-based association testing of a CpG set. KMR has been successfully employed on SNP sets. In preparation of the second article, I used real and simulated datasets (based on a real dataset) provided by the Genetic Analysis Workshop 20 (GAW20) from the GOLDN study. GOLDN is a longitudinal study with individuals recruited from pedigrees. In my analysis, I only used independent individuals, which restricted the sample size in the real and simulated datasets to n<200. CpG sets were devised using the evidence of association reported by the GOLDN study in the real data set. 
For simulated datasets, true causal CpGs were provided by GAW20. Thus, I formulated candidate genomic regions of varying lengths while keeping the associated CpG(s) inside the region. The results replicated the evidence of association reported by GOLDN in the real data and, albeit nominally, in the simulated datasets. Moreover, in the simulated data, causal SNPs exerted their full effect on the phenotypes when the causal CpG loci had no methylation (B-value=0). Thus, I also considered modelling an interaction term along with the main effects. The results yielded significant association. As part of the discussion, simulation results on the performance of the linear kernel for a CpG set with original (B-values) and logit-transformed methylation values (M-values) indicated that logit transformation results in a loss of power. There, I also considered analysing an additive kernel that combines the genotype kernel and the methylation kernel and then tests for association with the phenotype. The initial simulations suggest that an additive kernel with a CpG set including hypo-, semi-, and hypermethylated sites simultaneously might not improve the model over only including a SNP set. However, it appears fruitful to investigate further the situation in which only one type of methylation state is present in a CpG set.
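The set-based quantities named in this abstract can be illustrated with a minimal sketch. It is not the author's code, and it assumes NumPy, additive 0/1/2 allele coding, and hypothetical function names. It shows a PRS as an effect-weighted sum of allele counts, R2 as the squared correlation from a simple linear regression of the phenotype on the score, a linear kernel as pairwise inner products, and the common logit (base 2) conversion from methylation proportions to M-values. The clipping constant `eps` is an assumption added to avoid division by zero at fully unmethylated or fully methylated sites.

```python
import numpy as np


def polygenic_score(genotypes, effect_sizes):
    """Effect-weighted sum of allele counts: one cumulative score per individual.

    genotypes: (n_individuals, n_snps) allele counts in the target sample.
    effect_sizes: (n_snps,) per-SNP effects from a separate discovery GWAS.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    effect_sizes = np.asarray(effect_sizes, dtype=float)
    return genotypes @ effect_sizes


def variance_explained(score, phenotype):
    """R2 of a simple linear regression of the phenotype on the score.

    With one predictor and an intercept, R2 equals the squared Pearson correlation.
    """
    score = np.asarray(score, dtype=float)
    phenotype = np.asarray(phenotype, dtype=float)
    r = np.corrcoef(score, phenotype)[0, 1]
    return r ** 2


def linear_kernel(markers):
    """Pairwise similarity between individuals as inner products of marker vectors."""
    markers = np.asarray(markers, dtype=float)
    return markers @ markers.T


def beta_to_m(beta, eps=1e-6):
    """Logit (base 2) transform of methylation proportions into M-values.

    eps is an illustrative clipping constant (an assumption, not from the source).
    """
    beta = np.clip(np.asarray(beta, dtype=float), eps, 1.0 - eps)
    return np.log2(beta / (1.0 - beta))


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Simulated toy data: 200 individuals, 50 SNPs coded as 0/1/2 allele counts.
    genotypes = rng.integers(0, 3, size=(200, 50))
    effects = rng.normal(0.0, 0.1, size=50)

    prs = polygenic_score(genotypes, effects)
    phenotype = prs + rng.normal(0.0, 1.0, size=200)  # score plus noise
    print("R2 explained by the PRS:", variance_explained(prs, phenotype))

    # Simulated methylation proportions for a CpG set of 10 sites.
    betas = rng.uniform(0.0, 1.0, size=(200, 10))
    print("kernel shape (raw proportions):", linear_kernel(betas).shape)
    print("kernel shape (M-values):", linear_kernel(beta_to_m(betas)).shape)
```

The kernel is an n-by-n matrix regardless of how many markers are in the set, which is why a whole SNP or CpG set can be evaluated with a single test instead of one test per marker.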

    The Pharmacoepigenomics Informatics Pipeline and H-GREEN Hi-C Compiler: Discovering Pharmacogenomic Variants and Pathways with the Epigenome and Spatial Genome

    Over the last decade, biomedical science has been transformed by the epigenome and spatial genome, but the discipline of pharmacogenomics, the study of the genetic underpinnings of pharmacological phenotypes like drug response and adverse events, has not. Scientists have begun to use omics atlases of increasing depth, and inferences relating to the bidirectional causal relationship between the spatial epigenome and gene expression, as a foundational underpinning for genetics research. The epigenome and spatial genome are increasingly used to discover causative regulatory variants in the significance regions of genome-wide association studies, for the discovery of the biological mechanisms underlying these phenotypes and the design of genetic tests to predict them. Such variants often have more predictive power than coding variants, but in the area of pharmacogenomics, such advances have been radically underapplied. The majority of pharmacogenomics tests are designed manually on the basis of mechanistic work with coding variants in candidate genes, and where genome wide approaches are used, they are typically not interpreted with the epigenome. This work describes a series of analyses of pharmacogenomics association studies with the tools and datasets of the epigenome and spatial genome, undertaken with the intent of discovering causative regulatory variants to enable new genetic tests. It describes the potent regulatory variants discovered thereby to have a putative causative and predictive role in a number of medically important phenotypes, including analgesia and the treatment of depression, bipolar disorder, and traumatic brain injury with opiates, anxiolytics, antidepressants, lithium, and valproate, and in particular the tendency for such variants to cluster into spatially interacting, conceptually unified pathways which offer mechanistic insight into these phenotypes. 
It describes the Pharmacoepigenomics Informatics Pipeline (PIP), an integrative multiple omics variant discovery pipeline designed to make this kind of analysis easier and cheaper to perform, more reproducible, and amenable to the addition of advanced features. It describes the successes of the PIP in rediscovering manually discovered gene networks for lithium response, as well as discovering a previously unknown genetic basis for warfarin response in anticoagulation therapy. It describes the H-GREEN Hi-C compiler, which was designed to analyze spatial genome data and discover the distant target genes of such regulatory variants, and its success in discovering spatial contacts not detectable by preceding methods and using them to build spatial contact networks that unite disparate TADs with phenotypic relationships. It describes a potential feature set of a future pipeline, using the latest epigenome research and the lessons of the previous pipeline. It describes my thinking about how to use the output of a multiple omics variant pipeline to design genetic tests that also incorporate clinical data. And it concludes by describing a long-term vision for a comprehensive pharmacophenomic atlas, to be constructed by applying a variant pipeline and machine learning test design system, such as is described, to thousands of phenotypes in parallel. Scientists struggled to assay genotypes for the better part of a century, and in the last twenty years, succeeded. The struggle to predict phenotypes on the basis of the genotypes we assay remains ongoing. The use of multiple omics variant pipelines and machine learning models with omics atlases, genetic association, and medical records data will be an increasingly significant part of that struggle for the foreseeable future. PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/145835/1/ariallyn_1.pd