    Deep Learning Models for Predicting Phenotypic Traits and Diseases from Omics Data

    Computational analysis of high-throughput omics data, such as gene expression, copy number alterations, and DNA methylation (DNAm), has become popular in disease studies in recent decades because such analyses can help predict whether a patient has a certain disease or one of its subtypes. However, because these data sets are high-dimensional, with hundreds of thousands of variables and very small sample sizes, traditional machine learning approaches, such as support vector machines (SVMs) and random forests, are limited in how efficiently they can analyze them. In this chapter, we review progress in applying deep learning algorithms to such biological questions, with a focus on available software tools and public data sources. In particular, we present case studies using deep neural network (DNN) models to classify molecular subtypes of breast cancer, and DNN-based regression models that use DNAm profiles from peripheral blood samples to account for interindividual variation in triglyceride concentrations measured at different visits. We show that integrating multi-omics profiles into DNN-based learning methods can improve prediction of the molecular subtypes of breast cancer, and we demonstrate the superiority of our proposed DNN models over an SVM model for predicting triglyceride concentrations.
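    The chapter's own architectures are not reproduced here, but a minimal sketch of a feed-forward DNN classifier for high-dimensional omics features, written with Keras, illustrates the kind of model involved. All layer sizes, variable names, and the random data below are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch (not the chapter's published architecture): a small feed-forward
# DNN for classifying molecular subtypes from a high-dimensional omics matrix.
# Assumes X is a samples-by-features array (e.g., concatenated expression and
# methylation features) and y holds integer subtype labels; names are illustrative.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_subtype_dnn(n_features: int, n_subtypes: int) -> keras.Model:
    model = keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),              # heavy dropout: many features, few samples
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(n_subtypes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Illustrative usage with random data standing in for real omics profiles.
X = np.random.rand(100, 5000).astype("float32")   # 100 samples, 5,000 features
y = np.random.randint(0, 4, size=100)             # 4 hypothetical subtype labels
model = build_subtype_dnn(X.shape[1], 4)
model.fit(X, y, epochs=5, batch_size=16, validation_split=0.2, verbose=0)
```

    Swapping the final layer for a single linear unit and the loss for mean squared error would give the corresponding regression setup used for a continuous trait such as triglyceride concentration.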

    Smart breeding driven by big data, artificial intelligence, and integrated genomic-enviromic prediction

    The first paradigm of plant breeding involved direct selection based on phenotypic observation, followed by predictive breeding that used statistical models for quantitative traits constructed from genetic experimental designs and, more recently, incorporated molecular marker genotypes. However, plant performance, or phenotype (P), is determined by the combined effects of genotype (G), envirotype (E), and genotype-by-environment interaction (GEI). Phenotypes can be predicted more precisely by training a model on data collected from multiple sources, including spatiotemporal omics (genomics, phenomics, and enviromics across time and space). Integration of these three information profiles (G-P-E), each itself multidimensional, offers predictive breeding both tremendous opportunities and great challenges. Here, we first review innovative technologies for predictive breeding. We then evaluate the multidimensional information profiles that can be integrated into a predictive breeding strategy, particularly envirotypic data, which have largely been neglected in data collection and remain nearly untouched in model construction. We propose a smart breeding scheme, integrated genomic-enviromic prediction (iGEP), as an extension of genomic prediction that uses integrated multi-omics information, big data technology, and artificial intelligence (mainly machine and deep learning). We discuss how to implement iGEP, covering spatiotemporal models, environmental indices, the factorial and spatiotemporal structure of plant breeding data, and cross-species prediction. We then propose a strategy for prediction-based crop redesign at both the macro (individual, population, and species) and micro (gene, metabolism, and network) scales. Finally, we provide perspectives on translating smart breeding into genetic gain through integrative breeding platforms and open-source breeding initiatives, and we call for coordinated efforts in smart breeding through iGEP, institutional partnerships, and innovative technological support.
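    As a rough illustration of what genomic-enviromic prediction means computationally, the sketch below fits a simple ridge-regression baseline to concatenated marker genotypes (G) and envirotypic covariates (E). It is not the iGEP framework itself, and every array name, dimension, and value is a placeholder assumption.

```python
# Minimal sketch of the idea behind genomic-enviromic prediction, not the iGEP
# framework: predict phenotype from marker genotypes (G) plus environmental
# covariates (E) with a ridge-regression baseline on simulated placeholder data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(200, 1000)).astype(float)  # 200 lines x 1,000 SNPs coded 0/1/2
E = rng.normal(size=(200, 12))                          # 12 envirotypic covariates per line
y = rng.normal(size=200)                                # observed phenotype (toy values)

X = np.hstack([G, E])            # naive integration: concatenate G and E features
model = Ridge(alpha=10.0)        # shrinkage handles p >> n marker data
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("mean cross-validated R^2:", scores.mean())
```

    A full iGEP-style model would additionally encode GEI, for example through explicit G x E interaction features or kernel methods, rather than the plain concatenation used in this baseline.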

    Data mining and machine learning approaches for the integration of genome-wide association and methylation data: methodology and main conclusions from GAW20

    Background: Multiple layers of genetic and epigenetic variability are being explored simultaneously in an increasing number of health studies. Here we summarize the different approaches applied in the Data Mining and Machine Learning group at GAW20 to integrate genome-wide genotype and methylation array data. Results: We provide a non-intimidating introduction to some frequently used methods for investigating high-dimensional molecular data and compare the approaches tried by group members: random forest, deep learning, cluster analysis, mixed models, and gene-set enrichment analysis. Group contributions were quite heterogeneous regarding the data sets investigated (real vs. simulated), the data quality control conducted, and the phenotypes assessed (e.g., metabolic syndrome vs. relative differences of log-transformed triglyceride concentrations before and after fenofibrate treatment). However, some common technical issues were detected, leading to practical recommendations. Conclusions: Group members identified different sources of correlation, including population stratification, family structure, batch effects, linkage disequilibrium, and correlation of methylation values at neighboring cytosine-phosphate-guanine (CpG) sites, and the majority of the applied approaches were able to take these correlation structures into account. The ability to handle high-dimensional omics data efficiently, and the model-free nature of approaches that did not require detailed model specification, were recognized as the main strengths of the applied methods. A limitation of random forest is its sensitivity to highly correlated variables. Parameter setup and interpretation of results from deep learning methods, in particular deep neural networks, can be extremely challenging. Cluster analysis and mixed models may require prior dimension reduction based on existing literature, data filtering, and supplementary statistical methods, and gene-set enrichment analysis requires biological insight.
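    To make one of the compared approaches concrete, the following is a minimal, hypothetical random-forest example on simulated methylation features, with permutation importance used as a rough variable ranking. It is not GAW20 code or data; all sizes, names, and the toy phenotype are assumptions.

```python
# Minimal sketch of a random forest on methylation-like features, with
# permutation importance on held-out data; simulated placeholders throughout.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(150, 2000))                      # 150 subjects x 2,000 CpG beta values
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=150)   # toy continuous phenotype

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
rf = RandomForestRegressor(n_estimators=500, random_state=1, n_jobs=-1)
rf.fit(X_tr, y_tr)

# Importance estimates get diluted across highly correlated CpGs, which is the
# random-forest limitation noted in the abstract.
imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=1)
top = np.argsort(imp.importances_mean)[::-1][:10]
print("top-ranked CpG indices:", top)
```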

    Integration of evidence across human and model organism studies: A meeting report.

    The National Institute on Drug Abuse and the Joint Institute for Biological Sciences at Oak Ridge National Laboratory hosted a meeting attended by a diverse group of scientists with expertise in substance use disorders (SUDs), computational biology, and FAIR (Findability, Accessibility, Interoperability, and Reusability) data sharing. The meeting's objective was to discuss and evaluate better strategies for integrating genetic, epigenetic, and 'omics data across humans and model organisms to achieve deeper mechanistic insight into SUDs. The specific topics were to (a) evaluate the current state of substance use genetics and genomics research and its fundamental gaps, (b) identify opportunities and challenges of integration and sharing across species and data types, (c) identify current tools and resources for integrating genetic, epigenetic, and phenotypic data, (d) discuss steps and impediments related to data integration, and (e) outline future steps to support more effective collaboration, particularly between animal model research communities and human genetics and clinical research teams. This review summarizes key facets of this catalytic discussion, with a focus on new opportunities and gaps in resources and knowledge on SUDs.

    Survival-Related Clustering of Cancer Patients by Integrating Clinical and Biological Datasets

    Subtype-based treatments and drug therapies are essential considerations in cancer patients' clinical trials for providing appropriate personalized therapies. With the advancement of next-generation sequencing technology, several computational models that integrate genomic and transcriptomic datasets (i.e., multi-omics) to predict subtype-based classifications in cancer patients have emerged. However, integration of prognostic features from clinical data, related to survival risks, with multi-omics datasets for predicting these subtypes remains limited and is an important research area to explore. In this study, we propose a data integration pipeline that combines prognostic features from clinical data with multi-omics datasets to predict survival-risk-based subtypes in Kidney Renal Clear Cell Carcinoma (KIRC) patients from The Cancer Genome Atlas (TCGA) database. First, we applied an unsupervised clustering algorithm to KIRC patients and clustered them into two survival-risk-based subgroups, i.e., subtypes. Then, using the clustering-based subtype labels as class labels for the cancer patients, we trained a supervised classification model to determine the class labels of unlabeled patients. In the clustering step, we applied a multivariate Cox Proportional Hazard (Cox-PH) model to select the prognostically significant, survival-related features (p-value < 0.05) from the patients' multivariate clinical data, and we used the Silhouette Coefficient to determine the optimal number (k) of clusters. In the classification step, we integrated high-dimensional multi-omics datasets covering three data modalities (gene expression, microRNA expression, and DNA methylation). We applied a dimension-reduction approach, followed by a univariate Cox-PH model on each reduced data modality against patients' survival status, and selected the survival-related reduced omics features for our classification model. We then applied a supervised classification method with 10-fold cross-validation to assess our survival-based subtype prediction accuracy. We tested multiple machine learning and deep learning algorithms in different steps of the pipeline for clustering (K-means, K-modes, and Gaussian mixture model), dimension reduction (Denoising Autoencoder and Principal Component Analysis), and classification (Support Vector Machine and Random Forest). We propose an optimized model with the highest survival-specific subtype classification accuracy as the final model.
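    The authors' code is not reproduced here, but the overall shape of the pipeline can be sketched with lifelines and scikit-learn as below. The column names, dimensions, and simulated data are illustrative assumptions, and the published workflow includes additional steps (e.g., Silhouette-based choice of k and autoencoder-based reduction) that are omitted for brevity.

```python
# Condensed sketch of the pipeline's structure (not the authors' released code):
# (1) Cox-PH screening of clinical features, (2) K-means to derive survival-risk
# subtypes, (3) PCA to reduce one omics block, (4) random forest with 10-fold CV
# to predict the cluster labels. All names, sizes, and data are placeholders.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 300
clinical = pd.DataFrame({
    "age": rng.normal(60, 10, n),
    "stage": rng.integers(1, 5, n).astype(float),
    "time": rng.exponential(1000, n),      # follow-up time (days)
    "event": rng.integers(0, 2, n),        # 1 = death observed
})

# (1) Multivariate Cox-PH: keep clinical covariates with p < 0.05.
cph = CoxPHFitter().fit(clinical, duration_col="time", event_col="event")
keep = cph.summary.index[cph.summary["p"] < 0.05].tolist()

# (2) Cluster patients on the retained features into k = 2 survival-risk subtypes.
features = clinical[keep] if keep else clinical[["age", "stage"]]
labels = KMeans(n_clusters=2, n_init=10, random_state=2).fit_predict(features)

# (3) Reduce a high-dimensional omics block (e.g., expression) with PCA.
omics = rng.normal(size=(n, 5000))
Z = PCA(n_components=50, random_state=2).fit_transform(omics)

# (4) Predict the cluster-derived subtype labels from reduced omics, 10-fold CV.
acc = cross_val_score(RandomForestClassifier(n_estimators=300, random_state=2),
                      Z, labels, cv=10, scoring="accuracy")
print("mean 10-fold accuracy:", acc.mean())
```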

    Frontiers in the Solicitation of Machine Learning Approaches in Vegetable Science Research

    Along with essential nutrients and trace elements, vegetables provide raw materials for the food processing industry. Despite this, plant diseases and unfavorable weather patterns continue to threaten the delicate balance between vegetable production and consumption. Machine learning (ML) is critical in this setting because it provides context for decision-making related to breeding goals. Cutting-edge technologies for crop genome sequencing and phenotyping, combined with advances in computer science, are currently fueling a revolution in vegetable science and technology. ML techniques such as prediction, classification, and clustering are frequently used to forecast vegetable crop production in the field. In the vegetable seed industry, ML algorithms are used to assess seed quality before germination and have the potential to significantly improve the production of vegetables with desired traits, whereas in plant disease detection and management, ML approaches can improve decision-support systems that convert massive amounts of data into valuable recommendations. Along similar lines, in vegetable breeding, ML approaches are helpful for predicting treatment outcomes, such as what will happen if a gene is silenced. Furthermore, ML approaches can help compensate for the insufficient coverage and noisy data generated by various omics platforms. This article examines ML models in vegetable science, encompassing breeding, biotechnology, and genome sequencing.