3,082 research outputs found

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability.
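
    The five challenges above are only named, not illustrated; as a minimal sketch (not taken from the review) of how two of them, missing data and class imbalance, are commonly handled when a classifier is fit on a concatenated multi-omics matrix, the scikit-learn pipeline below imputes missing entries and reweights classes. All data and names (X_multiomics, y_outcome) are synthetic placeholders.

```python
# Minimal sketch: imputation + class-weighted classifier on a concatenated
# multi-omics feature matrix. All data and names are synthetic placeholders.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_multiomics = rng.normal(size=(200, 500))                    # e.g., genome + transcriptome features
X_multiomics[rng.random(X_multiomics.shape) < 0.05] = np.nan  # simulate missing values
y_outcome = rng.integers(0, 2, size=200)                      # labels, imbalanced in practice

clf = Pipeline([
    ("impute", SimpleImputer(strategy="median")),             # fill missing entries
    ("scale", StandardScaler()),
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
print(cross_val_score(clf, X_multiomics, y_outcome, cv=5).mean())
```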

    Autoencoded DNA methylation data to predict breast cancer recurrence: Machine learning models and gene-weight significance

    Breast cancer is the most frequent cancer in women and the second most frequent overall after lung cancer. Although the 5-year survival rate of breast cancer is relatively high, recurrence is also common and often involves metastasis, with its consequent threat to patients. DNA methylation-derived databases have become an interesting primary source for supervised knowledge extraction regarding breast cancer. Unfortunately, the study of DNA methylation involves processing hundreds of thousands of features for every patient. DNA methylation data are characterized by a high-dimension, low-sample-size setting, which has well-known issues regarding feature selection and generation. Autoencoders (AEs) are a specific technique for conducting nonlinear feature fusion. Our main objective in this work is to design a procedure to summarize DNA methylation by taking advantage of AEs. Our proposal is able to generate new features from the values of CpG sites of patients with and without recurrence. Then, a limited set of relevant genes to characterize breast cancer recurrence is proposed by applying survival analysis and a weighted ranking of genes according to the distribution of their CpG sites. To test our proposal we selected a dataset from The Cancer Genome Atlas data portal and an AE with a single hidden layer. The literature and enrichment analysis (based on genomic context and functional annotation) conducted on the genes obtained in our experiment confirmed that all of these genes are related to breast cancer recurrence. Ministerio de Economía y Competitividad TIN2014-55894-C2-R; Ministerio de Economía y Competitividad TIN2017-88209-C2-2-
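
    As a rough illustration of the kind of nonlinear feature fusion described above (a single-hidden-layer AE compressing CpG values into a smaller set of features), here is a minimal Keras sketch. The layer sizes, data, and variable names are assumptions for illustration, not the authors' configuration.

```python
# Sketch: single-hidden-layer autoencoder over CpG beta values.
# Dimensions and names (n_cpg, n_latent, beta_values) are illustrative only.
import numpy as np
from tensorflow.keras import layers, Model

n_cpg, n_latent = 10000, 100                     # CpG features -> fused features
inputs = layers.Input(shape=(n_cpg,))
encoded = layers.Dense(n_latent, activation="relu")(inputs)   # hidden layer = new features
decoded = layers.Dense(n_cpg, activation="sigmoid")(encoded)  # beta values lie in [0, 1]

autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)
autoencoder.compile(optimizer="adam", loss="mse")

beta_values = np.random.rand(64, n_cpg)          # placeholder methylation matrix (patients x CpGs)
autoencoder.fit(beta_values, beta_values, epochs=5, batch_size=16, verbose=0)
fused_features = encoder.predict(beta_values)    # compressed representation per patient
```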

    Cancer risk prediction with whole exome sequencing and machine learning

    Accurate cancer risk and survival time prediction are important problems in personalized medicine, where disease diagnosis and prognosis are tuned to individuals based on their genetic material. Cancer risk prediction supports an informed decision about regular screening, which helps to detect disease at an early stage and therefore increases the probability of successful treatment. Cancer risk prediction is a challenging problem: lifestyle, environment, family history, and genetic predisposition are some of the factors that influence disease onset. Cancer risk prediction based on predisposing genetic variants has been studied extensively. Most studies have examined the predictive ability of variants in known mutated genes for specific cancers. However, previous studies have not explored the predictive ability of collective genomic variants from whole-exome sequencing data. It is crucial to train a model in one study and predict on another related, independent study to ensure that the predictive model generalizes to other datasets. Survival time prediction allows patients and physicians to evaluate treatment feasibility and helps chart treatment plans. Many studies have concluded that clinicians are inaccurate and often optimistic in predicting patients' survival time; therefore, there is a growing need for automated survival time prediction from genomic and medical imaging data. For cancer risk prediction, this dissertation explores how ranking genomic variants in whole-exome sequencing data with univariate feature selection methods affects the predictive capability of machine learning classifiers. Cross-study experiments in chronic lymphocytic leukemia, glioma, and kidney cancers show that the top-ranked variants achieve better accuracy than the full set of genomic variants. For survival time prediction, many studies have devised 3D convolutional neural networks (CNNs) on structural magnetic resonance imaging (MRI) volumes to more accurately classify glioma patients into survival categories. This dissertation proposes a new multi-path convolutional neural network with SNP and demographic features to predict glioblastoma survival groups with a one-year threshold, improving upon existing machine learning methods. The dissertation also proposes a multi-path neural network system to predict glioblastoma survival categories with a 14-year threshold from a heterogeneous combination of genomic variations, messenger ribonucleic acid (mRNA) expressions, 3D post-contrast T1 MRI volumes, and 2D post-contrast T1 MRI scans that show the malignancy. In 10-fold cross-validation, the mean accuracy of the proposed network with handpicked 2D MRI slices (that manifest the tumor), mRNA expressions, and SNPs slightly improves upon each data source individually.
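
    As a schematic of the ranking-and-classification idea described above (not the dissertation's actual pipeline), the scikit-learn sketch below ranks variant features with a univariate test, keeps only the top-ranked ones, and evaluates by training on one study and testing on an independent one. All data, feature counts, and names are synthetic assumptions.

```python
# Sketch: univariate ranking of genomic-variant features followed by a
# classifier, trained on one study and tested on an independent study.
# Data, feature counts, and names are synthetic placeholders.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(1)
X_study_a = rng.integers(0, 3, size=(150, 5000))   # variant genotypes (0/1/2), study A
y_study_a = rng.integers(0, 2, size=150)           # case/control labels, study A
X_study_b = rng.integers(0, 3, size=(80, 5000))    # independent study B
y_study_b = rng.integers(0, 2, size=80)

model = Pipeline([
    ("rank", SelectKBest(chi2, k=200)),            # keep the top-ranked variants
    ("clf", RandomForestClassifier(n_estimators=300, random_state=0)),
])
model.fit(X_study_a, y_study_a)                    # train on study A
print("cross-study accuracy:", model.score(X_study_b, y_study_b))
```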

    Transcriptomics in Toxicogenomics, Part III: Data Modelling for Risk Assessment

    Transcriptomics data are relevant to address a number of challenges in Toxicogenomics (TGx). After careful planning of exposure conditions and data preprocessing, the TGx data can be used in predictive toxicology, where more advanced modelling techniques are applied. The large volume of molecular profiles produced by omics-based technologies allows the development and application of artificial intelligence (AI) methods in TGx. Indeed, publicly available omics datasets are constantly increasing, together with a plethora of methods made available to facilitate their analysis, interpretation, and the generation of accurate and stable predictive models. In this review, we present the state of the art of data modelling applied to transcriptomics data in TGx. We show how benchmark dose (BMD) analysis can be applied to TGx data. We review read-across and adverse outcome pathway (AOP) modelling methodologies. We discuss how network-based approaches can be successfully employed to clarify the mechanism of action (MOA) or specific biomarkers of exposure. We also describe the main AI methodologies applied to TGx data to create predictive classification and regression models, and we address current challenges. Finally, we present a short description of deep learning (DL) and data integration methodologies applied in these contexts. Modelling of TGx data represents a valuable tool for more accurate chemical safety assessment. This review is the third part of a three-article series on Transcriptomics in Toxicogenomics.
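
    Benchmark dose (BMD) analysis is mentioned above only at a high level; as an illustrative sketch (not the review's protocol), the code below fits a Hill-type dose-response curve to one gene's expression and solves numerically for the dose giving a chosen benchmark response. The model form, data, and 10% benchmark are assumptions.

```python
# Sketch: benchmark dose (BMD) estimation for a single gene.
# Fit a Hill-type dose-response model, then solve for the dose at which the
# response exceeds the control level by 10%. All values are illustrative.
import numpy as np
from scipy.optimize import curve_fit, brentq

def hill(dose, baseline, top, ec50, n):
    return baseline + (top - baseline) * dose**n / (ec50**n + dose**n)

doses = np.array([0.0, 0.1, 0.3, 1.0, 3.0, 10.0])            # exposure doses
expression = np.array([1.00, 1.02, 1.10, 1.35, 1.70, 1.85])  # mean expression per dose

params, _ = curve_fit(hill, doses, expression, p0=[1.0, 2.0, 1.0, 1.0], maxfev=10000)
baseline = params[0]
bmr_level = baseline * 1.10                                  # benchmark response: +10% over control
bmd = brentq(lambda d: hill(d, *params) - bmr_level, 1e-6, doses.max())
print(f"estimated BMD: {bmd:.3f}")
```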

    Machine Learning Methods for Brain Image Analysis

    Understanding how the brain functions and quantifying compound interactions between complex synaptic networks inside the brain remain some of the most challenging problems in neuroscience. A lack or overabundance of data, a shortage of manpower, and the heterogeneity of data from various species all add complexity to an already perplexing problem. Processing such vast amounts of brain data needs to be automated, yet with accuracy close to manual, human-level performance. These automated methods also need to generalize well enough to accommodate data from different species. In addition, novel approaches and techniques are becoming a necessity to reveal correlations between different data modalities in the brain at the global level. In this dissertation, I mainly focus on two problems: automatic segmentation of brain electron microscopy (EM) images and stacks, and integrative analysis of gene expression and synaptic connectivity in the brain. I propose to use deep learning algorithms for the 2D segmentation of EM images. I designed an automated pipeline with novel insights that achieved state-of-the-art performance on segmentation of the Drosophila brain. I also propose a novel technique for 3D segmentation of EM image stacks that can be trained end-to-end with no prior knowledge of the data. This technique was evaluated in an ongoing online challenge for 3D segmentation of neurites, where it achieved accuracy close to that of a second human observer. Later, I employed ensemble learning methods to perform the first systematic integrative analysis of the genome and connectome in the mouse brain at both the regional and voxel level. I show that connectivity signals can be predicted from gene expression signatures with extremely high accuracy. Furthermore, I show that only a certain fraction of genes is responsible for this predictive power. A rich functional and cellular analysis of these genes is detailed to validate these findings.
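
    The genome-connectome analysis above is described only in outline; as a hedged sketch of the general idea (an ensemble model predicting a connectivity signal from per-region expression signatures, then inspecting which genes carry the signal), the code below uses a random-forest regressor on synthetic data. Shapes, targets, and names are assumptions, not the dissertation's pipeline.

```python
# Sketch: predicting a connectivity signal from gene-expression signatures with
# an ensemble model. Regions, genes, and targets are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n_regions, n_genes = 200, 3000
expression = rng.normal(size=(n_regions, n_genes))     # expression signature per brain region
# Toy target: connectivity driven by a small subset of genes plus noise.
connectivity = expression[:, :50].sum(axis=1) + rng.normal(scale=0.5, size=n_regions)

model = RandomForestRegressor(n_estimators=300, random_state=0)
print("R^2:", cross_val_score(model, expression, connectivity, cv=5).mean())

# Feature importances hint at which genes carry the predictive signal.
model.fit(expression, connectivity)
top_genes = np.argsort(model.feature_importances_)[::-1][:20]
```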

    What I talk about when I talk about integration of single-cell data

    Over the past decade, single-cell technologies have evolved from profiling hundreds of cells to millions of cells, and have expanded from a single data modality to multiple views at single-cell resolution, including the genome, epigenome, transcriptome, and so on. With the advance of these single-cell technologies, the boom in multimodal single-cell data creates a valuable resource for understanding cellular heterogeneity and molecular mechanisms at a comprehensive level. However, large-scale multimodal single-cell data also present a huge computational challenge for insightful integrative analysis. Here, I lay out the data integration problems that the single-cell research community is interested in and introduce computational principles for solving them. In the following chapters, I present four computational methods for data integration under different scenarios. Finally, I discuss future directions and potential applications of single-cell data integration.
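
    Data integration is discussed above only in the abstract; as one hedged illustration of a classic integration idea (projecting two modalities measured on the same cells into a shared low-dimensional space via canonical correlation analysis), the sketch below uses scikit-learn's CCA on two synthetic single-cell matrices. All shapes and names are assumptions, not the methods developed in this thesis.

```python
# Sketch: projecting two paired single-cell data matrices (e.g., two modalities
# measured on the same cells) into a shared CCA space, a common starting point
# for integration. All data and dimensions are synthetic placeholders.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(3)
n_cells, n_features = 500, 200
shared = rng.normal(size=(n_cells, 20))                      # latent shared structure
X_rna  = shared @ rng.normal(size=(20, n_features)) + rng.normal(size=(n_cells, n_features))
X_atac = shared @ rng.normal(size=(20, n_features)) + rng.normal(size=(n_cells, n_features))

cca = CCA(n_components=20, max_iter=1000)
rna_embedding, atac_embedding = cca.fit_transform(X_rna, X_atac)
# Cells can now be clustered or matched across modalities in the joint space.
print(rna_embedding.shape, atac_embedding.shape)             # (500, 20) (500, 20)
```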