
    A multivariate approach to the integration of multi-omics datasets

    Background: To leverage the potential of multi-omics studies, exploratory data analysis methods that provide systematic integration and comparison of multiple layers of omics information are required. We describe multiple co-inertia analysis (MCIA), an exploratory data analysis method that identifies co-relationships between multiple high dimensional datasets. Based on a covariance optimization criterion, MCIA simultaneously projects several datasets into the same dimensional space, transforming diverse sets of features onto the same scale, to extract the most variable features from each dataset and facilitate biological interpretation and pathway analysis. Results: We demonstrate integration of multiple layers of information using MCIA, applied to two typical "omics" research scenarios. The integration of transcriptome and proteome profiles of cells in the NCI-60 cancer cell line panel revealed distinct, complementary features, which together increased the coverage and power of pathway analysis. Our analysis highlighted the importance of the leukemia extravasation signaling pathway in leukemia, which was not highly ranked in the analysis of any individual dataset. Secondly, we compared transcriptome profiles of high grade serous ovarian tumors that were obtained on two different microarray platforms and next generation RNA-sequencing, to identify the most informative platform and extract robust biomarkers of molecular subtypes. We discovered that RNA-sequencing data processed using RPKM had greater variance than data processed with MapSplice and RSEM. We provided novel markers highly associated with tumor molecular subtype, combined from four data platforms. MCIA is implemented and available in the R/Bioconductor "omicade4" package. Conclusion: We believe MCIA is an attractive method for data integration and visualization of several datasets of multi-omics features observed on the same set of individuals. 
The method is not dependent on feature annotation, and thus it can extract important features even when they are not present across all datasets. MCIA provides simple graphical representations for the identification of relationships between large datasets.

    Survival-Related Clustering of Cancer Patients by Integrating Clinical and Biological Datasets

    Subtype-based treatments and drug therapies are essential aspects to be considered in cancer patients' clinical trials to provide appropriate personalized therapies. With the advancement of next-generation sequencing technology, several computational models that integrate genomic and transcriptomic datasets (i.e., multi-omics) to predict subtype-based classification in cancer patients have emerged. However, integration of the prognostic features from the clinical data, related to survival risks, with the multi-omics datasets in the prediction of different subtypes remains limited and is an important research area to explore. In this study, we proposed a data integration pipeline with the prognostic features from the clinical data and multi-omics datasets to predict the survival-risk-based subtypes in Kidney Renal Clear Cell Carcinoma (KIRC) patients from The Cancer Genome Atlas (TCGA) database. Firstly, we applied an unsupervised clustering algorithm on KIRC patients and clustered them into two survival-risk-based subgroups, i.e., subtypes. Then, using the clustering-based subtype labels as class labels for cancer patients, we trained a supervised classification model to determine the class label of unlabeled patients. In our clustering step, we applied a multivariate Cox Proportional Hazard (Cox-PH) model to select the survival-related, prognostically significant features (p-value < 0.05) from the patients' multivariate clinical data. Then, we used the Silhouette Coefficient to determine the optimal number (k) of clusters. In our classification step, we integrated high dimensional multi-omics datasets with three different data modalities (gene expression, microRNA expression, and DNA methylation). We utilized a dimension-reduction approach, followed by a univariate Cox-PH for each reduced data modality with patients' survival status. Then, we selected the survival-related reduced-omics-features in our classification model. 
In this step, we applied a supervised classification method with 10-fold cross-validation to check our survival-based subtype prediction accuracy. We tested multiple machine learning and deep learning algorithms in different steps of the pipeline for clustering (K-means, K-modes, and Gaussian mixture model), dimension reduction (Denoising Autoencoder and Principal Component Analysis), and classification (Support Vector Machine and Random Forest). We proposed an optimized model with the highest survival-specific-subtype classification accuracy as the final model.
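
The clustering step described above (choosing k by the Silhouette Coefficient) can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration: the toy two-feature `points`, the helper names `kmeans` and `silhouette`, and the deterministic farthest-point initialisation. It does not reproduce the authors' pipeline, their Cox-PH feature selection, or the TCGA KIRC data.

```python
import math

def kmeans(points, k, iters=50):
    """Tiny Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from every centre chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centre.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each centre becomes the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's centre
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient, (b - a) / max(a, b) averaged over points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy data: two well-separated groups of "patients".
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.3),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.3)]

# Pick the k with the highest mean silhouette among a few candidates.
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

In a fuller pipeline the resulting cluster labels would then serve as class labels for a supervised classifier evaluated with cross-validation, as the abstract describes.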

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.

    Updates in metabolomics tools and resources: 2014-2015

    Data processing and interpretation represent the most challenging and time-consuming steps in high-throughput metabolomic experiments, regardless of the analytical platforms (MS or NMR spectroscopy based) used for data acquisition. Improved machinery in metabolomics generates increasingly complex datasets that create the need for more and better processing and analysis software and in silico approaches to understand the resulting data. However, a comprehensive source of information describing the utility of the most recently developed and released metabolomics resources (in the form of tools, software, and databases) is currently lacking. Thus, here we provide an overview of freely available, open-source tools, algorithms, and frameworks to make both upcoming and established metabolomics researchers aware of recent developments, in an attempt to advance and facilitate data processing workflows in their metabolomics research. The major topics include tools and research for data processing, data annotation, and data visualization in MS and NMR-based metabolomics. Most tools described in this review are dedicated to untargeted metabolomics workflows; however, some more specialist tools are described as well. All tools and resources described, including their analytical and computational platform dependencies, are summarized in an overview table.

    Structured penalized regression for drug sensitivity prediction

    Large-scale in vitro drug sensitivity screens are an important tool in personalized oncology to predict the effectiveness of potential cancer drugs. The prediction of the sensitivity of cancer cell lines to a panel of drugs is a multivariate regression problem with high-dimensional heterogeneous multi-omics data as input data and with potentially strong correlations between the outcome variables, which represent the sensitivity to the different drugs. We propose a joint penalized regression approach with structured penalty terms which allow us to utilize the correlation structure between drugs with group-lasso-type penalties and at the same time address the heterogeneity between omics data sources by introducing data-source-specific penalty factors to penalize different data sources differently. By combining integrative penalty factors (IPF) with tree-guided group lasso, we create the IPF-tree-lasso method. We present a unified framework to transform more general IPF-type methods to the original penalized method. Because the structured penalty terms have multiple parameters, we demonstrate how the interval-search Efficient Parameter Selection via Global Optimization (EPSGO) algorithm can be used to optimize multiple penalty parameters efficiently. Simulation studies show that IPF-tree-lasso can improve the prediction performance compared to other lasso-type methods, in particular for heterogeneous data sources. Finally, we employ the new methods to analyse data from the Genomics of Drug Sensitivity in Cancer project. Comment: Zhao Z, Zucknick M (2020). Structured penalized regression for drug sensitivity prediction. Journal of the Royal Statistical Society, Series C. 19 pages, 6 figures and 2 tables.
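
The idea of transforming an IPF-type penalty back to an ordinary penalized problem can be shown in its simplest case. The sketch below is an assumption-laden illustration, not the paper's IPF-tree-lasso: it uses a single response, a plain lasso penalty with one penalty factor per data source, a 1/(2n) loss scaling chosen by convention, and hypothetical helper names and toy numbers. It checks numerically that rescaling each data block by lambda_1/lambda_m (and the coefficients by the inverse) leaves the objective unchanged while needing only one common penalty.

```python
import math

def ipf_lasso_objective(blocks, y, betas, lambdas):
    """1/(2n) * ||y - sum_m X_m b_m||^2 + sum_m lambda_m * ||b_m||_1."""
    n = len(y)
    fitted = [0.0] * n
    for X, b in zip(blocks, betas):
        for i, row in enumerate(X):
            fitted[i] += sum(x * c for x, c in zip(row, b))
    loss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (2 * n)
    penalty = sum(lam * sum(abs(c) for c in b) for lam, b in zip(lambdas, betas))
    return loss + penalty

def to_common_penalty(blocks, betas, lambdas):
    """Rescale block m by lambda_1/lambda_m so that one penalty lambda_1 suffices."""
    lam1 = lambdas[0]
    scaled_blocks = [[[x * lam1 / lam for x in row] for row in X]
                     for X, lam in zip(blocks, lambdas)]
    # Coefficients are scaled inversely so the fitted values are unchanged.
    scaled_betas = [[c * lam / lam1 for c in b] for b, lam in zip(betas, lambdas)]
    return scaled_blocks, scaled_betas, [lam1] * len(lambdas)

# Hypothetical toy data: 3 samples, two data sources with 2 and 1 features.
blocks = [
    [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]],   # source 1 (e.g. expression)
    [[3.0], [1.0], [-2.0]],                  # source 2 (e.g. methylation)
]
y = [1.0, 0.0, 2.0]
betas = [[0.4, -0.2], [0.1]]
lambdas = [0.5, 2.0]  # source 2 is penalised four times as strongly

original = ipf_lasso_objective(blocks, y, betas, lambdas)
s_blocks, s_betas, s_lambdas = to_common_penalty(blocks, betas, lambdas)
transformed = ipf_lasso_objective(s_blocks, y, s_betas, s_lambdas)
print(original, transformed)
```

Because the two objectives agree for every coefficient vector, a standard lasso solver run on the rescaled blocks would solve the source-specific problem in this simple setting; the tree-guided group penalties of the paper are not covered by this sketch.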