182 research outputs found

    Prediction with Dimension Reduction of Multiple Molecular Data Sources for Patient Survival

    Predictive modeling from high-dimensional genomic data is often preceded by a dimension reduction step, such as principal components analysis (PCA). However, the application of PCA is not straightforward for multi-source data, wherein multiple sources of 'omics data measure different but related biological components. In this article we utilize recent advances in the dimension reduction of multi-source data for predictive modeling. In particular, we apply exploratory results from Joint and Individual Variation Explained (JIVE), an extension of PCA for multi-source data, to the prediction of differing response types. We conduct simulations to illustrate the practical advantages and interpretability of our approach. As an application example we consider predicting survival for Glioblastoma Multiforme (GBM) patients from three data sources measuring mRNA expression, miRNA expression, and DNA methylation. We also introduce a method to estimate JIVE scores for new samples that were not used in the initial dimension reduction, and study its theoretical properties; this method is implemented in the R package r.jive on CRAN, in the function 'jive.predict'. Comment: 11 pages, 9 figures
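    The out-of-sample score estimation described above has a simple PCA analogue. Below is a minimal sketch (all names are illustrative, not the r.jive API): learn loadings on training samples, then estimate scores for unseen samples by projecting them onto those loadings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one omics source: rows = features, columns = samples
# (the JIVE convention).  Names are illustrative, not the r.jive API.
X_train = rng.normal(size=(50, 30))
X_new = rng.normal(size=(50, 5))

# Center features using training means only, so new samples see no leakage.
mu = X_train.mean(axis=1, keepdims=True)
Xc = X_train - mu

# Rank-r PCA via SVD: loadings U_r (features x r), scores (r x samples).
r = 3
U, _, _ = np.linalg.svd(Xc, full_matrices=False)
U_r = U[:, :r]
scores_train = U_r.T @ Xc

# Score estimation for unseen samples: least-squares projection onto the
# learned loadings -- the PCA analogue of what 'jive.predict' does for
# JIVE (for orthonormal loadings this is just U_r^T (x - mu)).
scores_new = U_r.T @ (X_new - mu)

assert scores_train.shape == (r, 30) and scores_new.shape == (r, 5)
```

    The low-dimensional scores can then feed any downstream survival model in place of the raw features.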

    A primer on correlation-based dimension reduction methods for multi-omics analysis

    The continuing advances of omic technologies mean that it is now feasible to measure the numerous features that collectively reflect the molecular properties of a sample. When multiple omic methods are used, statistical and computational approaches can exploit these large, connected profiles. Multi-omics is the integration of different omic data sources from the same biological sample. In this review, we focus on correlation-based dimension reduction approaches for single omic datasets, followed by methods for pairs of omics datasets, before detailing further techniques for three or more omic datasets. We also briefly detail network methods that apply when three or more omic datasets are available and which complement correlation-oriented tools. To aid readers new to this area, all of these are linked to relevant R packages that implement the procedures. Finally, we discuss scenarios of experimental design and present road maps that simplify the selection of appropriate analysis methods. This review will help researchers navigate the emerging methods for multi-omics, integrate diverse omic datasets appropriately, and embrace the opportunity of population multi-omics. Comment: 30 pages, 2 figures, 6 tables
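    As a small illustration of the correlation-based tools such a review covers, here is classical canonical correlation analysis (CCA) between two toy omics blocks, computed from the whitened cross-covariance. This is a generic sketch, not any particular R package's implementation, and the simulated data are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
z = rng.normal(size=(n, 1))                                        # shared latent signal
X = z @ rng.normal(size=(1, 6)) + 0.5 * rng.normal(size=(n, 6))    # omic view 1
Y = z @ rng.normal(size=(1, 4)) + 0.5 * rng.normal(size=(n, 4))    # omic view 2

def cca_correlations(X, Y, eps=1e-8):
    """Canonical correlations of two data blocks (classical CCA)."""
    m = X.shape[0]
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)

    def inv_sqrt(C):
        # Inverse matrix square root via eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

    Sxx = Xc.T @ Xc / (m - 1)
    Syy = Yc.T @ Yc / (m - 1)
    Sxy = Xc.T @ Yc / (m - 1)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations.
    return np.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy),
                         compute_uv=False)

rho = cca_correlations(X, Y)
# The planted shared factor should give a large leading canonical correlation.
assert rho[0] > 0.7
```

    Methods for three or more blocks generalize this idea by maximizing correlation-type objectives across all pairs of views.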

    Survival regression by data fusion

    Any knowledge discovery could in principle benefit from the fusion of directly or even indirectly related data sources. In this paper we explore whether data fusion by simultaneous matrix factorization can be adapted for survival regression. We propose a new method that jointly infers latent data factors from a number of heterogeneous data sets and estimates the regression coefficients of a survival model. We have applied the method to the CAMDA 2014 large-scale Cancer Genomes Challenge and modeled survival time as a function of gene, protein and miRNA expression data, and data on methylated and mutated regions. We find that both the joint inference of data factors and regression coefficients and the data fusion procedure are crucial for performance. Our approach is substantially more accurate than the baseline Aalen's additive model. The latent factors inferred by our approach can be mined further; for the CAMDA challenge, we found that the most informative factors are related to known cancer processes.
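    The paper infers factors and coefficients jointly; the sketch below is a deliberately simplified two-stage version with illustrative names: alternating least squares fits a factor matrix shared across two toy heterogeneous sources, and ordinary least squares (a stand-in for the survival model) then regresses the response on the fused factors.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 80, 2
# Two toy heterogeneous sources generated from shared latent factors,
# plus a continuous response driven by the same factors.
U_true = rng.normal(size=(n, r))
X1 = U_true @ rng.normal(size=(r, 10)) + 0.1 * rng.normal(size=(n, 10))
X2 = U_true @ rng.normal(size=(r, 7)) + 0.1 * rng.normal(size=(n, 7))
y = U_true @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=n)

# Alternating least squares for a factor matrix U shared by both sources.
U = rng.normal(size=(n, r))
for _ in range(100):
    V1 = np.linalg.lstsq(U, X1, rcond=None)[0]
    V2 = np.linalg.lstsq(U, X2, rcond=None)[0]
    # Stacking both loadings couples the sources through the shared factors.
    U = np.linalg.lstsq(np.vstack([V1.T, V2.T]),
                        np.hstack([X1, X2]).T, rcond=None)[0].T

# Stage two: regress the response on the fused latent factors.
beta = np.linalg.lstsq(U, y, rcond=None)[0]
resid = y - U @ beta
assert resid @ resid / (y @ y) < 0.05
```

    The fused factors explain the response well here because both sources share the latent structure that drives it; the paper's contribution is to make this coupling explicit and simultaneous rather than two-stage.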

    Low-rank matrix factorization in learning with multiple kernel methods

    The increased rate of data collection, storage, and availability results in a corresponding interest in data analyses and predictive models based on the simultaneous inclusion of multiple data sources. This tendency is ubiquitous in practical applications of machine learning, including recommender systems, social network analysis, finance and computational biology. The heterogeneity and size of the typical datasets call for simultaneous dimensionality reduction and inference from multiple data sources in a single model. Matrix factorization and multiple kernel learning models are two general approaches that satisfy this goal. This work focuses on two specific goals, namely i) finding interpretable, non-overlapping (orthogonal) data representations through matrix factorization and ii) regression with multiple kernels through the low-rank approximation of the corresponding kernel matrices, providing non-linear outputs and interpretable kernel selection. The motivation for the models and algorithms designed in this work stems from RNA biology and the rich complexity of protein-RNA interactions. Although the regulation of RNA fate happens at many levels - bringing in various possible data views - we show how different questions can be answered directly through constraints in the model design. We have developed an integrative orthogonal nonnegative matrix factorization (iONMF) to integrate multiple data sources and discover non-overlapping, class-specific RNA binding patterns of varying strengths. We show that the integration of multiple data sources improves the predictive accuracy of retrieval of RNA binding sites and report on a number of inferred protein-specific patterns, consistent with experimentally determined properties. Kernel methods are a principled way to extend linear models to non-linear settings.
    Multiple kernel learning enables modelling with different data views, but is limited by the quadratic computation and storage complexity of the kernel matrix. Considerable savings in time and memory can be expected if kernel approximation and multiple kernel learning are performed simultaneously. We present the Mklaren algorithm, which achieves this goal via Incomplete Cholesky Decomposition, where the selection of basis functions is based on Least-angle regression, resulting in linear complexity in both the number of data points and the number of kernels. Considerable savings in approximation rank are observed when compared to general kernel matrix decompositions, and the ranks are comparable to those of methods specialized to particular kernel function families. The principal advantages of Mklaren are independence of the kernel function form, robust inducing point selection, and the ability to use different kernels in different regions of both continuous and discrete input spaces, such as numeric vector spaces, strings or trees, providing a platform for bioinformatics. In summary, we design novel models and algorithms based on matrix factorization and kernel learning, combining regression with insights into the domain of interest by identifying relevant patterns, kernels and inducing points, while scaling to millions of data points and data views.
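    The Incomplete Cholesky Decomposition at the heart of such kernel approximations can be sketched compactly. Below is a generic greedy pivoted incomplete Cholesky of an RBF kernel matrix; note that Mklaren's actual pivot selection is driven by Least-angle regression against the response, whereas this sketch uses the standard largest-residual-diagonal rule, and the bandwidth is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))
# RBF kernel matrix (bandwidth chosen so the kernel has low effective rank).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 4.0)

def incomplete_cholesky(K, rank):
    """Greedy pivoted incomplete Cholesky: K ~= G @ G.T, G is n-by-rank."""
    n = K.shape[0]
    G = np.zeros((n, rank))
    d = np.diag(K).copy()              # residual diagonal
    for j in range(rank):
        i = int(np.argmax(d))          # pivot on the largest residual variance
        G[:, j] = (K[:, i] - G @ G[i, :]) / np.sqrt(d[i])
        d -= G[:, j] ** 2
    return G

G = incomplete_cholesky(K, 20)
err = np.linalg.norm(K - G @ G.T) / np.linalg.norm(K)
assert err < 0.2
```

    Because only selected columns of K are touched, the factor G can be built without ever forming the full kernel matrix, which is the source of the linear time and memory complexity.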

    Integration of multi-scale protein interactions for biomedical data analysis

    With the advancement of modern technologies, we observe an increasing accumulation of biomedical data about diseases. There is a need for computational methods to sift through and extract knowledge from the diverse data available in order to improve our mechanistic understanding of diseases and improve patient care. Biomedical data come in various forms, as exemplified by the various omics data. Existing studies have shown that each form of omics data gives only partial information on the state of a cell, motivating the joint mining of multi-omics, multi-modal data to extract integrated system knowledge. The interactome is of particular importance as it enables the modelling of dependencies arising from molecular interactions. This Thesis takes a special interest in the multi-scale protein interactome and its integration with computational models to extract relevant information from biomedical data. We define multi-scale interactions at different omics scales that involve proteins: pairwise protein-protein interactions, multi-protein complexes, and biological pathways. Using hypergraph representations, we motivate considering higher-order protein interactions, highlighting the complementary biological information contained in the multi-scale interactome. Based on those results, we further investigate how those multi-scale protein interactions can be used as either prior knowledge or auxiliary data to develop machine learning algorithms. First, we design a neural network that uses the multi-scale organization of proteins in a cell into biological pathways as prior knowledge and train it to predict a patient's diagnosis based on transcriptomics data. From the trained models, we develop a strategy to extract biomedical knowledge pertaining to the diseases investigated. Second, we propose a general framework based on Non-negative Matrix Factorization to integrate the multi-scale protein interactome with multi-omics data.
    We show that our approach outperforms existing methods and provides biomedical insights and relevant hypotheses for specific cancer types.
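    A minimal joint NMF conveys the flavor of this kind of integrative factorization (a generic sketch, not the thesis's exact framework): two toy omics views share one nonnegative sample-factor matrix W, fitted with Lee-Seung multiplicative updates.

```python
import numpy as np

rng = np.random.default_rng(4)
# Two toy omics views sharing one nonnegative sample-factor matrix W.
n, r = 40, 3
W_true = rng.random((n, r))
X1 = W_true @ rng.random((r, 25))    # e.g. expression
X2 = W_true @ rng.random((r, 15))    # e.g. methylation

W = rng.random((n, r)) + 0.1
H1 = rng.random((r, 25)) + 0.1
H2 = rng.random((r, 15)) + 0.1
eps = 1e-12
for _ in range(1000):
    # Per-view loadings get standard multiplicative updates.
    H1 *= (W.T @ X1) / (W.T @ W @ H1 + eps)
    H2 *= (W.T @ X2) / (W.T @ W @ H2 + eps)
    # W is updated against both views, which is what ties them together.
    W *= (X1 @ H1.T + X2 @ H2.T) / (W @ (H1 @ H1.T + H2 @ H2.T) + eps)

err = (np.linalg.norm(X1 - W @ H1) + np.linalg.norm(X2 - W @ H2)) \
      / (np.linalg.norm(X1) + np.linalg.norm(X2))
assert err < 0.1
```

    Interactome information would enter such a framework as additional regularization terms on W or on the loadings; only the shared-factor mechanics are shown here.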

    Regularized and Smooth Double Core Tensor Factorization for Heterogeneous Data

    We introduce a general tensor model suitable for data analytic tasks on heterogeneous data sets, wherein there are joint low-rank structures within groups of observations, but also discriminative structures across different groups. To capture such complex structures, a double core tensor (DCOT) factorization model is introduced together with a family of smoothing loss functions. By leveraging the proposed smoothing function, the model accurately estimates the model factors, even in the presence of missing entries. A linearized ADMM method is employed to solve regularized versions of DCOT factorizations that avoid large tensor operations and large memory storage requirements. Further, we establish the method's global convergence theoretically, together with consistency of the estimates of the model parameters. The effectiveness of the DCOT model is illustrated on several real-world examples, including image completion, recommender systems, subspace clustering and detecting modules in heterogeneous omics multi-modal data, since it provides more insightful decompositions than conventional tensor methods.
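    For orientation, a plain CP decomposition fitted by alternating least squares shows the basic tensor-factorization mechanics that models like DCOT build on; DCOT itself adds the double core, smoothing losses, and a linearized ADMM solver, none of which are sketched here.

```python
import numpy as np

rng = np.random.default_rng(5)
I, J, K, r = 10, 8, 6, 2
# Planted rank-2 third-order tensor.
A = rng.normal(size=(I, r))
B = rng.normal(size=(J, r))
C = rng.normal(size=(K, r))
T = np.einsum('ir,jr,kr->ijk', A, B, C)

def khatri_rao(U, V):
    # Column-wise Kronecker product: rows indexed by (row of U, row of V).
    return np.einsum('ir,jr->ijr', U, V).reshape(-1, U.shape[1])

# Alternating least squares over the three factor matrices.
Ah = rng.normal(size=(I, r))
Bh = rng.normal(size=(J, r))
Ch = rng.normal(size=(K, r))
for _ in range(200):
    Ah = np.linalg.lstsq(khatri_rao(Bh, Ch),
                         T.reshape(I, -1).T, rcond=None)[0].T
    Bh = np.linalg.lstsq(khatri_rao(Ah, Ch),
                         T.transpose(1, 0, 2).reshape(J, -1).T, rcond=None)[0].T
    Ch = np.linalg.lstsq(khatri_rao(Ah, Bh),
                         T.transpose(2, 0, 1).reshape(K, -1).T, rcond=None)[0].T

T_hat = np.einsum('ir,jr,kr->ijk', Ah, Bh, Ch)
err = np.linalg.norm(T - T_hat) / np.linalg.norm(T)
assert err < 0.1
```

    Each ALS step is an ordinary least-squares problem against a matricized view of the tensor, which is why the method avoids any operation larger than a tall-skinny solve.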

    Generative Models of Biological Variations in Bulk and Single-cell RNA-seq

    The explosive growth of next-generation sequencing data enhances our ability to understand biological processes at an unprecedented resolution. Meanwhile, organizing and utilizing this tremendous amount of data becomes a big challenge. High-throughput technology provides us a snapshot of all underlying biological activities, but this kind of extremely high-dimensional data is hard to interpret. Due to the curse of dimensionality, the measurements are sparse and far from enough to shape the actual manifold in the high-dimensional space. On the other hand, the measurements may contain structured noise, such as technical or nuisance biological variation, which can interfere with downstream interpretation. Generative modeling is a powerful tool to make sense of the data and generate compact representations summarizing the embedded biological information. This thesis introduces three generative models that help amplify biological signals buried in noisy bulk and single-cell RNA-seq data. In Chapter 2, we propose a semi-supervised deconvolution framework called PLIER which can identify regulation in cell-type proportions and in specific pathways that control gene expression. PLIER has inspired the development of MultiPLIER and has been used to infer context-specific genotype effects in the brain. In Chapter 3, we construct a supervised transformation named DataRemix to normalize bulk gene expression profiles in order to maximize the biological findings with respect to a variety of downstream tasks. By reweighing the contribution of hidden factors, we are able to reveal hidden biological signals without any external dataset-specific knowledge. We apply DataRemix to the ROSMAP dataset and report the first replicable trans-eQTL effect in the human brain. In Chapter 4, we focus on scRNA-seq and introduce NIFA, an unsupervised decomposition framework that combines the desired properties of PCA, ICA and NMF.
    It simultaneously models uni- and multi-modal factors, isolating discrete cell-type identity and continuous pathway-level variation into separate components. The work presented in Chapter 2 has been published as a journal article. The work in Chapters 3 and 4 is under submission and available as preprints on bioRxiv.
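    One simple way to "reweigh the contribution of hidden factors", loosely in the spirit of the DataRemix idea described above (the method's actual parametrization may differ, and the hyperparameters here are purely illustrative): estimate hidden factors by SVD and damp the leading ones, which often absorb nuisance variation in expression data.

```python
import numpy as np

rng = np.random.default_rng(6)
# Toy expression matrix: samples x genes.
X = rng.normal(size=(30, 20))

# Hidden factors via SVD, then shrink the top-k singular values.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
k, p = 3, 0.5                  # illustrative hyperparameters, not from the thesis
d_new = d.copy()
d_new[:k] = d_new[:k] ** p     # damp the k leading hidden factors
X_remix = U @ np.diag(d_new) @ Vt

# Sanity check: with no reweighting the SVD reconstructs X exactly.
assert np.allclose(U @ np.diag(d) @ Vt, X)
```

    In a supervised setting such as DataRemix, the reweighting parameters would be tuned against a downstream objective (e.g. eQTL replication) rather than fixed by hand as here.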