27 research outputs found

    Cross-platform comparison and visualisation of gene expression data using co-inertia analysis

    BACKGROUND: Rapid development of DNA microarray technology has resulted in different laboratories adopting numerous different protocols and technological platforms, which has severely impacted the comparability of array data. Current cross-platform comparisons of microarray gene expression data are usually based on cross-referencing the annotation of each gene transcript represented on the arrays, extracting a list of genes common to all arrays and comparing expression data of this gene subset. Unfortunately, filtering of genes to a subset represented across all arrays often excludes many thousands of genes, because different subsets of genes from the genome are represented on different arrays. We describe the application of a powerful yet simple method for cross-platform comparison of gene expression data. Co-inertia analysis (CIA) is a multivariate method that identifies trends or co-relationships in multiple datasets which contain the same samples. CIA simultaneously finds ordinations (dimension reduction diagrams) from the datasets that are most similar. It does this by finding successive axes from the two datasets with maximum covariance. CIA can be applied to datasets where the number of variables (genes) far exceeds the number of samples (arrays), as is the case with microarray analyses. RESULTS: We illustrate the power of CIA for cross-platform analysis of gene expression data by using it to identify the main common relationships in expression profiles on a panel of 60 tumour cell lines from the National Cancer Institute (NCI) which have been subjected to microarray studies using both Affymetrix and spotted cDNA array technology. The co-ordinates of the CIA projections of the cell lines from each dataset are graphed in a bi-plot and are connected by a line, the length of which indicates the divergence between the two datasets. Thus, CIA provides a graphical representation of consensus and divergence between the gene expression profiles from different microarray platforms. Secondly, the genes that define the main trends in the analysis can be easily identified. CONCLUSIONS: CIA is a robust, efficient approach to coupling of gene expression datasets. CIA provides simple graphical representations of the results, making it a particularly attractive method for the identification of relationships between large datasets
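The idea of "finding successive axes from the two datasets with maximum covariance" can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain column centering with uniform sample weights (full CIA is usually run on top of a prior ordination with its own row and column weights), it finds only the first pair of axes, and the function names (`center`, `first_coinertia_axis`, and the helpers) are hypothetical. The leading axes are obtained by power iteration on the cross-covariance between the two datasets, without ever forming the genes-by-genes matrix.

```python
import math

def center(M):
    """Column-center a samples x variables matrix given as a list of rows."""
    n = len(M)
    means = [sum(col) / n for col in zip(*M)]
    return [[x - m for x, m in zip(row, means)] for row in M]

def unit(vec):
    """Scale a vector to unit Euclidean length (assumes it is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def project(M, axis):
    """Sample scores: each row of M projected onto the axis."""
    return [sum(x * a for x, a in zip(row, axis)) for row in M]

def back_project(M, scores):
    """M^T scores / n: covariance of each variable with the sample scores."""
    n = len(M)
    return [sum(x * s for x, s in zip(col, scores)) / n for col in zip(*M)]

def first_coinertia_axis(X, Y, n_iter=200):
    """Leading axes (u, v) maximising cov(Xc u, Yc v) for matched samples.

    Power iteration on C = Xc^T Yc / n, applied implicitly so the cost
    scales with samples x variables rather than variables x variables.
    Returns the sample scores from each dataset and their covariance.
    """
    Xc, Yc = center(X), center(Y)
    v = unit([1.0] * len(Yc[0]))          # arbitrary starting axis
    for _ in range(n_iter):
        u = unit(back_project(Xc, project(Yc, v)))  # u proportional to C v
        v = unit(back_project(Yc, project(Xc, u)))  # v proportional to C^T u
    tx, ty = project(Xc, u), project(Yc, v)
    cov = sum(a * b for a, b in zip(tx, ty)) / len(tx)
    return tx, ty, cov
```

Calling `first_coinertia_axis(X, Y)` with two lists of rows that share the same samples returns one score per sample from each dataset. Plotting the two sets of scores and joining each sample's pair of points gives the kind of divergence lines the abstract describes, although a real analysis would use at least two axes and normalised scores.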

    Integrated Analysis of Multiple Microarray Datasets Identifies a Reproducible Survival Predictor in Ovarian Cancer

    BACKGROUND: Public data integration may help overcome challenges in clinical implementation of microarray profiles. We integrated several ovarian cancer datasets to identify a reproducible predictor of survival. METHODOLOGY/PRINCIPAL FINDINGS: Four microarray datasets from different institutions comprising 265 advanced stage tumors were uniformly reprocessed into a single training dataset, also adjusting for inter-laboratory variation ("batch-effect"). Supervised principal component survival analysis was employed to identify prognostic models. Models were independently validated in a 61-patient cohort using a custom array genechip and a publicly available 229-array dataset. Molecular correspondence of high- and low-risk outcome groups between training and validation datasets was demonstrated using Subclass Mapping. Previously established molecular phenotypes in the 2nd validation set were correlated with high- and low-risk outcome groups. Functional representational and pathway analysis was used to explore gene networks associated with high- and low-risk phenotypes. A 19-gene model showed optimal performance in the training set (median OS 31 and 78 months, p < 0.01), 1st validation set (median OS 32 months versus not-yet-reached, p = 0.026) and 2nd validation set (median OS 43 versus 61 months, p = 0.013), maintaining independent prognostic power in multivariate analysis. There was strong molecular correspondence of the respective high- and low-risk tumors between training and 1st validation set. Low- and high-risk tumors were enriched for favorable and unfavorable molecular subtypes and pathways, previously defined in the public 2nd validation set. CONCLUSIONS/SIGNIFICANCE: Integration of previously generated cancer microarray datasets may lead to robust and widely applicable survival predictors. These predictors are not simply a compilation of prognostic genes but appear to track true molecular phenotypes of good- and poor-outcome

    A sparse PLS for variable selection when integrating omics data

    Recent biotechnology advances allow for multiple types of omics data, such as transcriptomic, proteomic or metabolomic data sets to be integrated. The problem of feature selection has been addressed several times in the context of classification, but needs to be handled in a specific manner when integrating data. In this study, we focus on the integration of two-block data that are measured on the same samples. Our goal is to combine integration and simultaneous variable selection of the two data sets in a one-step procedure using a Partial Least Squares regression (PLS) variant to facilitate the biologists' interpretation. A novel computational methodology called "sparse PLS" is introduced for a predictive analysis to deal with these newly arisen problems. The sparsity of our approach is achieved with a Lasso penalization of the PLS loading vectors when computing the Singular Value Decomposition. Sparse PLS is shown to be effective and biologically meaningful. Comparisons with classical PLS are performed on a simulated data set and on real data sets. On one data set, a thorough biological interpretation of the obtained results is provided. We show that sparse PLS provides a valuable variable selection tool for highly dimensional data sets. Copyright ©2008 The Berkeley Electronic Press. All rights reserved
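The phrase "Lasso penalization of the PLS loading vectors when computing the Singular Value Decomposition" can be made concrete with a small sketch. This is an illustration under stated assumptions, not the paper's algorithm: it handles only the first pair of loading vectors, takes the cross-product matrix `M` (standing in for X^T Y) as input, and uses fixed penalties `lam_u` and `lam_v` chosen by hand. All function and parameter names are hypothetical. The Lasso penalty appears as a soft-thresholding step inside an alternating, power-iteration-style update of the two loading vectors.

```python
import math

def soft_threshold(vec, lam):
    """Shrink every entry towards zero by lam; small entries become exactly 0."""
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in vec]

def unit(vec):
    """Scale to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def sparse_loadings(M, lam_u, lam_v, n_iter=100):
    """First pair of sparse loading vectors for a p x q matrix M.

    Alternates u <- soft(M v, lam_u) and v <- soft(M^T u, lam_v),
    renormalising after each step. With lam_u = lam_v = 0 this reduces
    to ordinary power iteration for the leading singular vectors.
    """
    p, q = len(M), len(M[0])
    v = unit([1.0] * q)                   # arbitrary starting vector
    u = [0.0] * p
    for _ in range(n_iter):
        Mv = [sum(M[i][j] * v[j] for j in range(q)) for i in range(p)]
        u = unit(soft_threshold(Mv, lam_u))
        Mtu = [sum(M[i][j] * u[i] for i in range(p)) for j in range(q)]
        v = unit(soft_threshold(Mtu, lam_v))
    return u, v
```

Variables whose loading is driven to exactly zero are dropped from the component, which is how variable selection and integration happen in one step. In practice the penalties would be tuned, for example by cross-validation, rather than fixed.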

    Systematic Interpretation of High-Throughput Biological Data


    Dissimilarity functions for rank-invariant hierarchical clustering of continuous variables

    A theoretical framework is presented for a (copula-based) notion of dissimilarity between continuous random vectors and its main properties are studied. The proposed dissimilarity assigns the smallest value to a pair of random vectors that are comonotonic. Various properties of this dissimilarity are studied, with special attention to those relevant to hierarchical agglomerative methods, such as reducibility. Some insights are provided for the use of such a measure in clustering algorithms and a simulation study is presented. Real case studies illustrate the main features of the whole methodology. Comment: 38 pages, 10 figures, 7 tables. Authors: Fuchs, S.; Di Lascio, F. M. L.; Durante, F.
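To give a feel for what a rank-invariant dissimilarity looks like, here is a deliberately simple pairwise example. It is not the dissimilarity defined in the paper, which applies to random vectors. It is a Spearman-based stand-in, assuming continuous data with no ties, that shares two of the properties mentioned above: it depends only on ranks, and it takes its smallest value (zero) when one variable is an increasing function of the other. The function names are hypothetical.

```python
def ranks(values):
    """Ranks 1..n of the observations (assumes no ties, as for continuous data)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_dissimilarity(x, y):
    """(1 - rho_S) / 2, where rho_S is Spearman's rank correlation.

    Equals 0 for perfectly concordant (comonotonic) samples and 1 for
    perfectly discordant ones. Because only ranks are used, applying a
    strictly increasing transformation to x or y leaves it unchanged.
    """
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    rho = 1 - 6 * d2 / (n * (n * n - 1))
    return (1 - rho) / 2
```

A matrix of such pairwise values could be handed to any agglomerative clustering routine. Whether the linkage behaves well with it is the kind of question (for example reducibility) that the abstract says the paper studies.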

    A primer on correlation-based dimension reduction methods for multi-omics analysis

    The continuing advances of omic technologies mean that it is now more feasible to measure the numerous features collectively reflecting the molecular properties of a sample. When multiple omic methods are used, statistical and computational approaches can exploit these large, connected profiles. Multi-omics is the integration of different omic data sources from the same biological sample. In this review, we focus on correlation-based dimension reduction approaches for single omic datasets, followed by methods for pairs of omics datasets, before detailing further techniques for three or more omic datasets. We also briefly detail network methods that can be used when three or more omic datasets are available and that complement correlation-oriented tools. To aid readers new to this area, these are all linked to relevant R packages that can implement these procedures. Finally, we discuss scenarios of experimental design and present road maps that simplify the selection of appropriate analysis methods. This review will help researchers navigate the emerging methods for multi-omics, integrate diverse omic datasets appropriately and embrace the opportunity of population multi-omics. Comment: 30 pages, 2 figures, 6 tables
