Abstract

Background: The bioinformatics data analysis toolbox needs general-purpose, fast, and easily interpretable preprocessing tools that perform data integration during exploratory data analysis. Our focus is on vector-valued data sources, each consisting of measurements of the same entity but on different variables, and on tasks where source-specific variation is considered noisy or uninteresting. Principal component analysis of all sources combined is an obvious choice if it is not important to distinguish between source-specific and shared variation. Canonical Correlation Analysis (CCA) focuses on mutual dependencies and discards source-specific "noise", but it produces a separate set of components for each source.

Results: It turns out that the components given by CCA can be combined easily to produce a linear, and hence fast and easily interpretable, feature extraction method. The method fuses several sources so that the properties they share are preserved. Source-specific variation is discarded as uninteresting. We give the details and implement them in a software tool. The method is demonstrated on gene expression measurements in three case studies: classification of cell cycle regulated genes in yeast, identification of differentially expressed genes in leukemia, and defining stress response in yeast. The software package is available at http://www.cis.hut.fi/projects/mi/software/drCCA/.

Conclusion: We introduced a method for data fusion in exploratory data analysis, for cases where statistical dependencies between the sources, rather than within a source, are of interest. The method uses canonical correlation analysis in a new way for dimensionality reduction, and inherits its good properties of being simple, fast, and easily interpretable as a linear projection.
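The abstract does not spell out how the CCA components are combined, so the following is only a minimal sketch under stated assumptions, not the drCCA package's implementation or API. It assumes one standard formulation of generalized CCA: whiten each source separately (which removes within-source covariance structure), concatenate the whitened sources, and run PCA on the result, so that the leading directions reflect between-source dependencies. The function names `whiten` and `fuse_sources`, the tolerance `eps`, and the toy data are all hypothetical.

```python
import numpy as np


def whiten(X, eps=1e-8):
    """Center X (samples x features) and transform it to identity covariance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)
    vals, vecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    keep = vals > eps * vals.max()             # drop near-singular directions
    W = vecs[:, keep] / np.sqrt(vals[keep])    # whitening matrix
    return Xc @ W


def fuse_sources(sources, n_components):
    """Assumed fusion step: PCA on the concatenation of whitened sources.

    This is one common way to solve generalized CCA; it is an assumption
    here, not a procedure confirmed by the abstract.
    """
    Z = np.hstack([whiten(X) for X in sources])        # columns are already centered
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)   # rows of Vt: principal directions
    return Z @ Vt[:n_components].T                     # a single linear projection


# Toy data (hypothetical): two sources share one latent signal z,
# and each adds its own independent noise.
rng = np.random.default_rng(0)
n = 500
z = rng.standard_normal(n)
X1 = np.outer(z, [1.0, 0.5, -1.0]) + 0.3 * rng.standard_normal((n, 3))
X2 = np.outer(z, [0.8, -0.6, 1.0, 0.2]) + 0.3 * rng.standard_normal((n, 4))

fused = fuse_sources([X1, X2], n_components=2)
corr = abs(np.corrcoef(fused[:, 0], z)[0, 1])
print(f"|corr(first fused component, shared signal)| = {corr:.3f}")
```

Because both steps are linear, the whole fusion is a single linear map of the concatenated sources, which is what makes the result fast to compute and straightforward to interpret.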

Abhishek Tripathi

Arto Klami

Samuel Kaski

Simple integrative preprocessing preserves what is shared in data sources

BMC Bioinformatics
