922 research outputs found

Nonnegative principal component analysis for mass spectral serum profiles and biomarker discovery

Author: D Mantini
E Petricoin
Henry Han
HW Ressom
J Nocedal
JS Yu
KR Coombes
M Gonen
M Hauskrecht
P Hoyer
R Lilien
R Zass
T Alexandrov
V Vapnik
X Han
X Han
Publication venue: BioMed Central
Publication date: 01/01/2010

Abstract. Background: As a novel cancer diagnostic paradigm, mass spectroscopic serum proteomic pattern diagnostics has been reported to be superior to conventional serologic cancer biomarkers. However, its clinical use has not yet been fully validated. An important factor preventing this young technology from becoming a mainstream cancer diagnostic paradigm is that robustly identifying cancer molecular patterns from high-dimensional protein expression data is still a challenge in machine learning and oncology research. As a well-established dimension reduction technique, PCA is widely integrated in pattern recognition analysis to discover cancer molecular patterns. However, its global feature selection mechanism prevents it from capturing local features. This may make high-performance proteomic pattern discovery difficult, because only features that describe global data behavior are used to train a learning machine. Methods: In this study, we develop a nonnegative principal component analysis algorithm and present a nonnegative principal component analysis based support vector machine algorithm with sparse coding to conduct high-performance proteomic pattern classification. Moreover, we propose a nonnegative principal component analysis based filter-wrapper biomarker capturing algorithm for mass spectral serum profiles. Results: We demonstrate the superiority of the proposed algorithm by comparison with six peer algorithms on four benchmark datasets. Moreover, we illustrate that nonnegative principal component analysis can be effectively used to capture meaningful biomarkers. Conclusion: Our analysis suggests that nonnegative principal component analysis effectively conducts local feature selection for mass spectral profiles and contributes to improved sensitivities and specificities in the subsequent classification, as well as to meaningful biomarker discovery.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Multi-resolution independent component analysis for high-performance tumor classification and biomarker discovery

Author: A Hyvärinen
A Martinez
B Schölkopf
BJ Boersma
CL Nutt
D Lee
D Singh
F Bach
Henry Han
I Jolliffe
J Brunet
K Milde-Langosch
K Yu
L van’t Veer
M Lacroix
N Holtkamp
N Iizuka
S Langer
S Mallat
V Vapnik
X Han
X Zhou
Xiao-Li Li
Y Gao
Y Wang
Y Wang
Publication venue: BioMed Central
Publication date: 01/01/2011

Abstract. Background: Although high-throughput microarray-based molecular diagnostic technologies show great promise in cancer diagnosis, they are still far from clinical application because of their low and unstable sensitivities and specificities in cancer molecular pattern recognition. In fact, high-dimensional and heterogeneous tumor profiles challenge current machine learning methodologies because of their small number of samples and large or even huge number of variables (genes). This naturally calls for effective feature selection in microarray data classification. Methods: We propose a novel feature selection method, multi-resolution independent component analysis (MICA), for large-scale gene expression data. This method overcomes the weak points of widely used transform-based feature selection methods such as principal component analysis (PCA), independent component analysis (ICA), and nonnegative matrix factorization (NMF) by avoiding their global feature-selection mechanism. In addition to demonstrating the effectiveness of multi-resolution independent component analysis in meaningful biomarker discovery, we present multi-resolution independent component analysis based support vector machines (MICA-SVM) and linear discriminant analysis (MICA-LDA) to attain high-performance classification in low-dimensional spaces. Results: We have demonstrated the superiority and stability of our algorithms through comprehensive experimental comparisons with nine state-of-the-art algorithms on six high-dimensional heterogeneous profiles under cross validation. Our classification algorithms, especially MICA-SVM, not only achieve clinical or near-clinical level sensitivities and specificities, but also show stronger performance stability than their peers in classification. Software that implements the major algorithm, along with the data sets on which this paper focuses, is freely available at https://sites.google.com/site/heyaumapbc2011/.
Conclusions: This work suggests a new direction for accelerating the move of microarray technologies into clinical routine by building a high-performance classifier that attains clinical-level sensitivities and specificities by treating an input profile as a ‘profile-biomarker’. The multi-resolution data analysis, which suppresses redundant global features and extracts effective local features, also has a positive impact on large-scale ‘omics’ data mining.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

ScholarBank@NUS

Sparse integrative clustering of multiple omics data sets

Author: Mo Qianxing
Shen Ronglai
Wang Sijian
Publication venue: Institute of Mathematical Statistics
Publication date: 13/02/2012

High-resolution microarrays and second-generation sequencing platforms are powerful tools to investigate genome-wide alterations in DNA copy number, methylation and gene expression associated with a disease. An integrated genomic profiling approach measures multiple omics data types simultaneously in the same set of biological samples. Such an approach renders an integrated data resolution that would not be available with any single data type. In this study, we use penalized latent variable regression methods for joint modeling of multiple omics data types to identify common latent variables that can be used to cluster patient samples into biologically and clinically relevant disease subtypes. We consider lasso [J. Roy. Statist. Soc. Ser. B 58 (1996) 267-288], elastic net [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005) 301-320] and fused lasso [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005) 91-108] methods to induce sparsity in the coefficient vectors, revealing important genomic features that contribute significantly to the latent variables. An iterative ridge regression is used to compute the sparse coefficient vectors. In model selection, a uniform design [Monographs on Statistics and Applied Probability (1994) Chapman & Hall] is used to seek "experimental" points that are scattered uniformly across the search domain for efficient sampling of tuning parameter combinations. We compared our method to sparse singular value decomposition (SVD) and penalized Gaussian mixture model (GMM) using both real and simulated data sets. The proposed method is applied to integrate genomic, epigenomic and transcriptomic data for subtype analysis in breast and lung cancer data sets. Comment: Published at http://dx.doi.org/10.1214/12-AOAS578 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org

arXiv.org e-Print Archive

PubMed Central

Collection Of Biostatistics Research Archive
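
For orientation, the three sparsity penalties named in the abstract of "Sparse integrative clustering of multiple omics data sets" are usually written as below. This is a sketch in textbook notation, not the paper's own parameterization: the coefficient vector \beta \in \mathbb{R}^p and the tuning parameters \lambda, \lambda_1, \lambda_2 \ge 0 are assumed symbols that do not appear in the record.

```latex
% Standard forms of the penalties (assumed notation: \beta, \lambda, \lambda_1, \lambda_2)
\begin{align*}
P_{\text{lasso}}(\beta)       &= \lambda \sum_{j=1}^{p} |\beta_j| \\
P_{\text{elastic net}}(\beta) &= \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \\
P_{\text{fused lasso}}(\beta) &= \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=2}^{p} |\beta_j - \beta_{j-1}|
\end{align*}
```

The fused lasso term penalizes differences between neighbouring coefficients, so it suits features with a natural ordering; the other two penalties are order-free.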

Understanding Protein-Ligand Interactions Using Simulated Annealing in Dimensionally Reduced Fingerprint Representation

Author: Ravi K. Nandigam
Sangtae Kim
Publication venue: IntechOpen
Publication date: 28/02/2011

IntechOpen

Clustering by non-negative matrix factorization with independent principal component initialization

Author: Gong Liyun
K. Nandi Asoke
Publication date: 09/09/2013

Non-negative matrix factorization (NMF) is a dimensionality reduction and clustering method that has been applied in many areas, such as bioinformatics and face image classification. Building on traditional NMF, researchers have recently put forward several new initialization algorithms to improve its performance. In this paper, we explore the clustering performance of the NMF algorithm, with emphasis on the initialization problem. We propose an initialization method for NMF based on independent principal component analysis (IPCA). The experiments were carried out on four real datasets, and the results showed that the IPCA-based initialization of NMF yields better clustering of the datasets than both random and PCA-based initializations.

University of Lincoln Institutional Repository
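
The last record above concerns how the starting factors affect NMF clustering. As a rough illustration of why initialization is a separate, pluggable step, the following Python sketch runs the standard multiplicative-update NMF from two different starting points: a random one and a simple SVD-based one that takes absolute values of the leading singular vectors. It does not implement the IPCA initialization from the paper; the function names (`random_init`, `svd_abs_init`, `nmf`, `cluster_labels`, `frob_error`), the synthetic data, and the features-by-samples layout are all assumptions made for this example.

```python
import numpy as np

def random_init(V, k, seed=0):
    """Random positive starting factors (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    return rng.random((n, k)) + 1e-3, rng.random((k, m)) + 1e-3

def svd_abs_init(V, k):
    """Crude PCA/SVD-style start: absolute values of the top-k singular
    vectors, scaled by sqrt of the singular values. A stand-in only,
    NOT the IPCA method described in the abstract."""
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    root = np.sqrt(s[:k])
    W = np.abs(U[:, :k]) * root           # scale columns of W
    H = root[:, None] * np.abs(Vt[:k])    # scale rows of H
    return W + 1e-3, H + 1e-3             # small offset avoids exact zeros

def nmf(V, W, H, n_iter=200, eps=1e-10):
    """Multiplicative updates for min ||V - W H||_F with W, H >= 0."""
    W, H = W.copy(), H.copy()
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def cluster_labels(H):
    """Assign each sample (column of V) to its largest component in H."""
    return H.argmax(axis=0)

def frob_error(V, W, H):
    """Frobenius norm of the reconstruction residual."""
    return np.linalg.norm(V - W @ H)

# Synthetic non-negative data: 30 features x 20 samples in two groups.
rng = np.random.default_rng(42)
V = rng.random((30, 20)) * 0.1
V[:15, :10] += 1.0   # first group is strong on the first 15 features
V[15:, 10:] += 1.0   # second group is strong on the last 15 features

for name, (W0, H0) in {"random": random_init(V, 2),
                       "svd-abs": svd_abs_init(V, 2)}.items():
    W, H = nmf(V, W0, H0)
    print(name, "error:", frob_error(V, W, H), "labels:", cluster_labels(H))
```

Because `nmf` accepts the starting factors as arguments, any other initializer that returns a non-negative `(W, H)` pair could be swapped in without touching the update loop.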