Abstract

Background: Classification studies using gene expression datasets are usually based on small numbers of samples and tens of thousands of genes. Selecting the genes that are important for distinguishing the sample classes being compared poses a challenging problem in high-dimensional data analysis. We describe a new procedure for selecting significant genes, recursive cluster elimination (RCE), as an alternative to recursive feature elimination (RFE). We tested this algorithm on six datasets and compared its performance with that of two related classification procedures that use RFE.

Results: We have developed a novel method for selecting significant genes in comparative gene expression studies. This method, which we refer to as SVM-RCE, combines K-means, a clustering method, to identify correlated gene clusters, with Support Vector Machines (SVMs), a supervised machine learning classification method, to identify and score (rank) those gene clusters for the purpose of classification. K-means is used initially to group genes into clusters. RCE is then applied to iteratively remove those clusters of genes that contribute the least to the classification performance. SVM-RCE identifies the clusters of correlated genes that are most significantly differentially expressed between the sample classes. Using gene clusters, rather than individual genes, improves supervised classification accuracy on the same data compared with SVM or Penalized Discriminant Analysis (PDA) with recursive feature elimination (SVM-RFE and PDA-RFE), which remove genes based on their individual discriminant weights.

Conclusion: SVM-RCE provides improved classification accuracy on complex microarray datasets compared with SVM-RFE or PDA-RFE applied to the same datasets. SVM-RCE identifies clusters of correlated genes that, when considered together, provide greater insight into the structure of the microarray data. Clustering genes for classification appears to result in some concomitant clustering of samples into subgroups. Our present implementation of SVM-RCE groups genes using the correlation metric. The success of the SVM-RCE method in classification suggests that gene interaction networks or other biologically relevant metrics that group genes based on functional parameters might also be useful.
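The cluster-then-eliminate loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`zscore`, `kmeans_genes`, `train_linear_svm`, `loo_accuracy`, `svm_rce`) are hypothetical, a small Pegasos-style subgradient solver stands in for a library SVM, each cluster is scored by leave-one-out accuracy (the abstract does not specify the scoring protocol), and z-scored profiles with Euclidean K-means are used as a stand-in for the correlation metric. Genes are clustered once and one cluster is dropped per step.

```python
import random

def zscore(profile):
    """Standardize one gene's expression profile across samples."""
    n = len(profile)
    mean = sum(profile) / n
    sd = (sum((v - mean) ** 2 for v in profile) / n) ** 0.5 or 1.0
    return [(v - mean) / sd for v in profile]

def kmeans_genes(profiles, k, iters=25, seed=0):
    """Plain K-means over gene profiles; returns non-empty clusters of gene indices."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(profiles, k)]

    def nearest(p):
        return min(range(k),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))

    for _ in range(iters):
        labels = [nearest(p) for p in profiles]
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    clusters = {}
    for gene, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(gene)
    return list(clusters.values())

def train_linear_svm(X, y, lam=0.01, epochs=50):
    """Linear SVM fitted by Pegasos-style subgradient descent (assumed solver); y in {-1, +1}."""
    w, b, t = [0.0] * len(X[0]), 0.0, 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1 - eta * lam) for wj in w]      # regularization shrink
            if margin < 1:                              # hinge loss is active
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b

def loo_accuracy(X, y):
    """Leave-one-out accuracy of the linear SVM (assumed cluster score)."""
    hits = 0
    for i in range(len(X)):
        w, b = train_linear_svm(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        pred = 1 if sum(wj * xj for wj, xj in zip(w, X[i])) + b >= 0 else -1
        hits += pred == y[i]
    return hits / len(X)

def svm_rce(expr, y, k=4, n_final=2):
    """expr[s][g] is the expression of gene g in sample s; returns surviving gene clusters."""
    n_samples, n_genes = len(expr), len(expr[0])
    # Profiles are z-scored over all samples; a careful evaluation would do this per fold.
    profiles = [zscore([row[g] for row in expr]) for g in range(n_genes)]
    clusters = kmeans_genes(profiles, k)

    def score(genes):
        X = [[profiles[g][s] for g in genes] for s in range(n_samples)]
        return loo_accuracy(X, y)

    while len(clusters) > n_final:
        clusters.remove(min(clusters, key=score))       # drop the weakest cluster
    return clusters

# Toy data invented for illustration: 8 samples, 6 genes; genes 0-2 track the class label.
rng = random.Random(1)
y = [1, 1, 1, 1, -1, -1, -1, -1]
expr = [[(2.0 * lab if g < 3 else 0.0) + rng.gauss(0, 0.3) for g in range(6)] for lab in y]
kept = svm_rce(expr, y, k=3, n_final=2)
print(kept)
```

Scoring whole clusters rather than single genes is the point of contrast with RFE: a cluster is kept or dropped as a unit, so correlated genes survive together. A fuller version would likely use an established SVM library and an outer cross-validation loop to report accuracy.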

Jung, Segun

Showe, Louise C

Showe, Michael K

Yousef, Malik

English

PubMed


Directory of Open Access Journals

BMC Bioinformatics

Recursive Cluster Elimination (RCE) for classification and feature selection from gene expression data

Springer - Publisher Connector


http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1877816
