19 research outputs found

    A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets

    BACKGROUND: Gene selection is an important step when building predictors of disease state based on gene expression data. Gene selection generally improves performance and identifies a relevant subset of genes. Many univariate and multivariate gene selection approaches have been proposed. Frequently the claim is made that genes are co-regulated (due to pathway dependencies) and that multivariate approaches are therefore by definition more desirable than univariate selection approaches. Based on the published performances of all these approaches, a fair comparison of the available results cannot be made. This mainly stems from two factors. First, the results are often biased, since the validation set is in one way or another involved in training the predictor, resulting in optimistically biased performance estimates. Second, the published results are often based on a small number of relatively simple datasets. Consequently, no generally applicable conclusions can be drawn. RESULTS: In this study we adopted an unbiased protocol to perform a fair comparison of frequently used multivariate and univariate gene selection techniques, in combination with a range of classifiers. Our conclusions are based on seven gene expression datasets, across several cancer types. CONCLUSION: Our experiments illustrate that, contrary to several previous studies, in five of the seven datasets univariate selection approaches yield consistently better results than multivariate approaches. The simplest multivariate selection approach, the Top Scoring method, achieves the best results on the remaining two datasets. We conclude that the correlation structures, if present, are difficult to extract due to the small number of samples, and that consequently, overly complex gene selection algorithms that attempt to extract these structures are prone to overtraining.
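As a concrete illustration of the univariate selection this study compares, a per-gene t-statistic ranking can be sketched as follows. This is a generic sketch on synthetic data; the function name, parameters, and data are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def univariate_rank(X, y, k):
    """Rank genes by absolute two-sample t-statistic and return the
    indices of the top k. X: (n_samples, n_genes); y: binary labels."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) +
                 b.var(axis=0, ddof=1) / len(b))
    t = np.abs(a.mean(axis=0) - b.mean(axis=0)) / (se + 1e-12)
    return np.argsort(t)[::-1][:k]

# Synthetic example: 40 samples, 200 genes, the first 5 genes informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))
y = np.repeat([0, 1], 20)
X[y == 1, :5] += 2.0
top = univariate_rank(X, y, k=10)
```

With a strong class shift, the informative genes dominate the ranking; with small samples and weak signal, the ranking becomes unstable, which is exactly the regime the study examines.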

    The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures

    Motivation: Biomarker discovery from high-dimensional data is a crucial problem with enormous applications in biology and medicine. It is also extremely challenging from a statistical viewpoint, but surprisingly few studies have investigated the relative strengths and weaknesses of the plethora of existing feature selection methods. Methods: We compare 32 feature selection methods on 4 public gene expression datasets for breast cancer prognosis, in terms of predictive performance, stability and functional interpretability of the signatures they produce. Results: We observe that the feature selection method has a significant influence on the accuracy, stability and interpretability of signatures. Simple filter methods generally outperform more complex embedded or wrapper methods, and ensemble feature selection generally has no positive effect. Overall, a simple Student's t-test seems to provide the best results. Availability: Code and data are publicly available at http://cbio.ensmp.fr/~ahaury/
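The stability criterion discussed above is commonly quantified as the overlap between signatures selected on resampled versions of the data. A minimal sketch, assuming a simple mean-difference filter and bootstrap resampling with pairwise Jaccard overlap (illustrative choices, not the authors' exact protocol):

```python
import numpy as np

def top_k_by_mean_diff(X, y, k=5):
    """Simple filter: rank genes by absolute difference of class means."""
    score = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    return np.argsort(score)[::-1][:k]

def signature_stability(X, y, k=5, n_boot=20, seed=0):
    """Mean pairwise Jaccard overlap of signatures selected on bootstrap
    resamples -- one common way to quantify signature stability."""
    rng = np.random.default_rng(seed)
    n = len(y)
    sigs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)           # bootstrap resample
        sigs.append(set(map(int, top_k_by_mean_diff(X[idx], y[idx], k))))
    pairs = [(s, t) for i, s in enumerate(sigs) for t in sigs[i + 1:]]
    return sum(len(s & t) / len(s | t) for s, t in pairs) / len(pairs)

# Synthetic data with five strongly informative genes.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 100))
y = np.repeat([0, 1], 30)
X[y == 1, :5] += 2.0
stab = signature_stability(X, y, k=5)
```

A stability of 1.0 means every resample selects the same signature; values near 0 indicate the signature is essentially arbitrary.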

    Approaches to working in high-dimensional data spaces: gene expression microarrays

    This review provides a focused summary of the implications of high-dimensional data spaces produced by gene expression microarrays for building better models of cancer diagnosis, prognosis, and therapeutics. We identify the unique challenges posed by high dimensionality, highlight the methodological problems they create, and discuss recent methods in predictive classification, unsupervised subclass discovery, and marker identification.

    Microarray-based cancer prediction using single genes

    Background: Although numerous methods of using microarray data analysis for cancer classification have been proposed, most utilize many genes to achieve accurate classification. This can hamper interpretability of the models and ease of translation to other assay platforms. We explored the use of single genes to construct classification models. We first identified the genes with the most powerful univariate class discrimination ability and then constructed simple classification rules for class prediction using the single genes. Results: We applied our model development algorithm to eleven cancer gene expression datasets and compared classification accuracy to that for standard methods including Diagonal Linear Discriminant Analysis, k-Nearest Neighbor, Support Vector Machine and Random Forest. The single gene classifiers provided classification accuracy comparable to or better than those obtained by existing methods in most cases. We analyzed the factors that determined when simple single gene classification is effective and when more complex modeling is warranted. Conclusions: For most of the datasets examined, the single-gene classification methods appear to work as well as more standard methods, suggesting that simple models could perform well in microarray-based cancer prediction.
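A single-gene classifier of the kind described can be sketched as: pick the gene with the strongest univariate discrimination, then threshold at the midpoint of the two class means on that gene. This is an illustrative reconstruction on synthetic data, not the authors' exact rule:

```python
import numpy as np

def fit_single_gene_classifier(X, y):
    """Select the gene with the largest |t|-statistic and classify new
    samples by thresholding at the midpoint of the two class means."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(0, ddof=1) / len(a) + b.var(0, ddof=1) / len(b))
    g = int(np.argmax(np.abs(a.mean(0) - b.mean(0)) / (se + 1e-12)))
    thr = (a[:, g].mean() + b[:, g].mean()) / 2.0
    high_class = int(b[:, g].mean() > a[:, g].mean())  # class above threshold
    def predict(Xnew):
        above = Xnew[:, g] > thr
        return np.where(above, high_class, 1 - high_class)
    return g, predict

# Synthetic data in which gene 3 carries the class signal.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 100))
y = np.repeat([0, 1], 30)
X[y == 1, 3] += 4.0
g, predict = fit_single_gene_classifier(X, y)
```

The appeal of such rules, as the abstract notes, is that a one-gene threshold translates directly to other assay platforms.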

    Gene selection and classification in autism gene expression data

    Autism spectrum disorders (ASD) are neurodevelopmental disorders that are currently diagnosed on the basis of abnormal stereotyped behaviour as well as observable deficits in communication and social functioning. Although a variety of candidate genes have been attributed to the disorder, no single gene is applicable to more than 1–2% of the general ASD population. Despite extensive efforts, definitive genes that contribute to autism susceptibility have yet to be identified. The major problems in dealing with gene expression datasets of autism are the limited number of samples and the substantial noise arising from errors of experimental measurement and natural variation. In this study, a systematic combination of three filters, namely the t-test (TT), Wilcoxon Rank Sum (WRS) and Feature Correlation (COR), is applied along with an efficient wrapper algorithm based on geometric binary particle swarm optimization-support vector machine (GBPSO-SVM), aiming to select and classify the genes most strongly associated with autism. A new approach based on the criteria of median ratio, mean ratio and variance deviation is also applied to reduce the initial dataset before selection begins. The most discriminative genes identified in the first and last selection steps included a recurring gene (CAPS2), which was flagged as the strongest ASD risk gene. The fused subset of genes selected by the GBPSO-SVM algorithm increased the classification accuracy to about 92.10%, higher than previously reported in the literature for the same autism dataset. Notably, an ensemble based on random forest (RF) performed better than in previous studies, while an ensemble that employed SVM as an integrator of the fused genes from the output branches of GBPSO-SVM outperformed the RF integrator. The overall improvement is ascribed to the selection strategies used to reduce the dataset and to the efficient wrapper-based GBPSO-SVM algorithm.
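The wrapper component can be sketched as a minimal binary PSO over 0/1 gene masks. This is an illustrative reconstruction only: a nearest-centroid classifier stands in for the SVM fitness, and all parameters (swarm size, inertia, penalty weight) are assumptions, not those of the GBPSO-SVM paper:

```python
import numpy as np

def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier, used here
    as a lightweight stand-in for the SVM inside the wrapper."""
    n, correct = len(y), 0
    for i in range(n):
        m = np.arange(n) != i
        c0 = X[m][y[m] == 0].mean(axis=0)
        c1 = X[m][y[m] == 1].mean(axis=0)
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        correct += int(pred == y[i])
    return correct / n

def binary_pso_select(X, y, n_particles=12, n_iter=20, seed=0):
    """Minimal binary PSO wrapper: each particle is a 0/1 gene mask;
    fitness rewards accuracy and mildly penalizes subset size."""
    rng = np.random.default_rng(seed)
    g = X.shape[1]
    def fitness(mask):
        return loo_accuracy(X[:, mask], y) - 0.01 * mask.sum() / g if mask.any() else 0.0
    pos = rng.random((n_particles, g)) < 0.5
    vel = rng.normal(scale=0.1, size=(n_particles, g))
    pbest, pbest_f = pos.copy(), np.array([fitness(p) for p in pos])
    gbest = pbest[np.argmax(pbest_f)].copy()
    gbest_f = pbest_f.max()
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, g))
        vel = (0.7 * vel
               + 1.5 * r1 * (pbest.astype(float) - pos.astype(float))
               + 1.5 * r2 * (gbest.astype(float) - pos.astype(float)))
        pos = rng.random((n_particles, g)) < 1.0 / (1.0 + np.exp(-vel))  # sigmoid
        f = np.array([fitness(p) for p in pos])
        better = f > pbest_f
        pbest[better], pbest_f[better] = pos[better], f[better]
        if pbest_f.max() > gbest_f:
            gbest, gbest_f = pbest[np.argmax(pbest_f)].copy(), pbest_f.max()
    return np.flatnonzero(gbest)

# Synthetic data with two informative genes among twenty.
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 20))
y = np.repeat([0, 1], 15)
X[y == 1, :2] += 3.5
sel = binary_pso_select(X, y)
```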

    A multiple-filter-multiple-wrapper approach to gene selection and microarray data classification

    Filters and wrappers are two prevailing approaches for gene selection in microarray data analysis. Filters make use of statistical properties of each gene to represent its discriminating power between different classes. The computation is fast but the predictions are inaccurate. Wrappers make use of a chosen classifier to select genes by maximizing classification accuracy, but the computation burden is formidable. Filters and wrappers have been combined in previous studies to maximize the classification accuracy for a chosen classifier with respect to a filtered set of genes. The drawback of this single-filter-single-wrapper (SFSW) approach is that the classification accuracy is dependent on the choice of specific filter and wrapper. In this paper, a multiple-filter-multiple-wrapper (MFMW) approach is proposed that makes use of multiple filters and multiple wrappers to improve the accuracy and robustness of the classification, and to identify potential biomarker genes. Experiments based on six benchmark data sets show that the MFMW approach outperforms SFSW models (generated by all combinations of filters and wrappers used in the corresponding MFMW model) in all cases and for all six data sets. Some of MFMW-selected genes have been confirmed to be biomarkers or contribute to the development of particular cancers by other studies. © 2006 IEEE.
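The single-filter-single-wrapper (SFSW) baseline that MFMW generalizes can be sketched as a t-statistic filter followed by a greedy forward-selection wrapper. The choices here are illustrative: a nearest-centroid classifier stands in for the wrapped classifier, and the filter size and subset size are arbitrary:

```python
import numpy as np

def loo_nc_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    n, correct = len(y), 0
    for i in range(n):
        m = np.arange(n) != i
        c0 = X[m][y[m] == 0].mean(axis=0)
        c1 = X[m][y[m] == 1].mean(axis=0)
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        correct += int(pred == y[i])
    return correct / n

def sfsw_select(X, y, m=20, k=3):
    """Filter stage: keep the top m genes by |t|-statistic.
    Wrapper stage: greedily add the gene that most improves LOO accuracy."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(0, ddof=1) / len(a) + b.var(0, ddof=1) / len(b))
    t = np.abs(a.mean(0) - b.mean(0)) / (se + 1e-12)
    pool = [int(i) for i in np.argsort(t)[::-1][:m]]
    chosen = []
    for _ in range(k):
        best = max(pool, key=lambda gi: loo_nc_accuracy(X[:, chosen + [gi]], y))
        chosen.append(best)
        pool.remove(best)
    return chosen

# Synthetic data with three informative genes among one hundred.
rng = np.random.default_rng(4)
X = rng.normal(size=(40, 100))
y = np.repeat([0, 1], 20)
X[y == 1, :3] += 2.5
sel = sfsw_select(X, y)
```

The MFMW idea is to run many such filter/wrapper pairs and fuse their selections, so that the result is less sensitive to any one choice.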

    Effect of training-sample size and classification difficulty on the accuracy of genomic predictors

    Introduction: As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints. Methods: We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set. Results: A ranking of the three classification problems was obtained, and the performance of 120 models was estimated and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models. Conclusions: We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations on univariate feature-selection methods and choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem.
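The two resampling estimators compared above can be sketched as follows, with a nearest-centroid classifier as an illustrative stand-in for the study's predictors (fold counts, resample counts, and data are assumptions):

```python
import numpy as np

def nc_predict(Xtr, ytr, Xte):
    """Nearest-centroid prediction (a minimal stand-in classifier)."""
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    return (np.linalg.norm(Xte - c1, axis=1) <
            np.linalg.norm(Xte - c0, axis=1)).astype(int)

def cv_accuracy(X, y, folds=5, seed=0):
    """k-fold cross-validation accuracy estimate."""
    idx = np.random.default_rng(seed).permutation(len(y))
    accs = []
    for f in range(folds):
        te = idx[f::folds]
        tr = np.setdiff1d(idx, te)
        accs.append(float((nc_predict(X[tr], y[tr], X[te]) == y[te]).mean()))
    return float(np.mean(accs))

def bootstrap_accuracy(X, y, n_boot=50, seed=0):
    """Out-of-bag bootstrap estimate: train on a resample with replacement,
    test on the samples that were left out of it."""
    rng = np.random.default_rng(seed)
    n, accs = len(y), []
    for _ in range(n_boot):
        tr = rng.integers(0, n, size=n)
        oob = np.setdiff1d(np.arange(n), tr)
        if oob.size and len(np.unique(y[tr])) == 2:
            accs.append(float((nc_predict(X[tr], y[tr], X[oob]) == y[oob]).mean()))
    return float(np.mean(accs))

# Synthetic, moderately separable data.
rng = np.random.default_rng(5)
X = rng.normal(size=(60, 30))
y = np.repeat([0, 1], 30)
X[y == 1, :3] += 2.0
cv_est = cv_accuracy(X, y)
boot_est = bootstrap_accuracy(X, y)
```

On real data the two estimates can diverge; the MAQC-II analysis above found the bootstrap estimates closer to independent-validation performance.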

    Top scoring pairs for feature selection in machine learning and applications to cancer outcome prediction

    Background: The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on the relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally, the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers. Results: We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. Compared with other feature selection methods, such as a univariate method similar to Fisher's discriminant criterion (Fisher) or recursive feature elimination embedded in SVM (RFE), TSP becomes increasingly more effective as the informative genes become progressively more correlated, both in terms of classification performance and the ability to recover the true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms the k-TSP classifier in all datasets and achieves either comparable or superior performance to SVM alone. In concurrence with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets. Conclusions: The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with the k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that, as a feature selector, it is better tuned to certain data characteristics, i.e. correlations among informative genes, which makes it potentially interesting as an alternative feature ranking method in pathway analysis.
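The pair score that drives k-TSP feature selection can be sketched directly: for each gene pair, compare the probability that one gene's expression exceeds the other's across the two classes, and keep the pair whose ordering flips most consistently. A minimal illustration on synthetic data, not the published implementation:

```python
import numpy as np

def top_scoring_pair(X, y):
    """Return the gene pair (i, j) maximizing the TSP score
    |P(X_i < X_j | y = 0) - P(X_i < X_j | y = 1)|."""
    a, b = X[y == 0], X[y == 1]
    # p[i, j] = fraction of samples in the class with gene i below gene j.
    p0 = (a[:, :, None] < a[:, None, :]).mean(axis=0)
    p1 = (b[:, :, None] < b[:, None, :]).mean(axis=0)
    score = np.abs(p0 - p1)
    i, j = np.unravel_index(int(np.argmax(score)), score.shape)
    return int(i), int(j), float(score[i, j])

# Synthetic data: the ordering of genes 0 and 1 reverses between classes.
rng = np.random.default_rng(6)
X = rng.normal(size=(60, 40))
y = np.repeat([0, 1], 30)
X[y == 0, 0] -= 2.0
X[y == 0, 1] += 2.0
X[y == 1, 0] += 2.0
X[y == 1, 1] -= 2.0
i, j, s = top_scoring_pair(X, y)
```

Because the score depends only on within-sample rank order, it is invariant to monotone normalization of each array, which is part of the method's appeal.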

    Small patient datasets reveal genetic drivers of non-small cell lung cancer subtypes using machine learning for hypothesis generation

    Aim: Many small datasets of significant value exist in the medical space that are being underutilized. Due to the heterogeneity of complex disorders found in oncology, systems capable of discovering patient subpopulations while elucidating etiologies are of great value as they can indicate leads for innovative drug discovery and development. Methods: Two small non-small cell lung cancer (NSCLC) datasets (GSE18842 and GSE10245) consisting of 58 samples of adenocarcinoma (ADC) and 45 samples of squamous cell carcinoma (SCC) were used in a machine intelligence framework to identify genetic biomarkers differentiating these two subtypes. Utilizing a set of standard machine learning (ML) methods, subpopulations of ADC and SCC were uncovered while simultaneously extracting which genes, in combination, were significantly involved in defining the subpopulations. A previously described interactive hypothesis-generating method designed to work with ML methods was employed to provide an alternative way of extracting the most important combination of variables to construct a new data set. Results: Several genes were uncovered that were previously implicated by other methods. This framework accurately discovered known subpopulations, such as genetic drivers associated with differing levels of aggressiveness within the SCC and ADC subtypes. Furthermore, phosphatidylinositol glycan anchor biosynthesis, class X (PIGX) was a novel gene implicated in this study that warrants further investigation due to its role in breast cancer proliferation. Conclusions: The ability to learn from small datasets was highlighted and revealed well-established properties of NSCLC. This showcases the utility of ML techniques to reveal potential genes of interest, even from small datasets, shedding light on novel driving factors behind subpopulations of patients.