
    Novel Pattern Recognition Approaches to Identification of Gene-Expression Pathways in Banana Cultivars

    Bolstered resubstitution is a simple and fast error estimation method that has been shown to perform better than cross-validation and comparably with bootstrap in small-sample settings. However, it has been observed that its performance can deteriorate in high-dimensional feature spaces. To overcome this issue, we propose here a modification of bolstered error estimation based on the principle of Naive Bayes. This estimator is simple to compute and is reducible under feature selection. In experiments using popular classification rules applied to data from a well-known breast cancer gene expression study, the new Naive-Bayes bolstered estimator outperformed the original one, as well as cross-validation and resubstitution, in high-dimensional target feature spaces (after feature selection); it was superior to the 0.632 bootstrap provided that the sample size was not too small.

    Model selection is the task of choosing a model with optimal complexity for the given data set. Most model selection criteria minimize the sum of a training error term and a complexity control term, that is, the complexity-penalized loss. We investigate replacing the training error with bolstered resubstitution in the penalized loss for model selection. Computer simulations indicate that the proposed method improves model selection in terms of choosing the correct model complexity.

    Besides applying the novel error estimation to model selection in pattern recognition, we also apply it to assess the performance of classifiers designed on the banana gene-expression data. Bananas are the world's most important fruit and a vital component of local diets in many countries. Diseases and drought are major threats to banana production. To generate disease- and drought-tolerant bananas, we need to identify disease- and drought-responsive genes and pathways. Towards this goal, we conducted RNA-Seq analysis with wild-type and transgenic banana, with and without inoculation or drought stress, and on different days after applying the stress. By combining several state-of-the-art computational models, we identified stress-responsive genes and pathways. The validation results for these genes in Arabidopsis are promising.
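    As a rough illustration of the bolstering idea described above, the following is a minimal Monte Carlo sketch of bolstered resubstitution with spherical Gaussian kernels. The kernel-spread heuristic and the classifier are placeholders, and the Naive-Bayes modification proposed in the abstract is not reproduced here.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def bolstered_resubstitution(clf, X, y, n_mc=200, rng=None):
    """Monte Carlo bolstered resubstitution error estimate (sketch).

    Each training point is replaced by a spherical Gaussian bolstering
    kernel centred on it; the estimate is the average kernel mass that
    the trained classifier assigns to the wrong label. The kernel spread
    below is a simple nearest-neighbour heuristic (assumes at least two
    samples per class), not the exact sizing rule used in the work.
    """
    rng = np.random.default_rng(rng)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n, d = X.shape

    # Per-class spread: mean distance to the nearest same-class neighbour.
    sigma = np.empty(n)
    for c in np.unique(y):
        Xc = X[y == c]
        dists = np.linalg.norm(Xc[:, None, :] - Xc[None, :, :], axis=-1)
        np.fill_diagonal(dists, np.inf)
        sigma[y == c] = dists.min(axis=1).mean() / np.sqrt(d)

    err = 0.0
    for i in range(n):
        samples = rng.normal(X[i], sigma[i], size=(n_mc, d))
        err += np.mean(clf.predict(samples) != y[i])
    return err / n

# Usage sketch: design on the full sample, then estimate its error.
# clf = LinearDiscriminantAnalysis().fit(X, y)
# print(bolstered_resubstitution(clf, X, y))
```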

    Which Is Better: Holdout or Full-Sample Classifier Design?

    Is it better to design a classifier and estimate its error on the full sample or to design a classifier on a training subset and estimate its error on the holdout test subset? Full-sample design provides the better classifier; nevertheless, one might choose holdout with the hope of better error estimation. A conservative criterion to decide the best course is to aim at a classifier whose error is less than a given bound. Then the choice between full-sample and holdout designs depends on which possesses the smaller expected bound. Using this criterion, we examine the choice between holdout and several full-sample error estimators using covariance models and a patient-data model. Full-sample design consistently outperforms holdout design. The relation between the two designs is revealed via a decomposition of the expected bound into the sum of the expected true error and the expected conditional standard deviation of the true error.
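    To make the comparison concrete, the following is a small simulation sketch under a toy spherical Gaussian model (a simplified setup, not the covariance or patient-data models used in the paper). It contrasts the expected true error of an LDA classifier designed on the full sample with one designed on a 50% training split, the situation in which the holdout error would be estimated on the remaining half.

```python
import numpy as np
from scipy.stats import norm
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def true_error_linear(clf, mu0, mu1):
    """Exact error of a linear rule sign(w.x + b) under two spherical
    Gaussian classes N(mu0, I) and N(mu1, I) with equal priors."""
    w, b = clf.coef_.ravel(), clf.intercept_[0]
    s = np.linalg.norm(w)
    return 0.5 * (norm.cdf((w @ mu0 + b) / s) + norm.cdf(-(w @ mu1 + b) / s))

def expected_true_errors(n=50, d=5, delta=2.0, n_rep=2000, seed=0):
    """Average true error of full-sample vs half-sample (holdout) LDA design."""
    rng = np.random.default_rng(seed)
    mu0, mu1 = np.zeros(d), np.full(d, delta / np.sqrt(d))
    full, half = [], []
    for _ in range(n_rep):
        X = np.vstack([rng.normal(mu0, 1.0, (n // 2, d)),
                       rng.normal(mu1, 1.0, (n // 2, d))])
        y = np.r_[np.zeros(n // 2), np.ones(n // 2)]
        full.append(true_error_linear(LinearDiscriminantAnalysis().fit(X, y), mu0, mu1))
        idx = rng.permutation(n)[: n // 2]   # design on half, hold out the rest
        half.append(true_error_linear(LinearDiscriminantAnalysis().fit(X[idx], y[idx]), mu0, mu1))
    return np.mean(full), np.mean(half)

# print(expected_true_errors())  # full-sample design yields the lower expected true error
```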

    Biomarker Discovery and Validation for Proteomics and Genomics: Modeling And Systematic Analysis

    Discovery and validation of protein biomarkers with high specificity is the main challenge of current proteomics studies. Different mass spectrometry models are used as shotgun tools for the discovery of biomarkers, which is usually done on a small number of samples. In the discovery phase, feature selection plays a key role. The first part of this work focuses on the feature selection problem and proposes a new branch-and-bound algorithm based on the U-curve assumption. The U-curve branch-and-bound algorithm (UBB) for optimization was introduced recently by Barrera and collaborators. In this work we introduce an improved algorithm (IUBB) for finding the optimal set of features under the U-curve assumption. The results for a set of U-curve problems generated from a cost model show that the IUBB algorithm makes fewer evaluations and is more robust than the original UBB algorithm. The two algorithms are also compared in finding the optimal features of a real classification problem designed using the data model. The results show that IUBB outperforms UBB in finding the optimal feature sets. They also indicate that the performance of the error estimator is crucial to the success of the feature selection algorithm. To study this effect, the next section of the work examines how the complexity of the decision boundary affects the performance of error estimation methods. First, a model is developed that quantifies the complexity of a classification problem purely in terms of the geometry of the decision boundary, without relying on the Bayes error. This model is then used in a simulation study to analyze the bias and root-mean-square error (RMS) of several widely used error estimation methods relative to the complexity of the decision boundary. The results show that all the estimation methods lose accuracy as complexity increases. Validation of a set of selected biomarkers from a list of candidates is an important stage in the biomarker identification pipeline and is the focus of the next section of this work. This section analyzes the Selected Reaction Monitoring (SRM) pipeline in a systematic fashion, by modelling the main stages of the biomarker validation process. The proposed models for SRM and the protein mixture are then used to study the effect of different parameters on the final performance of biomarker validation. We focus on the sensitivity of the SRM pipeline to its working parameters, in order to identify the bottlenecks where time and energy should be spent in designing the experiment.
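    The toy sketch below illustrates only the U-curve assumption itself, not the UBB/IUBB lattice algorithms studied in the work: along any chain of nested feature sets the cost is assumed to decrease and then increase, so a chain can be abandoned as soon as the cost goes up. The cost function in the usage comment is purely synthetic.

```python
import numpy as np

def chain_search(cost, n_features, n_chains=50, seed=0):
    """Toy search exploiting the U-curve assumption on random chains of
    nested feature sets (an illustration only; not the UBB or IUBB
    branch-and-bound algorithms).

    `cost` maps a frozenset of feature indices to a penalized error
    estimate assumed to be U-shaped along every chain of the lattice.
    """
    rng = np.random.default_rng(seed)
    best_set, best_cost = frozenset(), cost(frozenset())
    for _ in range(n_chains):
        current, current_cost = frozenset(), cost(frozenset())
        for f in rng.permutation(n_features):
            candidate = current | {int(f)}
            candidate_cost = cost(candidate)
            if candidate_cost > current_cost:
                break              # U-curve: the minimum on this chain has been passed
            current, current_cost = candidate, candidate_cost
        if current_cost < best_cost:
            best_set, best_cost = current, current_cost
    return best_set, best_cost

# Purely synthetic U-shaped cost (minimised by any subset of size 3):
# print(chain_search(lambda S: abs(len(S) - 3), n_features=10))
```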

    Quantification of the Impact of Feature Selection on the Variance of Cross-Validation Error Estimation

    Given the relatively small number of microarrays typically used in gene-expression-based classification, all of the data must be used to train a classifier, and therefore the same training data is used for error estimation. The key issue regarding the quality of an error estimator in the context of small samples is its accuracy, and this is most directly analyzed via the deviation distribution of the estimator, this being the distribution of the difference between the estimated and true errors. Past studies indicate that, given a prior set of features, cross-validation does not perform as well in this regard as some other training-data-based error estimators. The purpose of this study is to quantify the degree to which feature selection increases the variance of the deviation distribution beyond the variance present in the absence of feature selection. To this end, we propose the coefficient of relative increase in deviation dispersion (CRIDD), which gives the relative increase in the deviation-distribution variance when using feature selection as opposed to using an optimal feature set without feature selection. The contribution of feature selection to the variance of the deviation distribution can be significant, contributing over half of the variance in many of the cases studied. We consider linear discriminant analysis, 3-nearest-neighbor, and linear support vector machines for classification; sequential forward selection, sequential forward floating selection, and the t-test for feature selection; and k-fold and leave-one-out cross-validation for error estimation. We apply these to three feature-label models and patient data from a breast cancer study. In sum, the cross-validation deviation distribution is significantly flatter when there is feature selection, compared with the case when cross-validation is performed on a given feature set. This is reflected in the observed positive values of the CRIDD, which is defined to quantify the contribution of feature selection towards the deviation variance.
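    For reference, here is a minimal sketch of how a CRIDD-style quantity could be computed from two samples of deviations (estimated minus true error). The normalisation by the with-feature-selection variance is one reading of the abstract, not necessarily the paper's exact definition.

```python
import numpy as np

def cridd(dev_with_fs, dev_without_fs):
    """Relative increase in deviation-distribution variance due to
    feature selection (sketch). Inputs are arrays of deviations
    (estimated error minus true error) observed with feature selection
    and with a fixed optimal feature set. Under the normalisation
    assumed here, a value above 0.5 means feature selection accounts
    for more than half of the deviation variance.
    """
    v_fs = np.var(dev_with_fs)
    v_fixed = np.var(dev_without_fs)
    return (v_fs - v_fixed) / v_fs
```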

    A feature selection approach for identification of signature genes from SAGE data

    Background: One goal of gene expression profiling is to identify signature genes that robustly distinguish different types or grades of tumors. Several tumor classifiers based on expression profiling have been proposed using microarray techniques. Due to important differences in the probabilistic models of microarray and SAGE technologies, it is important to develop suitable techniques to select specific genes from SAGE measurements.

    Results: A new framework to select specific genes that distinguish different biological states based on the analysis of SAGE data is proposed. The framework applies the bolstered error for the identification of strong genes that separate the biological states in a feature space defined by the gene expression of a training set. Credibility intervals defined from a probabilistic model of SAGE measurements are used to identify, among all gene groups selected by the strong-genes method, the genes that distinguish the different states most reliably. A score taking into account the credibility and the bolstered error values is proposed to rank the groups of considered genes. Results obtained using SAGE data from gliomas are presented, corroborating the introduced methodology.

    Conclusion: The model representing counting data, such as SAGE, provides additional statistical information that allows a more robust analysis. This additional statistical information is incorporated in the methodology described in the paper. The introduced method is suitable for identifying signature genes that lead to a good separation of the biological states using SAGE and may be adapted for other counting methods such as Massively Parallel Signature Sequencing (MPSS) or the recent Sequencing-by-Synthesis (SBS) technique. Some of the genes identified by the proposed method may be useful for building classifiers.
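    As an illustration of the kind of machinery the abstract describes, here is a sketch of Bayesian credibility intervals for a tag's abundance based on a Beta posterior for the binomial proportion, together with a hypothetical ranking score combining credibility separation with the bolstered error. The paper's actual probabilistic model of SAGE counts and its exact score are not reproduced here.

```python
import numpy as np
from scipy.stats import beta

def credibility_interval(count, total, level=0.95, prior=(1.0, 1.0)):
    """Credibility interval for a tag's abundance from SAGE counts,
    using a Beta posterior for the binomial proportion (a standard
    count model; the paper's SAGE model may differ in its details)."""
    a, b = prior
    posterior = beta(a + count, b + total - count)
    return posterior.ppf((1 - level) / 2), posterior.ppf(1 - (1 - level) / 2)

def separates_states(count1, total1, count2, total2, level=0.95):
    """True if the credibility intervals for the two biological states
    do not overlap, i.e. the gene distinguishes the states reliably."""
    lo1, hi1 = credibility_interval(count1, total1, level)
    lo2, hi2 = credibility_interval(count2, total2, level)
    return hi1 < lo2 or hi2 < lo1

def rank_score(bolstered_error, fraction_separated):
    """Hypothetical combined score for a gene group: reward a small
    bolstered error and a large fraction of member genes whose
    credibility intervals separate the states (not the paper's score)."""
    return (1.0 - bolstered_error) * fraction_separated
```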

    Small sample feature selection

    High-throughput technologies for rapid measurement of vast numbers of biological variables offer the potential for highly discriminatory diagnosis and prognosis; however, high dimensionality together with small samples creates the need for feature selection, while at the same time making feature-selection algorithms less reliable. Feature selection is required to avoid overfitting, and the combinatorial nature of the problem demands a suboptimal feature-selection algorithm. In this dissertation, we have found that feature selection is problematic in small-sample settings via three different approaches. First, we examined the feature-ranking performance of several kinds of error estimators for different classification rules, by considering all feature subsets and using two measures of performance. The results show that feature rankings are strongly affected by inaccurate error estimation. Second, since enumerating all feature subsets is computationally impossible in practice, a suboptimal feature-selection algorithm is often employed to find, from a large set of potential features, a small subset with which to classify the samples. If error estimation is required for a feature-selection algorithm, then the impact of error estimation can be greater than the choice of algorithm. Lastly, we took a regression approach, comparing the classification errors of the optimal feature sets with the errors of the feature sets found by feature-selection algorithms. Our study shows that it is unlikely that feature selection will yield a feature set whose error is close to that of the optimal feature set, and that the inability to find a good feature set should not lead to the conclusion that good feature sets do not exist.
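    To ground the discussion, here is a small sketch of sequential forward selection driven by a pluggable error estimator. The default 5-fold cross-validation with LDA is only a placeholder; the dissertation's point is precisely that the choice of estimator plugged in here can matter more than the search strategy itself.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def sequential_forward_selection(X, y, n_select, estimate_error=None):
    """Greedy sequential forward selection. `estimate_error(X_sub, y_sub)`
    returns an error estimate for a classifier trained on the given
    feature subset; swapping this function (resubstitution, bolstered,
    cross-validation, bootstrap, ...) changes which features are picked."""
    if estimate_error is None:
        def estimate_error(X_sub, y_sub):   # placeholder: 5-fold CV with LDA
            clf = LinearDiscriminantAnalysis()
            return 1.0 - cross_val_score(clf, X_sub, y_sub, cv=5).mean()

    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select and remaining:
        err, best = min((estimate_error(X[:, selected + [f]], y), f) for f in remaining)
        selected.append(best)
        remaining.remove(best)
    return selected
```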