4,675 research outputs found

    Novel Pattern Recognition Approaches to Identification of Gene-Expression Pathways in Banana Cultivars

    Bolstered resubstitution is a simple and fast error estimation method that has been shown to perform better than cross-validation and comparably with bootstrap in small-sample settings. However, it has been observed that its performance can deteriorate in high-dimensional feature spaces. To overcome this issue, we propose here a modification of bolstered error estimation based on the principle of Naive Bayes. This estimator is simple to compute and is reducible under feature selection. In experiments using popular classification rules applied to data from a well-known breast cancer gene expression study, the new Naive-Bayes bolstered estimator outperformed the old one, as well as cross-validation and resubstitution, in high-dimensional target feature spaces (after feature selection); it was superior to the 0.632 bootstrap provided that the sample size was not too small. Model selection is the task of choosing a model with optimal complexity for the given data set. Most model selection criteria try to minimize the sum of a training error term and a complexity control term, that is, minimize the complexity penalized loss. We investigate replacing the training error with bolstered resubstitution in the penalized loss to do model selection. Computer simulations indicate that the proposed method improves the performance of the model selection in terms of choosing the correct model complexity. Besides applying novel error estimation to model selection in pattern recognition, we also apply it to assess the performance of classifiers designed on the banana gene-expression data. Bananas are the world's most important fruit; they are a vital component of local diets in many countries. Diseases and drought are major threats in banana production. To generate disease and drought tolerant bananas, we need to identify disease and drought responsive genes and pathways. 
Towards this goal, we conducted RNA-Seq analysis with wild type and transgenic banana, with and without inoculation/drought stress, and on different days after applying the stress. By combining several state-of-the-art computational models, we identified stress responsive genes and pathways. The validation results of these genes in Arabidopsis are promising.
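
To make the bolstered resubstitution idea in the abstract above concrete, here is a minimal Python sketch for the special case of a linear classifier with spherical Gaussian bolstering kernels. It is an illustration under stated assumptions, not the authors' implementation: the function names, the fixed kernel width `sigma`, and the toy data are hypothetical, and in practice the kernel width would be estimated from the training data. Each training point is replaced by a Gaussian centred on it, and the estimate averages the kernel mass that falls on the wrong side of the decision boundary. The Naive-Bayes modification proposed in the abstract is not covered.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bolstered_resub_linear(points, labels, w, b, sigma):
    """Bolstered resubstitution error for a linear classifier (sketch).

    Assumptions (hypothetical, for illustration only):
      - labels are 0 or 1, and the classifier predicts 1 when w.x + b > 0;
      - every point gets the same spherical Gaussian kernel of width sigma.
    """
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    total = 0.0
    for x, y in zip(points, labels):
        # Signed distance from the point to the hyperplane w.x + b = 0.
        distance = (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w
        # Positive when the point lies on the side matching its label.
        signed = distance if y == 1 else -distance
        # Gaussian mass that spills over to the wrong side of the boundary.
        total += normal_cdf(-signed / sigma)
    return total / len(points)


# Toy data: one point on the boundary, one far on the correct side.
toy_points = [(0.0, 0.0), (10.0, 0.0)]
toy_labels = [1, 1]
toy_estimate = bolstered_resub_linear(toy_points, toy_labels, (1.0, 0.0), 0.0, 1.0)
```

The point on the boundary contributes half of its kernel mass as error, while the distant point contributes almost nothing, so the estimate changes smoothly with the geometry instead of only counting misclassified points.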

    A feature selection approach for identification of signature genes from SAGE data

    Background: One goal of gene expression profiling is to identify signature genes that robustly distinguish different types or grades of tumors. Several tumor classifiers based on expression profiling have been proposed using microarray techniques. Because the probabilistic models of microarray and SAGE technologies differ in important ways, suitable techniques are needed to select specific genes from SAGE measurements. Results: A new framework is proposed to select specific genes that distinguish different biological states based on the analysis of SAGE data. The framework applies the bolstered error to identify strong genes that separate the biological states in a feature space defined by the gene expression of a training set. Credibility intervals defined from a probabilistic model of SAGE measurements are used to identify, among all gene groups selected by the strong genes method, the genes that distinguish the different states most reliably. A score that takes into account the credibility and the bolstered error values is proposed to rank the groups of considered genes. Results obtained using SAGE data from gliomas are presented, corroborating the introduced methodology. Conclusion: The model representing counting data, such as SAGE, provides additional statistical information that allows a more robust analysis. This additional information is incorporated in the methodology described in the paper. The introduced method is suitable for identifying signature genes that lead to a good separation of the biological states using SAGE, and it may be adapted for other counting methods such as Massively Parallel Signature Sequencing (MPSS) or the recent Sequencing-By-Synthesis (SBS) technique. Some of the genes identified by the proposed method may be useful for generating classifiers.

    Biomarker Discovery and Validation for Proteomics and Genomics: Modeling And Systematic Analysis

    Discovery and validation of protein biomarkers with high specificity is the main challenge of current proteomics studies. Different mass spectrometry models are used as shotgun tools for discovery of biomarkers, which is usually done on a small number of samples. In the discovery phase, feature selection plays a key role. The first part of this work focuses on the feature selection problem and proposes a new Branch and Bound algorithm based on the U-curve assumption. The U-curve branch-and-bound algorithm (UBB) for optimization was introduced recently by Barrera and collaborators. In this work we introduce an improved algorithm (IUBB) for finding the optimal set of features based on the U-curve assumption. The results for a set of U-curve problems, generated from a cost model, show that the IUBB algorithm makes fewer evaluations and is more robust than the original UBB algorithm. The two algorithms are also compared in finding the optimal features of a real classification problem designed using the data model. The results show that IUBB outperforms UBB in finding the optimal feature sets. On the other hand, the results indicate that the performance of the error estimator is crucial to the success of the feature selection algorithm. To study the effect of error estimation methods, in the next section of the work, we study the effect of the complexity of the decision boundary on the performance of error estimation methods. First, a model is developed which quantifies the complexity of a classification problem purely in terms of the geometry of the decision boundary, without relying on the Bayes error. Then, this model is used in a simulation study to analyze the bias and root-mean-square error (RMS) of a few widely used error estimation methods relative to the complexity of the decision boundary. The results show that all the estimation methods lose accuracy as complexity increases.
Validation of a set of selected biomarkers from a list of candidates is an important stage in the biomarker identification pipeline and is the focus of the next section of this work. This section analyzes the Selected Reaction Monitoring (SRM) pipeline in a systematic fashion, by modelling the main stages of the biomarker validation process. The proposed models for SRM and protein mixture are then used to study the effect of different parameters on the final performance of biomarker validation. We focus on the sensitivity of the SRM pipeline to the working parameters, in order to identify the bottlenecks where time and energy should be spent in designing the experiment.
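
The U-curve assumption mentioned in this abstract can be illustrated with a small Python sketch. This is not the UBB or IUBB algorithm, only the pruning intuition behind them: if the cost along a chain of nested feature sets first falls and then rises, the walk along that chain can stop at the first increase. The function name, the chain, and the toy cost function are hypothetical and chosen purely for illustration.

```python
def first_minimum_on_chain(chain, cost):
    """Walk a chain of nested feature sets and stop at the first cost increase.

    Assumes (U-curve assumption) that the cost along the chain is U-shaped,
    so once the cost goes up it will not come back down further along.
    Returns the best set, its cost, and the number of cost evaluations made.
    """
    best_set = chain[0]
    best_cost = cost(best_set)
    evaluations = 1
    for candidate in chain[1:]:
        candidate_cost = cost(candidate)
        evaluations += 1
        if candidate_cost > best_cost:
            break  # the rest of the chain is pruned without being evaluated
        best_set, best_cost = candidate, candidate_cost
    return best_set, best_cost, evaluations


# Hypothetical chain of nested feature sets and a toy U-shaped cost.
toy_chain = [
    frozenset(),
    frozenset({"g1"}),
    frozenset({"g1", "g2"}),
    frozenset({"g1", "g2", "g3"}),
    frozenset({"g1", "g2", "g3", "g4"}),
]


def toy_cost(features):
    # Minimal at two features; stands in for an estimated classification error.
    return (len(features) - 2) ** 2


chain_best, chain_cost, chain_evals = first_minimum_on_chain(toy_chain, toy_cost)
```

In this toy run the last set in the chain is never evaluated, which is the kind of saving a U-curve search relies on. Because the cost would in practice be an estimated error, the quality of the error estimator directly affects where the walk stops.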

    Reliable Classifier to Differentiate Primary and Secondary Acute Dengue Infection Based on IgG ELISA

    Dengue virus infection causes a wide spectrum of illness, ranging from sub-clinical to severe disease. Severe dengue is associated with sequential viral infections. A strict definition of primary versus secondary dengue infections requires a combination of several tests performed at different stages of the disease, which is not practical. We developed a simple method to classify dengue infections as primary or secondary based on the levels of dengue-specific IgG. A group of 109 dengue infection patients were classified as having primary or secondary dengue infection on the basis of a strict combination of results from assays of antigen-specific IgM and IgG, isolation of virus and detection of the viral genome by PCR tests performed on multiple samples, collected from each patient over a period of 30 days. The dengue-specific IgG levels of all samples from 59 of the patients were analyzed by linear discriminant analysis (LDA), and one- and two-dimensional classifiers were designed. The one-dimensional classifier was estimated by bolstered resubstitution error estimation to have 75.1% sensitivity and 92.5% specificity. The two-dimensional classifier was designed by also taking into consideration the number of days after the onset of symptoms, with an estimated sensitivity and specificity of 91.64% and 92.46%. The performance of the two-dimensional classifier was validated using an independent test set of standard samples from the remaining 50 patients. The classifications of the independent set of samples determined by the two-dimensional classifiers were further validated by comparing with two other dengue classification methods: hemagglutination inhibition (HI) assay and an in-house anti-dengue IgG-capture ELISA method.
The decisions made with the two-dimensional classifier were in 100% accordance with the HI assay and 96% with the in-house ELISA. Once acute dengue infection has been determined, a 2-D classifier based on common dengue virus IgG kits can reliably distinguish primary and secondary dengue infections. Software for calculation and validation of the 2-D classifier is made available for download.
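
For readers unfamiliar with the sensitivity and specificity figures quoted in this abstract, the following Python sketch shows how the two quantities are computed from true and predicted labels. It is a generic illustration, not the study's software: the function name and the toy labels are hypothetical, and treating one class (here coded as 1) as the "positive" class is an assumption, since the abstract does not say which infection type played that role.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels.

    Sensitivity = TP / (TP + FN): fraction of positive cases detected.
    Specificity = TN / (TN + FP): fraction of negative cases correctly cleared.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels: three positive cases and two negative cases.
toy_truth = [1, 1, 1, 0, 0]
toy_predictions = [1, 1, 0, 0, 1]
toy_sensitivity, toy_specificity = sensitivity_specificity(toy_truth, toy_predictions)
```

In the abstract these quantities are estimated with bolstered resubstitution rather than by plain counting on the training samples, so the reported percentages would not come out of a simple tally like this one.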