5 research outputs found

    Enhancing the Biological Relevance of Machine Learning Classifiers for Reverse Vaccinology.

    Reverse vaccinology (RV) is a bioinformatics approach that can predict antigens with protective potential from the protein-coding genomes of bacterial pathogens for subunit vaccine design. RV has become firmly established following the development of the BEXSERO® vaccine against Neisseria meningitidis serogroup B. RV studies have begun to incorporate machine learning (ML) techniques to distinguish bacterial protective antigens (BPAs) from non-BPAs. This research contributes significantly to the RV field by using permutation analysis to demonstrate that a signal for protective antigens can be curated from published data. Furthermore, the effects of the following on an ML approach to RV were also assessed: nested cross-validation, balancing selection of non-BPAs for subcellular localization, increasing the training data, and incorporating greater numbers of protein annotation tools for feature generation. These enhancements yielded a support vector machine (SVM) classifier that could discriminate BPAs (n = 200) from non-BPAs (n = 200) with an area under the curve (AUC) of 0.787. In addition, hierarchical clustering of BPAs revealed that intracellular BPAs clustered separately from extracellular BPAs. However, no immediate benefit was derived when training SVM classifiers on data sets exclusively containing intra- or extracellular BPAs. In conclusion, this work demonstrates that ML classifiers have great utility in RV approaches and will lead to new subunit vaccines in the future.
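The abstract above relies on permutation analysis and AUC to show that a classifier has picked up a real signal. As a rough illustration only, the sketch below computes AUC from classifier scores and estimates a permutation p-value by shuffling the labels. It is not the authors' procedure: the function names, the toy Gaussian scores and the number of permutations are all assumptions made for the example.

```python
import random


def auc(scores, labels):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(scores, labels, n_permutations=1000, seed=0):
    """Return the observed AUC and the share of label shuffles that match or beat it."""
    rng = random.Random(seed)
    observed = auc(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any link between scores and labels
        if auc(scores, shuffled) >= observed:
            hits += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return observed, (hits + 1) / (n_permutations + 1)


# Hypothetical toy data: 50 "BPA" scores drawn slightly higher than 50 "non-BPA" scores.
toy_rng = random.Random(42)
labels = [1] * 50 + [0] * 50
scores = [toy_rng.gauss(1.0, 1.0) for _ in range(50)] + [toy_rng.gauss(0.0, 1.0) for _ in range(50)]
observed_auc, p_value = permutation_p_value(scores, labels)
print(f"AUC = {observed_auc:.3f}, permutation p-value = {p_value:.4f}")
```

A small p-value here means that few random labellings reach the observed AUC, which is the sense in which a permutation test supports the presence of a signal.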

    Modelling at the transcriptome - proteome interface

    In high-throughput experimental biology, mRNA expression levels and the corresponding protein abundances are often analysed jointly to observe the relationship between these two omic measurements. While some experiments have shown good correlation between transcriptome and proteome for some species under different conditions, such correlation is not universal because of post-transcriptional and post-translational regulation. Bridging the gap between transcriptome and proteome measurements therefore allows us to uncover useful biological insights into this regulation, which is important for studying the protein generation process and several disease conditions. We develop a data-driven predictor that uses transcriptome-layer properties as proxies for protein abundance and employ the model in a novel manner to detect post-translationally regulated proteins, hypothesizing that model failures (outlier proteins) occur because post-translational modifications (PTMs) disrupt protein stability. Three outlier detection techniques were employed with our protein abundance predictor to detect post-translationally regulated proteins: (1) a simple linear regression model, which detects outliers from the scatter plot of predicted against measured protein abundance; (2) an Outlier Rejecting Regression (ORR) model, a novel mathematical formulation that returns a user-specified fraction of the data as outliers by solving a non-convex optimization problem using the Difference of Convex functions Algorithm (DCA); and (3) Quantile Regression (QR), which employs an asymmetric loss model to detect only outliers with negative losses, used here for the first time in omics. Proteins extracted as outliers using these techniques supported our hypothesis on post-translational regulation (PTR), with high statistical confidence for functional annotations and pathway information. This data-driven framework can therefore be used as a reliable technique for biologists to reduce laboratory experimental workspace in detecting post-translationally regulated proteins.
    We also perform a thorough inference analysis on the most commonly used high-throughput microarray and RNA-Seq measurements, using several machine learning inference techniques, to observe whether their high numerical precision provides additional information about a gene beyond a binary representation of its on/off status. We perform this analysis at the transcriptome level and also at the proteome level, as an extended experimental setting of our PTR detection framework. These analyses suggest that binarized mRNA concentrations, measured using high-throughput RNA-Seq and microarray technologies, are sufficient for machine learning inferences of accuracy similar to those obtained from continuous measurements, not only at the transcriptome level but also at the proteome level, both to predict protein abundance and to detect post-translationally regulated proteins with high confidence.
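The first of the three outlier detection techniques listed above, a simple linear regression whose poorly predicted points are treated as outliers, can be sketched in a few lines. This is only a minimal illustration under stated assumptions: it uses a single hypothetical predictor (an mRNA level), ordinary least squares, and a `fraction` parameter that is not taken from the work itself. The ORR and quantile regression formulations are not reproduced here.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_outliers(x, y, fraction=0.1):
    """Flag the given fraction of points with the largest absolute residuals."""
    slope, intercept = fit_line(x, y)
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    n_flag = max(1, round(fraction * len(x)))
    ranked = sorted(range(len(x)), key=lambda i: abs(residuals[i]), reverse=True)
    return sorted(ranked[:n_flag]), residuals


# Hypothetical toy data: protein abundance follows mRNA linearly except for one gene.
mrna = [float(i) for i in range(10)]
protein = [2.0 * m + 1.0 for m in mrna]
protein[4] += 10.0  # the one poorly predicted protein
outliers, residuals = flag_outliers(mrna, protein, fraction=0.1)
print("flagged indices:", outliers)
```

In the setting the abstract describes, the flagged proteins would then be checked against functional annotations; the sketch stops at the flagging step.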

    Bridging the gap between transcriptome and proteome measurements identifies post-translationally regulated genes

    Motivation: Despite much dynamical cellular behaviour being achieved by accurate regulation of protein concentrations, messenger RNA abundances, measured by microarray technology and more recently by deep sequencing techniques, are widely used as proxies for protein measurements. Although for some species and under some conditions there is good correlation between transcriptome and proteome level measurements, such correlation is by no means universal because of post-transcriptional and post-translational regulation, both of which are highly prevalent in cells. Here, we seek to develop a data-driven machine learning approach to bridging the gap between these two levels of high-throughput omic measurements on Saccharomyces cerevisiae and deploy the model in a novel way to uncover mRNA-protein pairs that are candidates for post-translational regulation.
    Results: The application of feature selection by sparsity-inducing regression (l1-norm regularization) leads to a stable set of features (mRNA, ribosomal occupancy, ribosome density, tRNA adaptation index and codon bias), achieving a feature reduction from 37 to 5. A linear predictor used with these features is capable of predicting protein concentrations fairly accurately. Proteins whose concentration cannot be predicted accurately, taken as outliers with respect to the predictor, are shown to have annotation evidence of post-translational modification significantly more often than random subsets of similar size. In a data mining sense, this work also makes the wider point that outliers with respect to a learning method can carry meaningful information about a problem domain.
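The feature selection step described above uses l1-norm regularization, which drives the weights of uninformative features to exactly zero. The sketch below shows that effect with a plain coordinate-descent lasso. It is an illustration, not the method used in the work: it assumes centred features and targets (so no intercept is fitted), no all-zero feature columns, and a hypothetical penalty `lam`.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, returning exactly 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iterations=100):
    """Minimise (1/2n) * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iterations):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the residual that ignores feature j
            z = 0.0    # squared norm of feature j
            for row, target in zip(X, y):
                prediction = sum(wk * xk for wk, xk in zip(w, row))
                partial_residual = target - prediction + w[j] * row[j]
                rho += row[j] * partial_residual
                z += row[j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n)
    return w


# Hypothetical toy data: the target depends on feature 0 only; feature 1 is irrelevant.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [3.0, -3.0, 3.0, -3.0]
weights = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print("weights:", weights, "selected features:", selected)
```

Features whose weight ends at zero are dropped, which is how a set of 37 candidate features could shrink to a handful; the retained weight is also shrunk towards zero by the penalty.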
