31 research outputs found

    Prediction of glycosylation sites using random forests

    Background: Post-translational modifications (PTMs) occur in the vast majority of proteins and are essential for function. Prediction of the sequence location of PTMs enhances the functional characterisation of proteins. Glycosylation is one type of PTM, and is implicated in protein folding, transport and function. Results: We use the random forest algorithm and pairwise patterns to predict glycosylation sites. We identify pairwise patterns surrounding glycosylation sites and use an odds ratio to weight their propensity of association with modified residues. Our prediction program, GPP (glycosylation prediction program), predicts glycosylation sites with an accuracy of 90.8% for Ser sites, 92.0% for Thr sites and 92.8% for Asn sites. This is significantly better than current glycosylation predictors. We use the trepan algorithm to extract a set of comprehensible rules from GPP, which provide biological insight into all three major glycosylation types. Conclusion: We have created an accurate predictor of glycosylation sites and used this to extract comprehensible rules about the glycosylation process. GPP is available online at http://comp.chem.nottingham.ac.uk/glyco/.
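The abstract does not say how GPP defines its pairwise patterns or its odds ratio, so the following is only a minimal sketch of the general idea. It assumes a pattern is a pair of (position, residue) entries inside a fixed-width window, and it uses a generic odds ratio with a 0.5 pseudocount to avoid division by zero. The function names `pair_patterns` and `odds_ratios` are hypothetical and are not taken from GPP.

```python
from collections import Counter
from itertools import combinations


def pair_patterns(window):
    """Return the set of (pos_i, res_i, pos_j, res_j) pairs in one window."""
    return {
        (i, window[i], j, window[j])
        for i, j in combinations(range(len(window)), 2)
    }


def odds_ratios(positives, negatives, pseudo=0.5):
    """Odds ratio of each pairwise pattern in modified vs. unmodified windows.

    Assumption: this is a generic 2x2 odds ratio with a pseudocount,
    not necessarily the weighting used by GPP.
    """
    pos_counts = Counter(p for w in positives for p in pair_patterns(w))
    neg_counts = Counter(p for w in negatives for p in pair_patterns(w))
    n_pos, n_neg = len(positives), len(negatives)
    ratios = {}
    for pattern in set(pos_counts) | set(neg_counts):
        a = pos_counts[pattern] + pseudo          # modified, pattern present
        b = n_pos - pos_counts[pattern] + pseudo  # modified, pattern absent
        c = neg_counts[pattern] + pseudo          # unmodified, pattern present
        d = n_neg - neg_counts[pattern] + pseudo  # unmodified, pattern absent
        ratios[pattern] = (a / b) / (c / d)
    return ratios


# Toy windows centred on a Thr residue (illustrative data only).
toy_ratios = odds_ratios(["ASTPA", "GSTPG"], ["ASTKA", "GKTPG"])
```

Ratios above 1 mark patterns seen more often around modified residues. Values like these could then be supplied as features to a random forest classifier.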

    MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics

    Background: Cis-regulatory elements (CREs) are DNA sequence segments that regulate gene expression. Among CREs are promoters, enhancers, Boundary Elements (BEs) and Polycomb Response Elements (PREs), all of which are enriched in specific sequence motifs that form particular occurrence landscapes. We have recently introduced a hierarchical machine learning approach (SVM-MOCCA) in which Support Vector Machines (SVMs) are applied on the level of individual motif occurrences, modelling local sequence composition, and then combined for the prediction of whole regulatory elements. We used SVM-MOCCA to predict PREs in Drosophila and found that it was superior to other methods. However, we did not publish a polished implementation of SVM-MOCCA, which would be useful to other researchers, and we only tested SVM-MOCCA with IUPAC motifs and PREs. Results: We here present an expanded suite for modelling CRE sequences in terms of motif occurrence combinatorics: Motif Occurrence Combinatorics Classification Algorithms (MOCCA). MOCCA contains efficient implementations of several modelling methods, including SVM-MOCCA, and a new method, RF-MOCCA, a Random Forest derivative of SVM-MOCCA. We used SVM-MOCCA and RF-MOCCA to model Drosophila PREs and BEs in cross-validation experiments, making this the first study to model PREs with Random Forests and the first study that applies the hierarchical MOCCA approach to the prediction of BEs. Both models significantly improve generalization to PREs and boundary elements beyond that of previous methods (including 4-spectrum and motif occurrence frequency Support Vector Machines and Random Forests), with RF-MOCCA yielding the best results. Conclusion: MOCCA is a flexible and powerful suite of tools for the motif-based modelling of CRE sequences in terms of motif composition. MOCCA can be applied to any new CRE modelling problems where motifs have been identified. MOCCA supports IUPAC and Position Weight Matrix (PWM) motifs. For ease of use, MOCCA implements generation of negative training data, and additionally a mode that requires only that the user specifies positives, motifs and a genome. MOCCA is licensed under the MIT license and is available on GitHub at https://github.com/bjornbredesen/MOCCA.

    Impact of Wild Loci on the Allergenic Potential of Cultivated Tomato Fruits

    Tomato (Solanum lycopersicum) is one of the most extensively consumed vegetables but, unfortunately, it can also induce allergic reactions. It has previously been shown that the choice of tomato cultivar significantly influences the allergic reaction of tomato-allergic subjects. In this study we investigated the allergenic potential of the cultivated tomato line M82 and of two selected lines carrying small chromosome regions from the wild species Solanum pennellii (i.e. IL7-3 and IL12-4). We evaluated the positive interactions of IgEs of allergic subjects in order to compare the allergenic potential of the lines under investigation. We used proteomic analyses to identify putative tomato allergens. In addition, bioinformatic and transcriptomic approaches were applied to analyse the structure and the expression profiles of the identified allergen-encoding genes. These analyses demonstrated that fruits harvested from the two selected introgression lines harbour a different allergenic potential from those of the cultivated genotype M82. The different allergenicity found within the three lines was mostly due to differences in the IgE recognition of a polygalacturonase enzyme (46 kDa), one of the major tomato allergens, and of a pectin methylesterase (34 kDa); both proteins were more immunoreactive in IL7-3 than in IL12-4 and M82. The observed differences in the allergenic potential were mostly due to line-dependent translational control or post-translational modifications of the allergens. We demonstrated, for the first time, that the introgression from a wild species (S. pennellii) into the genomic background of a cultivated tomato line influences the allergenic properties of the fruits. Our findings could support the isolation of favorable wild loci promoting low allergenic potential in tomato.

    A Protein-Protein Interaction Map of the Trypanosoma brucei Paraflagellar Rod

    We have conducted a protein interaction study of components within a specific sub-compartment of a eukaryotic flagellum. The trypanosome flagellum contains a para-crystalline extra-axonemal structure termed the paraflagellar rod (PFR) with around forty identified components. We have used a Gateway cloning approach coupled with yeast two-hybrid, RNAi and 2D DiGE to define a protein-protein interaction network taking place in this structure. We define two clusters of interactions. The first is characterised by two proteins with a shared domain that is not sufficient for maintaining the interaction. The second is populated by eight proteins, a number of which possess a PFR domain, and sub-populations of this network exhibit dependency relationships. Finally, we provide clues as to the structural organisation of the PFR at the molecular level. This multi-strand approach shows that protein interactome data can be generated for insoluble protein complexes.

    Incorporating significant amino acid pairs to identify O-linked glycosylation sites on transmembrane proteins and non-transmembrane proteins

    Background: While occurring enzymatically in biological systems, O-linked glycosylation affects protein folding, localization and trafficking, protein solubility, antigenicity, biological activity, as well as cell-cell interactions on membrane proteins. The catalytic enzymes involved include glycosyltransferases (sugar-transferring enzymes) and glycosidases, which trim specific monosaccharides from precursors to form intermediate structures. Due to the difficulty of experimental identification, several works have used computational methods to identify glycosylation sites. Results: By investigating glycosylated sites that contain various motifs between Transmembrane (TM) and non-Transmembrane (non-TM) proteins, this work presents a novel method, GlycoRBF, that implements radial basis function (RBF) networks with significant amino acid pairs (SAAPs) for identifying O-linked glycosylated serine and threonine on TM proteins and non-TM proteins. Additionally, membrane topology is considered for reducing the false positives on glycosylated TM proteins. Based on an evaluation using five-fold cross-validation, the consideration of membrane topology can reduce the false positives by 31.4% when identifying O-linked glycosylation sites on TM proteins. In an independent test, GlycoRBF outperforms previous O-linked glycosylation site prediction schemes. Conclusion: A case study of Cyclic AMP-dependent transcription factor ATF-6 alpha was presented to demonstrate the effectiveness of GlycoRBF. Web-based GlycoRBF, which can be accessed at http://GlycoRBF.bioinfo.tw, can identify O-linked glycosylated serine and threonine effectively and efficiently. Moreover, the structural topology of Transmembrane (TM) proteins with glycosylation sites is provided to users. The stand-alone version of GlycoRBF is also available for high throughput data analysis.

    AMS 3.0: prediction of post-translational modifications

    Background: We present here the recent update of the AMS algorithm for identification of post-translational modification (PTM) sites in proteins based only on sequence information, using an artificial neural network (ANN) method. The query protein sequence is dissected into overlapping short sequence segments. Ten different physicochemical features describe each amino acid; therefore a nine-residue segment is represented as a point in a 90-dimensional space. A database of sequence segments with experimentally confirmed post-translational modification sites is used for training a set of ANNs. Results: The efficiency of the classification for each type of modification and the prediction power of the method are estimated here using recall (sensitivity), precision values, the area under receiver operating characteristic (ROC) curves and leave-one-out tests (LOOCV). Significant differences in performance are observed for differently optimized neural networks, yet the AMS 3.0 tool integrates those heterogeneous classification schemes into a single consensus scheme, and it is able to boost the precision and recall values independent of PTM type in comparison with the currently available state-of-the-art methods. Conclusions: The standalone version of AMS 3.0 presents an efficient way to identify post-translational modifications for whole proteomes. The training datasets, precompiled binaries for the AMS 3.0 tool and the source code are available at http://code.google.com/p/automotifserver under the Apache 2.0 license scheme.
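The segment encoding described in this abstract (overlapping nine-residue windows, ten features per residue, 90 dimensions per segment) can be sketched roughly as follows. The abstract does not list the ten physicochemical features, so the `FEATURES` table below holds clearly artificial placeholder values, and the helper names `segments` and `encode` are hypothetical rather than part of AMS 3.0.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N_FEATURES = 10   # features per residue, as stated in the abstract
WINDOW = 9        # residues per segment, as stated in the abstract

# Placeholder table: these numbers are NOT real physicochemical values.
# A real encoder would substitute measured properties for each residue.
FEATURES = {
    aa: [((idx + 1) * (k + 1)) % 7 / 7.0 for k in range(N_FEATURES)]
    for idx, aa in enumerate(AMINO_ACIDS)
}


def segments(sequence, width=WINDOW):
    """Split a sequence into overlapping windows that advance by one residue."""
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]


def encode(segment):
    """Concatenate per-residue feature vectors into one flat vector."""
    vector = []
    for aa in segment:
        vector.extend(FEATURES[aa])
    return vector


windows = segments("ACDEFGHIKLMN")
vectors = [encode(w) for w in windows]  # each vector has 9 * 10 = 90 entries
```

Each encoded vector could then be passed to a classifier such as a neural network. A sequence shorter than the window yields no segments in this sketch.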

    Software for Automated Interpretation of Mass Spectrometry Data from Glycans and Glycopeptides

    The purpose of this review is to provide those interested in glycosylation analysis with up-to-date information on the availability of automated tools for MS characterization of N-linked and O-linked glycosylation types. Specifically, this review describes software tools that facilitate elucidation of glycosylation from MS data on the basis of mass alone, as well as software designed to speed the interpretation of glycan and glycopeptide fragmentation from MS/MS data. This review focuses equally on software designed to interpret the composition of released glycans and on tools to characterize N-linked and O-linked glycopeptides. Several websites have been compiled and described that will be helpful to the reader who is interested in further exploring the described tools.

    Automatic structure classification of small proteins using random forest

    Background: Random forest, an ensemble-based supervised machine learning algorithm, is used to predict the SCOP structural classification for a target structure, based on the similarity of its structural descriptors to those of a template structure with an equal number of secondary structure elements (SSEs). An initial assessment of random forest is carried out for domains consisting of three SSEs. The usability of random forest in classifying larger domains is demonstrated by applying it to domains consisting of four, five and six SSEs. Results: Random forest, trained on SCOP version 1.69, achieves a predictive accuracy of up to 94% on an independent and non-overlapping test set derived from SCOP version 1.73. For classification to the SCOP Class, Fold, Super-family or Family levels, the predictive quality of the model in terms of the Matthews correlation coefficient (MCC) ranged from 0.61 to 0.83. As the number of constituent SSEs increases, the MCC for classification to different structural levels decreases. Conclusions: The utility of random forest in classifying domains from the place-holder classes of SCOP to the true Class, Fold, Super-family or Family levels is demonstrated. Issues such as introduction of a new structural level in SCOP and the merger of singleton levels can also be addressed using random forest. A real-world scenario is mimicked by predicting the classification for those protein structures from the PDB which are yet to be assigned to the SCOP classification hierarchy.
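For readers unfamiliar with the MCC figures quoted above, the standard binary definition is sketched below. The abstract does not state how the authors aggregated MCC across SCOP levels or classes, so this shows only the two-class formula. The function name `mcc` and the counts in the usage line are illustrative assumptions.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns a value in [-1, 1]; by a common convention, 0.0 is returned
    when any marginal total is zero and the coefficient is undefined.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Illustrative counts only, not taken from the paper.
score = mcc(tp=90, tn=85, fp=15, fn=10)
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, which makes it more informative when class sizes are unbalanced.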

    Propensity based classification: Dehalogenase and non-dehalogenase enzymes

    The present work was designed to distinguish dehalogenase enzymes from non-dehalogenases (other hydrolases) using amino acid propensity at the core, at the surface, and in both regions combined. The data sets were assembled individually by selecting 3D protein structures available in the PDB (Protein Data Bank). Core amino acids were predicted with the IPFP tool, and their structural propensity was calculated with in-house software, Propensity Calculator, which is available online. All datasets were finally grouped into two categories, namely dehalogenase and non-dehalogenase, using the Naïve Bayes, J-48, Random forest, K-means clustering and SMO classification algorithms. In a comparison of the classification methods, the proposed tree method (Random forest) performs well, with a maximum classification accuracy of 98.88% for the core propensity data set. We therefore propose that core amino acid propensity could serve as a novel descriptor for the classification of enzymes.