4 research outputs found

    ProtDCal: A program to compute general-purpose-numerical descriptors for sequences and 3D-structures of proteins

    Background: The exponential growth of protein structural and sequence databases is enabling multifaceted approaches to understanding the long-sought sequence-structure-function relationship. Advances in computation now make it possible to apply well-established data mining and pattern recognition techniques to these data to learn models that effectively relate structure and function.

    Exploring general-purpose protein features for distinguishing enzymes and non-enzymes within the twilight zone

    Background: Computational prediction of protein function is one of the more complex problems in bioinformatics, because of the diversity of functions and mechanisms that proteins exert in nature. The problem is especially acute for proteins that share very low primary or tertiary structure similarity with existing annotated proteomes. New alignment-free (AF) tools are therefore needed to overcome the inherent limitations of classic alignment-based approaches. We have recently introduced AF protein-numerical-encoding programs (TI2BioP and ProtDCal), whose sequence-based features have been successfully applied to detect remote protein homologs, post-translational modifications and antibacterial peptides. Here we aim to demonstrate the applicability of four AF protein descriptor families, implemented in our programs, to the identification of enzyme-like proteins. At the same time, our novel family of 3D-structure-based descriptors is introduced for the first time. The Dobson & Doig (D&D) benchmark dataset is used to evaluate our AF protein descriptors, because its proven structural diversity permits one to emulate an experiment within the twilight zone of alignment-based methods (pair-wise identity <30%). The performance of our sequence-based predictor was further assessed using a subset of formerly uncharacterized proteins that currently represents a benchmark annotation dataset. Results: Four protein descriptor families, sequence-composition-based (0D), linear-topology-based (1D), pseudo-fold-topology-based (2D) and 3D-structure features (3D), were assessed using the D&D benchmark dataset. We show that only the families of ProtDCal's descriptors (0D, 1D and 3D) encode significant information for discriminating enzymes from non-enzymes. The resulting 3D-structure-based classifier ranked first among several other SVM-based methods assessed on this dataset. Furthermore, the model leveraging 1D descriptors showed a higher success rate than EzyPred on a benchmark annotation dataset from the Shewanella oneidensis proteome. Conclusions: The applicability of ProtDCal as a general-purpose AF protein modelling method is illustrated through the discrimination between two comprehensive protein functional classes. The performances observed on the highly diverse D&D dataset and on the set of formerly uncharacterized (hard-to-annotate) proteins of Shewanella oneidensis place our methodology in the top range of methods for modelling and predicting protein function with alignment-free approaches.
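
    To make the idea of a sequence-composition-based (0D) descriptor concrete, the following is a minimal sketch of the simplest alignment-free feature of this kind: the fraction of each of the 20 standard residues in a sequence. It is an illustration only and is not taken from ProtDCal or TI2BioP; the function name `composition_descriptor` and the residue ordering are assumptions made for this example.

```python
from collections import Counter

# Hypothetical fixed ordering of the 20 standard amino acids (an assumption
# for this sketch; it is not ProtDCal's descriptor definition).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_descriptor(sequence):
    """Return a 20-element vector with the fraction of each standard residue.

    Non-standard characters (for example 'X') are ignored when counting.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    vector = composition_descriptor("MKKLLA")
    # Print only the residues that actually occur in the sequence.
    print({aa: v for aa, v in zip(AMINO_ACIDS, vector) if v > 0})
```

    A vector like this needs no alignment to a reference sequence, which is why features of this type can still be computed for proteins in the twilight zone.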

    A physics-based scoring function for protein structural decoys: Dynamic testing on targets of CASP-ROLL

    Most successful structure prediction strategies use knowledge-based functions for global optimization, in spite of their intrinsically limited potential to create new folds, while physics-based approaches are often employed only during structure refinement steps. We here propose a physics-based scoring potential intended to perform global searches of the conformational space. We introduce a dynamic test to evaluate the discrimination power of our function, and compare it with predictions of targets from the CASP-ROLL competition. Results demonstrate that this dynamic test is able to generate 3D models which outrank 59% (according to the GDT-TS score) of models generated with ab initio structure prediction servers.
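
    For readers unfamiliar with the GDT-TS score mentioned above, here is a simplified sketch of how it is commonly defined: the average, over distance cutoffs of 1, 2, 4 and 8 Å, of the percentage of residues whose model position lies within the cutoff of the reference position. The sketch assumes the two coordinate sets are already superposed and residue-matched; the full GDT procedure searches over superpositions to maximise each percentage, which is not done here. The function name `gdt_ts` is hypothetical, and nothing in this block comes from the paper itself.

```python
import math


def gdt_ts(model, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-TS for two pre-superposed lists of (x, y, z) coordinates.

    Assumption: one coordinate per residue (e.g. C-alpha atoms), in the same
    order in both lists, with no superposition search performed.
    """
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    distances = [math.dist(m, r) for m, r in zip(model, reference)]
    # Fraction of residues within each cutoff, then averaged and scaled to 0-100.
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * sum(fractions) / len(fractions)


if __name__ == "__main__":
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    model = [(0.5, 0.0, 0.0), (1.0, 1.5, 0.0), (2.0, 0.0, 3.0), (3.0, 10.0, 0.0)]
    print(gdt_ts(model, reference))
```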

    Proteome-wide Prediction of Lysine Methylation Leads to Identification of H2BK43 Methylation and Outlines the Potential Methyllysine Proteome

    Protein Lys methylation plays a critical role in numerous cellular processes, but it is challenging to identify Lys methylation in a systematic manner. Here we present an approach combining in silico prediction with targeted mass spectrometry (MS) to identify Lys methylation (Kme) sites at the proteome level. We develop MethylSight, a program that predicts Kme events based solely on the physicochemical properties of residues surrounding the putative methylation sites; the predictions then require validation by targeted MS. Using this approach, we identify 70 new histone Kme marks with a 90% validation rate. H2BK43me2, which undergoes dynamic changes during stem cell differentiation, is found to be a substrate of KDM5b. Furthermore, MethylSight predicts that Lys methylation is a prevalent post-translational modification in the human proteome. Our work provides a useful resource for guiding systematic exploration of the role of Lys methylation in human health and disease. Biggar et al. develop an algorithm to identify lysine methylation sites and use this resource to provide insight into the potential of the methyllysine proteome. The results also validate 45 new histone methylation sites by targeted mass spectrometry and show that one of these sites, H2B-K43me2, is a substrate of the KDM5B demethylase.
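
    A predictor that relies on the residues surrounding each candidate site first has to extract those residues. The sketch below shows one plausible way to do so: a fixed-width window centred on every lysine, padded at the sequence ends. The flank width of 7, the padding character, and the function name `lysine_windows` are assumptions for illustration and are not taken from MethylSight; computing physicochemical features from each window is left out.

```python
def lysine_windows(sequence, flank=7, pad="-"):
    """Return (1-based position, window) for every lysine in the sequence.

    Each window has length 2 * flank + 1 with the lysine at its centre;
    positions beyond the sequence ends are filled with the pad character.
    """
    seq = sequence.upper()
    padded = pad * flank + seq + pad * flank
    windows = []
    for i, residue in enumerate(seq):
        if residue == "K":
            # In the padded string the lysine sits at index i + flank, so the
            # window spans padded[i : i + 2 * flank + 1].
            windows.append((i + 1, padded[i:i + 2 * flank + 1]))
    return windows


if __name__ == "__main__":
    for position, window in lysine_windows("AKAAK", flank=2):
        print(position, window)
```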