32 research outputs found
Recommended from our members
FunFam protein families improve residue level molecular function prediction
Background
The CATH database provides a hierarchical classification of protein domain structures including a sub-classification of superfamilies into functional families (FunFams). We analyzed the similarity of binding site annotations in these FunFams and incorporated FunFams into the prediction of protein binding residues.
Results
FunFam members agreed, on average, in 36.9 ± 0.6% of their binding residue annotations. This constituted a 6.7-fold increase over randomly grouped proteins and a 1.2-fold increase (1.1-fold on the same dataset) over proteins with the same enzymatic function (identical Enzyme Commission, EC, number). Mapping de novo binding residue prediction methods (BindPredict-CCS, BindPredict-CC) onto FunFam resulted in consensus predictions for those residues that were aligned and predicted alike (binding/non-binding) within a FunFam. This simple consensus increased the F1-score (for binding) 1.5-fold over the original prediction method. Variation of the threshold for how many proteins in the consensus prediction had to agree provided a convenient control of accuracy/precision and coverage/recall, e.g. reaching a precision as high as 60.8 ± 0.4% for a stringent threshold.
Conclusions
The FunFams outperformed even the carefully curated EC numbers in terms of agreement of binding site residues. Additionally, we assume that our proof-of-principle through the prediction of protein binding residues will be relevant for many other solutions profiting from FunFams to infer functional information at the residue level
Recommended from our members
FunFam protein families improve residue level molecular function prediction
BACKGROUND: The CATH database provides a hierarchical classification of protein domain structures including a sub-classification of superfamilies into functional families (FunFams). We analyzed the similarity of binding site annotations in these FunFams and incorporated FunFams into the prediction of protein binding residues. RESULTS: FunFam members agreed, on average, in 36.9 ± 0.6% of their binding residue annotations. This constituted a 6.7-fold increase over randomly grouped proteins and a 1.2-fold increase (1.1-fold on the same dataset) over proteins with the same enzymatic function (identical Enzyme Commission, EC, number). Mapping de novo binding residue prediction methods (BindPredict-CCS, BindPredict-CC) onto FunFam resulted in consensus predictions for those residues that were aligned and predicted alike (binding/non-binding) within a FunFam. This simple consensus increased the F1-score (for binding) 1.5-fold over the original prediction method. Variation of the threshold for how many proteins in the consensus prediction had to agree provided a convenient control of accuracy/precision and coverage/recall, e.g. reaching a precision as high as 60.8 ± 0.4% for a stringent threshold. CONCLUSIONS: The FunFams outperformed even the carefully curated EC numbers in terms of agreement of binding site residues. Additionally, we assume that our proof-of-principle through the prediction of protein binding residues will be relevant for many other solutions profiting from FunFams to infer functional information at the residue level
CATH functional families predict functional sites in proteins
MOTIVATION: Identification of functional sites in proteins is essential for functional characterization, variant interpretation and drug design. Several methods are available for predicting either a generic functional site, or specific types of functional site. Here, we present FunSite, a machine learning predictor that identifies catalytic, ligand-binding and protein-protein interaction functional sites using features derived from protein sequence and structure, and evolutionary data from CATH functional families (FunFams). RESULTS: FunSite's prediction performance was rigorously benchmarked using cross-validation and a holdout dataset. FunSite outperformed other publicly-available functional site prediction methods. We show that conserved residues in FunFams are enriched in functional sites. We found FunSite's performance depends greatly on the quality of functional site annotations and the information content of FunFams in the training data. Finally, we analyse which structural and evolutionary features are most predictive for functional sites. AVAILABILITY: https://github.com/UCL/cath-funsite-predictor. CONTACT: [email protected]. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online
On Computable Protein Functions
Proteins are biological machines that perform the majority of functions necessary for life. Nature has evolved many different proteins, each of which perform a subset of an organism’s functional repertoire. One aim of biology is to solve the sparse high dimensional problem of annotating all proteins with their true functions. Experimental characterisation remains the gold standard for assigning function, but is a major bottleneck due to resource scarcity. In this thesis, we develop a variety of computational methods to predict protein function, reduce the functional search space for proteins, and guide the design of experimental studies. Our methods take two distinct approaches: protein-centric methods that predict the functions of a given protein, and function-centric methods that predict which proteins perform a given function. We applied our methods to help solve a number of open problems in biology. First, we identified new proteins involved in the progression of Alzheimer’s disease using proteomics data of brains from a fly model of the disease. Second, we predicted novel plastic hydrolase enzymes in a large data set of 1.1 billion protein sequences from metagenomes. Finally, we optimised a neural network method that extracts a small number of informative features from protein networks, which we used to predict functions of fission yeast proteins
AlphaFold2 reveals commonalities and novelties in protein structure space for 21 model organisms
Deep-learning (DL) methods like DeepMind's AlphaFold2 (AF2) have led to substantial improvements in protein structure prediction. We analyse confident AF2 models from 21 model organisms using a new classification protocol (CATH-Assign) which exploits novel DL methods for structural comparison and classification. Of ~370,000 confident models, 92% can be assigned to 3253 superfamilies in our CATH domain superfamily classification. The remaining cluster into 2367 putative novel superfamilies. Detailed manual analysis on 618 of these, having at least one human relative, reveal extremely remote homologies and further unusual features. Only 25 novel superfamilies could be confirmed. Although most models map to existing superfamilies, AF2 domains expand CATH by 67% and increases the number of unique 'global' folds by 36% and will provide valuable insights on structure function relationships. CATH-Assign will harness the huge expansion in structural data provided by DeepMind to rationalise evolutionary changes driving functional divergence
Gene3D: Multi-domain annotations for protein sequence and comparative genome analysis
Gene3D (http://gene3d.biochem.ucl.ac.uk) is a database of protein domain structure annotations for protein sequences. Domains are predicted using a library of profile HMMs from 2738 CATH superfamilies. Gene3D assigns domain annotations to Ensembl and UniProt sequence sets including >6000 cellular genomes and >20 million unique protein sequences. This represents an increase of 45% in the number of protein sequences since our last publication. Thanks to improvements in the underlying data and pipeline, we see large increases in the domain coverage of sequences. We have expanded this coverage by integrating Pfam and SUPERFAMILY domain annotations, and we now resolve domain overlaps to provide highly comprehensive composite multi-domain architectures. To make these data more accessible for comparative genome analyses, we have developed novel search algorithms for searching genomes to identify related multi-domain architectures. In addition to providing domain family annotations, we have now developed a pipeline for 3D homology modelling of domains in Gene3D. This has been applied to the human genome and will be rolled out to other major organisms over the next year
CATH: comprehensive structural and functional annotations for genome sequences.
The latest version of the CATH-Gene3D protein structure classification database (4.0, http://www.cathdb.info) provides annotations for over 235,000 protein domain structures and includes 25 million domain predictions. This article provides an update on the major developments in the 2 years since the last publication in this journal including: significant improvements to the predictive power of our functional families (FunFams); the release of our 'current' putative domain assignments (CATH-B); a new, strictly non-redundant data set of CATH domains suitable for homology benchmarking experiments (CATH-40) and a number of improvements to the web pages
Gene3D: expanding the utility of domain assignments
Gene3D http://gene3d.biochem.ucl.ac.uk is a database of domain annotations of Ensembl and UniProtKB protein sequences. Domains are predicted using a library of profile HMMs representing 2737 CATH superfamilies. Gene3D has previously featured in the Database issue of NAR and here we report updates to the website and database. The current Gene3D (v14) release has expanded its domain assignments to ∼20 000 cellular genomes and over 43 million unique protein sequences, more than doubling the number of protein sequences since our last publication. Amongst other updates, we have improved our Functional Family annotation method. We have also improved the quality and coverage of our 3D homology modelling pipeline of predicted CATH domains. Additionally, the structural models have been expanded to include an extra model organism (Drosophila melanogaster). We also document a number of additional visualization tools in the Gene3D website
KinFams: De-Novo Classification of Protein Kinases Using CATH Functional Units
Protein kinases are important targets for treating human disorders, and they are the second most targeted families after G-protein coupled receptors. Several resources provide classification of kinases into evolutionary families (based on sequence homology); however, very few systematically classify functional families (FunFams) comprising evolutionary relatives that share similar functional properties. We have developed the FunFam-MARC (Multidomain ARchitecture-based Clustering) protocol, which uses multi-domain architectures of protein kinases and specificity-determining residues for functional family classification. FunFam-MARC predicts 2210 kinase functional families (KinFams), which have increased functional coherence, in terms of EC annotations, compared to the widely used KinBase classification. Our protocol provides a comprehensive classification for kinase sequences from >10,000 organisms. We associate human KinFams with diseases and drugs and identify 28 druggable human KinFams, i.e., enriched in clinically approved drugs. Since relatives in the same druggable KinFam tend to be structurally conserved, including the drug-binding site, these KinFams may be valuable for shortlisting therapeutic targets. Information on the human KinFams and associated 3D structures from AlphaFold2 are provided via our CATH FTP website and Zenodo. This gives the domain structure representative of each KinFam together with information on any drug compounds available. For 32% of the KinFams, we provide information on highly conserved residue sites that may be associated with specificity.Adeyelu T, Bordin N, Waman VP, Sadlej M, Sillitoe I, Moya-Garcia AA, Orengo CA. KinFams: De-Novo Classification of Protein Kinases Using CATH Functional Units. Biomolecules. 2023; 13(2):277. https://doi.org/10.3390/biom1302027