36 research outputs found

    Analysis of Next-generation Sequencing Data in Virology - Opportunities and Challenges

    Viruses are the most abundant and the smallest organisms, which are relatively simple to sequence. Genome sequence data of viruses, from individual species to populations, outnumber those of other species. Although this offers an opportunity to study viral diversity at varying levels of taxonomic hierarchy, it also poses challenges for systematic and structured organization of data and its downstream processing. Extensive computational analyses using a number of algorithms and programs have opened exciting opportunities for virus discovery and diagnostics, apart from augmenting our understanding of the intriguing world of viruses. Unravelling evolutionary dynamics of viruses permits improved understanding of phenomena such as quasispecies diversity and the role of mutations in host switching and drug resistance, which enables tangible measurement of the genotype and phenotype of viruses. Improved understanding of geno-/serotype diversity in correlation with antigenic diversity will facilitate rational design and development of efficacious vaccines against emerging and re-emerging viruses. Mathematical models developed using the genomic data could be used to predict the spread of viruses due to vector switching and the (re)emergence due to host switching and, thereby, contribute towards designing public health policies for disease management and control.

    KinFams: De-Novo Classification of Protein Kinases Using CATH Functional Units

    Protein kinases are important targets for treating human disorders, and they are the second most targeted family after G-protein coupled receptors. Several resources provide classification of kinases into evolutionary families (based on sequence homology); however, very few systematically classify functional families (FunFams) comprising evolutionary relatives that share similar functional properties. We have developed the FunFam-MARC (Multidomain ARchitecture-based Clustering) protocol, which uses multi-domain architectures of protein kinases and specificity-determining residues for functional family classification. FunFam-MARC predicts 2210 kinase functional families (KinFams), which have increased functional coherence, in terms of EC annotations, compared to the widely used KinBase classification. Our protocol provides a comprehensive classification for kinase sequences from >10,000 organisms. We associate human KinFams with diseases and drugs and identify 28 druggable human KinFams, i.e., enriched in clinically approved drugs. Since relatives in the same druggable KinFam tend to be structurally conserved, including the drug-binding site, these KinFams may be valuable for shortlisting therapeutic targets. Information on the human KinFams and associated 3D structures from AlphaFold2 is provided via our CATH FTP website and Zenodo. This gives the domain structure representative of each KinFam together with information on any drug compounds available. For 32% of the KinFams, we provide information on highly conserved residue sites that may be associated with specificity.
    Adeyelu T, Bordin N, Waman VP, Sadlej M, Sillitoe I, Moya-Garcia AA, Orengo CA. KinFams: De-Novo Classification of Protein Kinases Using CATH Functional Units. Biomolecules. 2023; 13(2):277. https://doi.org/10.3390/biom1302027

    CATH 2024: CATH-AlphaFlow Doubles the Number of Structures in CATH and Reveals Nearly 200 New Folds

    CATH (https://www.cathdb.info) classifies domain structures from experimental protein structures in the PDB and predicted structures in the AlphaFold Database (AFDB). To cope with the scale of the predicted data, a new Nextflow workflow (CATH-AlphaFlow) has been developed to classify high-quality domains into CATH superfamilies and identify novel fold groups and superfamilies. CATH-AlphaFlow uses a novel state-of-the-art structure-based domain boundary prediction method (ChainSaw) for identifying domains in multi-domain proteins. We applied CATH-AlphaFlow to process PDB structures not classified in CATH and AFDB structures from 21 model organisms, expanding CATH by over 100%. Domains not classified in existing CATH superfamilies or fold groups were used to seed novel folds, giving 253 new folds from PDB structures (September 2023 release) and 96 from AFDB structures of proteomes of 21 model organisms. Where possible, functional annotations were obtained using (i) predictions from publicly available methods and (ii) annotations from structural relatives in AFDB/UniProt50. We also predicted functional sites and highly conserved residues. Some folds are associated with important functions such as photosynthetic acclimation (in flowering plants), iron permease activity (in fungi) and post-natal spermatogenesis (in mice). CATH-AlphaFlow will allow us to identify many more CATH relatives in the AFDB, further characterising the protein structure landscape.

    CATHe: Detection of remote homologues for CATH superfamilies using embeddings from protein language models

    MOTIVATION: CATH is a protein domain classification resource that exploits an automated workflow of structure and sequence comparison alongside expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorithms for detecting remote homologues missed by state-of-the-art HMM-based approaches. The method developed (CATHe) combines a neural network with sequence representations obtained from protein language models. It was assessed using a dataset of remote homologues having less than 20% sequence identity to any domain in the training set. RESULTS: The CATHe models trained on the 1773 largest and the 50 largest CATH superfamilies had accuracies of 85.6 ± 0.4% and 98.2 ± 0.3%, respectively. As a further test of the power of CATHe to detect more remote homologues missed by HMMs derived from CATH domains, we used a dataset consisting of protein domains that had annotations in Pfam, but not in CATH. By using highly reliable CATHe predictions (expected error rate <0.5%), we were able to provide CATH annotations for 4.62 million Pfam domains. For a subset of these domains from Homo sapiens, we structurally validated 90.86% of the predictions by comparing their corresponding AlphaFold 2 structures with structures from the CATH superfamilies to which they were assigned. AVAILABILITY AND IMPLEMENTATION: The code for the developed models can be found at https://github.com/vam-sin/CATHe. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

    Exploring structural diversity across the protein universe with The Encyclopedia of Domains

    The AlphaFold Protein Structure Database (AFDB) contains more than 214 million predicted protein structures composed of domains, which are independently folding units found in multiple structural and functional contexts. Identifying domains can enable many functional and evolutionary analyses but has remained challenging because of the sheer scale of the data. Using deep learning methods, we have detected and classified every domain in the AFDB, producing The Encyclopedia of Domains. We detected nearly 365 million domains, over 100 million more than can be found by sequence methods, covering more than 1 million taxa. Reassuringly, 77% of the nonredundant domains are similar to known superfamilies, greatly expanding representation of their domain space. We uncovered more than 10,000 new structural interactions between superfamilies and thousands of new folds across the fold space continuum.

    CATH v4.4: major expansion of CATH by experimental and predicted structural data

    CATH (https://www.cathdb.info) is a structural classification database that assigns domains to the structures in the Protein Data Bank (PDB) and AlphaFold Protein Structure Database (AFDB) and adds layers of biological information, including homology and functional annotation. This article covers developments in the CATH classification since 2021. We report the significant expansion of structural information (180-fold) for CATH superfamilies through classification of PDB domains and predicted domain structures from the Encyclopedia of Domains (TED) resource. TED provides information on predicted domains in AFDB. CATH v4.4 represents an expansion of ∼64 844 experimentally determined domain structures from PDB. We also present a mapping of ∼90 million predicted domains from TED to CATH superfamilies. New PDB and TED data increase the number of superfamilies from 5841 to 6573, folds from 1349 to 2078 and architectures from 41 to 77. TED data comprise predicted structures, so these new folds and architectures remain hypothetical until experimentally confirmed. CATH also classifies domains into functional families (FunFams) within a superfamily. We have updated sequences in FunFams by scanning FunFam-HMMs against UniProt release 2024_02, giving a 276% increase in FunFam coverage. The mapping of TED structural domains has resulted in a 4-fold increase in FunFams with structural information.

    AlphaFold2 reveals commonalities and novelties in protein structure space for 21 model organisms

    Deep-learning (DL) methods like DeepMind's AlphaFold2 (AF2) have led to substantial improvements in protein structure prediction. We analyse confident AF2 models from 21 model organisms using a new classification protocol (CATH-Assign) which exploits novel DL methods for structural comparison and classification. Of ~370,000 confident models, 92% can be assigned to 3253 superfamilies in our CATH domain superfamily classification. The remaining models cluster into 2367 putative novel superfamilies. Detailed manual analysis of 618 of these, each having at least one human relative, reveals extremely remote homologies and further unusual features. Only 25 novel superfamilies could be confirmed. Although most models map to existing superfamilies, AF2 domains expand CATH by 67%, increase the number of unique 'global' folds by 36% and will provide valuable insights on structure-function relationships. CATH-Assign will harness the huge expansion in structural data provided by DeepMind to rationalise evolutionary changes driving functional divergence.

    Mycobacterial genomics and structural bioinformatics: opportunities and challenges in drug discovery.

    Of the more than 190 distinct species of the genus Mycobacterium, many are economically and clinically important pathogens of humans or animals. Among those mycobacteria that infect humans, three species, namely Mycobacterium tuberculosis (causative agent of tuberculosis), Mycobacterium leprae (causative agent of leprosy) and Mycobacterium abscessus (causative agent of chronic pulmonary infections), pose a concern to global public health. Although antibiotics have been successfully developed to combat each of these, the emergence of drug-resistant strains is an increasing challenge for treatment and drug discovery. Here we describe the impact of the rapid expansion of genome sequencing and genome/pathway annotations that have greatly improved the progress of structure-guided drug discovery. We focus on the applications of comparative genomics, metabolomics, evolutionary bioinformatics and structural proteomics to identify potential drug targets. The opportunities and challenges for the design of drugs for M. tuberculosis, M. leprae and M. abscessus to combat resistance are discussed.