72 research outputs found

    Predicting Protein Function with Hierarchical Phylogenetic Profiles: The Gene3D Phylo-Tuner Method Applied to Eukaryotic Genomes

    “Phylogenetic profiling” is based on the hypothesis that during evolution functionally or physically interacting genes are likely to be inherited or eliminated in a codependent manner. Creating presence–absence profiles of orthologous genes is now a common and powerful way of identifying functionally associated genes. In this approach, correctly determining orthology, as a means of identifying functional equivalence between two genes, is a critical and nontrivial step and largely explains why previous work in this area has mainly focused on using presence–absence profiles in prokaryotic species. Here, we demonstrate that eukaryotic genomes have a high proportion of multigene families whose phylogenetic profile distributions are poor in presence–absence information content. This feature makes them prone to orthology mis-assignment and unsuited to standard profile-based prediction methods. Using CATH structural domain assignments from the Gene3D database for 13 complete eukaryotic genomes, we have developed a novel modification of the phylogenetic profiling method that uses genome copy number of each domain superfamily to predict functional relationships. In our approach, superfamilies are subclustered at ten levels of sequence identity, from 30% to 100%, and phylogenetic profiles are built at each level. All the profiles are compared using normalised Euclidean distances to identify those with correlated changes in their domain copy number. We demonstrate that two protein families will “auto-tune” with strong co-evolutionary signals when their profiles are compared at the similarity levels that capture their functional relationship. Our method finds functional relationships that are not detectable by the conventional presence–absence profile comparisons, and it does not require any fixed a priori criteria to define orthologous genes.
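
The comparison step described in this abstract can be illustrated with a minimal sketch. It assumes each profile is a list of domain copy numbers, one per genome, and that "normalised" means scaling each profile to zero mean and unit variance before taking a length-averaged Euclidean distance. The abstract does not specify the normalisation, so that choice, along with the function names `normalise`, `profile_distance` and `best_level_pair`, is hypothetical and not the authors' implementation.

```python
import math


def normalise(profile):
    """Scale a copy-number profile to zero mean and unit variance.

    Assumed normalisation; a flat profile (zero variance) maps to all zeros.
    """
    n = len(profile)
    mean = sum(profile) / n
    var = sum((x - mean) ** 2 for x in profile) / n
    if var == 0:
        return [0.0] * n
    sd = math.sqrt(var)
    return [(x - mean) / sd for x in profile]


def profile_distance(a, b):
    """Root-mean-square Euclidean distance between two normalised profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    na, nb = normalise(a), normalise(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(na, nb)) / len(a))


def best_level_pair(profiles_a, profiles_b):
    """Find the pair of identity levels at which two families agree best.

    Each argument maps a sequence-identity level (e.g. 30, 60) to the
    copy-number profile built at that level. Returns a tuple
    (distance, level_a, level_b) with the smallest distance.
    """
    return min(
        (profile_distance(pa, pb), la, lb)
        for la, pa in profiles_a.items()
        for lb, pb in profiles_b.items()
    )


# Toy data: copy numbers across four genomes at two identity levels.
family_a = {30: [1, 1, 1, 1], 60: [1, 2, 3, 4]}
family_b = {30: [5, 5, 4, 5], 60: [2, 4, 6, 8]}
distance, level_a, level_b = best_level_pair(family_a, family_b)
print(level_a, level_b, round(distance, 6))
```

In this toy example the two families look unrelated at the 30% level, where one profile is flat, but their copy numbers rise together at the 60% level. Searching over level pairs is one simple way to mimic the "auto-tuning" idea of comparing profiles at the similarity levels where the signal is strongest.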

    Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space

    We present an analysis of 203 completed genomes in the Gene3D resource (including 17 eukaryotes), which demonstrates that the number of protein families is continually expanding over time and that singleton-sequences appear to be an intrinsic part of the genomes. A significant proportion of the proteomes can be assigned to fewer than 6000 well-characterized domain families with the remaining domain-like regions belonging to a much larger number of small uncharacterized families that are largely species specific. Our comprehensive domain annotation of 203 genomes enables us to provide more accurate estimates of the number of multi-domain proteins found in the three kingdoms of life than previous calculations. We find that 67% of eukaryotic sequences are multi-domain compared with 56% of sequences in prokaryotes. By measuring the domain coverage of genome sequences, we show that the structural genomics initiatives should aim to provide structures for less than a thousand structurally uncharacterized Pfam families to achieve reasonable structural annotation of the genomes. However, in large families, additional structures should be determined as these would reveal more about the evolution of the family and enable a greater understanding of how function evolves.

    Gene3D: modelling protein structure, function and evolution

    The Gene3D release 4 database and web portal provide a combined structural, functional and evolutionary view of the protein world. It is focussed on providing structural annotation for protein sequences without structural representatives, including the complete proteome sets of over 240 different species. The protein sequences have also been clustered into whole-chain families so as to aid functional prediction. The structural annotation is generated using HMM models based on the CATH domain families; CATH is a repository for manually deduced protein domains. Amongst the changes from the last publication are: the addition of over 100 genomes and the UniProt sequence database, domain data from Pfam, metabolic pathway and functional data from COGs, KEGG and GO, and protein–protein interaction data from MINT and BIND. The website has been rebuilt to allow more sophisticated querying and the data returned is presented in a clearer format with greater functionality. Furthermore, all data can be downloaded in a simple XML format, allowing users to carry out complex investigations at their own computers.

    New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures.

    CATH version 3.5 (Class, Architecture, Topology, Homology, available at http://www.cathdb.info/) contains 173 536 domains, 2626 homologous superfamilies and 1313 fold groups. When focusing on structural genomics (SG) structures, we observe that the number of new folds for CATH v3.5 is slightly less than for previous releases, and this observation suggests that we may now know the majority of folds that are easily accessible to structure determination. We have improved the accuracy of our functional family (FunFams) sub-classification method and the CATH sequence domain search facility has been extended to provide FunFam annotations for each domain. The CATH website has been redesigned. We have improved the display of functional data and of conserved sequence features associated with FunFams within each CATH superfamily.

    The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

    We report the latest release (version 3.0) of the CATH protein domain database. There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps, which have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS) which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.

    Microreact: visualizing and sharing data for genomic epidemiology and phylogeography

    Visualization is frequently used to aid our interpretation of complex datasets. Within microbial genomics, visualizing the relationships between multiple genomes as a tree provides a framework onto which associated data (geographical, temporal, phenotypic and epidemiological) are added to generate hypotheses and to explore the dynamics of the system under investigation. Selected static images are then used within publications to highlight the key findings to a wider audience. However, these images are a very inadequate way of exploring and interpreting the richness of the data. There is, therefore, a need for flexible, interactive software that presents the population genomic outputs and associated data in a user-friendly manner for a wide range of end users, from trained bioinformaticians to front-line epidemiologists and health workers. Here, we present Microreact, a web application for the easy visualization of datasets consisting of any combination of trees, geographical, temporal and associated metadata. Data files can be uploaded to Microreact directly via the web browser or by linking to their location (e.g. from Google Drive/Dropbox or via API), and an integrated visualization via trees, maps, timelines and tables provides interactive querying of the data. The visualization can be shared as a permanent web link among collaborators, or embedded within publications to enable readers to explore and download the data. Microreact can act as an end point for any tool or bioinformatic pipeline that ultimately generates a tree, and provides a simple, yet powerful, visualization method that will aid research and discovery and the open sharing of datasets.

    Unlocking the Potential of Genomic Data to Inform Typhoid Fever Control Policy: Supportive Resources for Genomic Data Generation, Analysis, and Visualization

    The global response to the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic demonstrated the value of timely and open sharing of genomic data with standardized metadata to facilitate monitoring of the emergence and spread of new variants. Here, we make the case for the value of Salmonella Typhi (S. Typhi) genomic data and demonstrate the utility of freely available platforms and services that support the generation, analysis, and visualization of S. Typhi genomic data on the African continent and more broadly by introducing the Africa Centres for Disease Control and Prevention's Pathogen Genomics Initiative, SEQAFRICA, Typhi Pathogenwatch, TyphiNET, and the Global Typhoid Genomics Consortium.

    Europe-wide expansion and eradication of multidrug-resistant Neisseria gonorrhoeae lineages: a genomic surveillance study

    Background: Genomic surveillance using quality-assured whole-genome sequencing (WGS) together with epidemiological and antimicrobial resistance (AMR) data is essential to characterise the circulating Neisseria gonorrhoeae lineages and their association to patient groups (defined by demographic and epidemiological factors). In 2013, the European gonococcal population was characterised genomically for the first time. We describe the European gonococcal population in 2018 and identify emerging or vanishing lineages associated with AMR and epidemiological characteristics of patients, to elucidate recent changes in AMR and gonorrhoea epidemiology in Europe. Methods: We did WGS on 2375 gonococcal isolates from 2018 (mainly Sept 1-Nov 30) in 26 EU and EEA countries. Molecular typing and AMR determinants were extracted from quality-checked genomic data. Association analyses identified links between genomic lineages, AMR, and epidemiological data. Findings: Azithromycin-resistant N gonorrhoeae (8·0% [191/2375] in 2018) is rising in Europe due to the introduction or emergence and subsequent expansion of a novel N gonorrhoeae multi-antigen sequence typing (NG-MAST) genogroup, G12302 (132 [5·6%] of 2375; N gonorrhoeae sequence typing for antimicrobial resistance [NG-STAR] clonal complex [CC]168/63), carrying a mosaic mtrR promoter and mtrD sequence and found in 24 countries in 2018. CC63 was associated with pharyngeal infections in men who have sex with men. Susceptibility to ceftriaxone and cefixime is increasing, as the resistance-associated lineage, NG-MAST G1407 (51 [2·1%] of 2375), is progressively vanishing since 2009-10. Interpretation: Enhanced gonococcal AMR surveillance is imperative worldwide. WGS, linked to epidemiological and AMR data, is essential to elucidate the dynamics in gonorrhoea epidemiology and gonococcal populations as well as to predict AMR. 
When feasible, WGS should supplement the national and international AMR surveillance programmes to elucidate AMR changes over time. In the EU and EEA, increasing low-level azithromycin resistance could threaten the recommended ceftriaxone-azithromycin dual therapy, and an evidence-based clinical azithromycin resistance breakpoint is needed. Nevertheless, increasing ceftriaxone susceptibility, declining cefixime resistance, and absence of known resistance mutations for new treatments (zoliflodacin, gepotidacin) are promising. This study was supported by the European Centre for Disease Prevention and Control, the Centre for Genomic Pathogen Surveillance, the Li Ka Shing Foundation (Big Data Institute, University of Oxford), the Wellcome Genome Campus, the Foundation for Medical Research at Örebro University Hospital, and grants from Wellcome (098051 and 099202). LSB was funded by Conselleria de Sanitat Universal i Salut Pública, Generalitat Valenciana (Plan GenT CDEI-06/20-B), Valencia, Spain, and Ministry of Science, Innovation and Universities (PID2020–120113RA-I00), Spain, at the time of analysing and writing this manuscript.

    An integrated approach to the interpretation of Single Amino Acid Polymorphisms within the framework of CATH and Gene3D

    Background: The phenotypic effects of sequence variations in protein-coding regions come about primarily via their effects on the resulting structures, for example by disrupting active sites or affecting structural stability. In order better to understand the mechanisms behind known mutant phenotypes, and predict the effects of novel variations, biologists need tools to gauge the impacts of DNA mutations in terms of their structural manifestation. Although many mutations occur within domains whose structure has been solved, many more occur within genes whose protein products have not been structurally characterized. Results: Here we present 3DSim (3D Structural Implication of Mutations), a database and web application facilitating the localization and visualization of single amino acid polymorphisms (SAAPs) mapped to protein structures even where the structure of the protein of interest is unknown. The server displays information on 6514 point mutations, 4865 of them known to be associated with disease. These polymorphisms are drawn from SAAPdb, which aggregates data from various sources including dbSNP and several pathogenic mutation databases. While the SAAPdb interface displays mutations on known structures, 3DSim projects mutations onto known sequence domains in Gene3D. This resource contains sequences annotated with domains predicted to belong to structural families in the CATH database. Mappings between domain sequences in Gene3D and known structures in CATH are obtained using a MUSCLE alignment. 1210 three-dimensional structures corresponding to CATH structural domains are currently included in 3DSim; these domains are distributed across 396 CATH superfamilies, and provide a comprehensive overview of the distribution of mutations in structural space. Conclusion: The server is publicly available at http://3DSim.bioinfo.cnio.es/.
In addition, the database containing the mapping between SAAPdb, Gene3D and CATH is available on request and most of the functionality is available through programmatic web service access.