72 research outputs found

Predicting Protein Function with Hierarchical Phylogenetic Profiles: The Gene3D Phylo-Tuner Method Applied to Eukaryotic Genomes

Author: Alastair Grant
Burkhard Rost
Christine A Orengo
Corin Yeats
Juan A. G Ranea
Publication venue: Public Library of Science
Publication date: 01/01/2007

“Phylogenetic profiling” is based on the hypothesis that during evolution functionally or physically interacting genes are likely to be inherited or eliminated in a codependent manner. Creating presence–absence profiles of orthologous genes is now a common and powerful way of identifying functionally associated genes. In this approach, correctly determining orthology, as a means of identifying functional equivalence between two genes, is a critical and nontrivial step and largely explains why previous work in this area has mainly focused on using presence–absence profiles in prokaryotic species. Here, we demonstrate that eukaryotic genomes have a high proportion of multigene families whose phylogenetic profile distributions are poor in presence–absence information content. This feature makes them prone to orthology mis-assignment and unsuited to standard profile-based prediction methods. Using CATH structural domain assignments from the Gene3D database for 13 complete eukaryotic genomes, we have developed a novel modification of the phylogenetic profiling method that uses genome copy number of each domain superfamily to predict functional relationships. In our approach, superfamilies are subclustered at ten levels of sequence identity, from 30% to 100%, and phylogenetic profiles built at each level. All the profiles are compared using normalised Euclidean distances to identify those with correlated changes in their domain copy number. We demonstrate that two protein families will “auto-tune” with strong co-evolutionary signals when their profiles are compared at the similarity levels that capture their functional relationship. Our method finds functional relationships that are not detectable by the conventional presence–absence profile comparisons, and it does not require a priori any fixed criteria to define orthologous genes.
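
The profile comparison described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the Euclidean distance is normalised, so the z-score scaling below is an assumption, and the function names and the dictionary layout (identity level mapped to a copy-number profile across genomes) are hypothetical.

```python
import math


def zscore(profile):
    """Scale a copy-number profile to zero mean and unit variance.

    Assumed normalisation; returns None for a flat profile, which
    carries no information about correlated copy-number changes.
    """
    n = len(profile)
    mean = sum(profile) / n
    variance = sum((x - mean) ** 2 for x in profile) / n
    if variance == 0:
        return None
    sd = math.sqrt(variance)
    return [(x - mean) / sd for x in profile]


def normalised_euclidean(a, b):
    """Euclidean distance between two z-scored profiles of equal length."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    za, zb = zscore(a), zscore(b)
    if za is None or zb is None:
        return math.inf  # a flat profile cannot be compared meaningfully
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(za, zb)))


def best_level_pair(profiles_a, profiles_b):
    """Find the pair of identity levels with the smallest distance.

    profiles_a and profiles_b map an identity level (e.g. 30, 60, 100)
    to the copy number of that subcluster in each genome.
    Returns (distance, level_a, level_b).
    """
    best = None
    for level_a, pa in profiles_a.items():
        for level_b, pb in profiles_b.items():
            d = normalised_euclidean(pa, pb)
            if best is None or d < best[0]:
                best = (d, level_a, level_b)
    return best


# Toy data for four genomes; the numbers are invented for illustration.
family_a = {30: [5, 5, 5, 5], 60: [1, 2, 3, 4]}
family_b = {30: [4, 1, 3, 2], 60: [2, 4, 6, 8]}
result = best_level_pair(family_a, family_b)
```

In this toy case the two families match best when both are compared at the 60% level, where their copy numbers rise and fall together. That mirrors the "auto-tune" idea in the abstract only loosely.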

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space

Author: Lee David
Maibaum Michael
Marsden Russell L.
Orengo Christine A.
Yeats Corin
Publication venue: Oxford University Press
Publication date: 15/02/2006

We present an analysis of 203 completed genomes in the Gene3D resource (including 17 eukaryotes), which demonstrates that the number of protein families is continually expanding over time and that singleton sequences appear to be an intrinsic part of the genomes. A significant proportion of the proteomes can be assigned to fewer than 6000 well-characterized domain families with the remaining domain-like regions belonging to a much larger number of small uncharacterized families that are largely species specific. Our comprehensive domain annotation of 203 genomes enables us to provide more accurate estimates of the number of multi-domain proteins found in the three kingdoms of life than previous calculations. We find that 67% of eukaryotic sequences are multi-domain compared with 56% of sequences in prokaryotes. By measuring the domain coverage of genome sequences, we show that the structural genomics initiatives should aim to provide structures for less than a thousand structurally uncharacterized Pfam families to achieve reasonable structural annotation of the genomes. However, in large families, additional structures should be determined as these would reveal more about the evolution of the family and enable a greater understanding of how function evolves.

Crossref

PubMed Central

Gene3D: modelling protein structure, function and evolution

Author: Addou Sarah
Dibley Mark
Lee David
Maibaum Michael
Marsden Russell
Orengo Christine A.
Yeats Corin
Publication venue: Oxford University Press
Publication date: 28/12/2005

The Gene3D release 4 database and web portal provide a combined structural, functional and evolutionary view of the protein world. It is focussed on providing structural annotation for protein sequences without structural representatives, including the complete proteome sets of over 240 different species. The protein sequences have also been clustered into whole-chain families so as to aid functional prediction. The structural annotation is generated using HMM models based on the CATH domain families; CATH is a repository for manually deduced protein domains. Amongst the changes from the last publication are: the addition of over 100 genomes and the UniProt sequence database, domain data from Pfam, metabolic pathway and functional data from COGs, KEGG and GO, and protein–protein interaction data from MINT and BIND. The website has been rebuilt to allow more sophisticated querying and the data returned is presented in a clearer format with greater functionality. Furthermore, all data can be downloaded in a simple XML format, allowing users to carry out complex investigations at their own computers.

CiteSeerX

Crossref

PubMed Central

New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures.

Author: Cuff Alison L
Dawson Natalie L
Dessailly Benoit H
Furnham Nicholas
Lee David
Lees Jonathan G
Lewis Tony E
Orengo Christine A
Rentzsch Robert
Sillitoe Ian
Studer Romain A
Thornton Janet M
Yeats Corin
Publication venue: Oxford University Press (OUP)
Publication date: 29/11/2012

CATH version 3.5 (Class, Architecture, Topology, Homology, available at http://www.cathdb.info/) contains 173 536 domains, 2626 homologous superfamilies and 1313 fold groups. When focusing on structural genomics (SG) structures, we observe that the number of new folds for CATH v3.5 is slightly less than for previous releases, and this observation suggests that we may now know the majority of folds that are easily accessible to structure determination. We have improved the accuracy of our functional family (FunFams) sub-classification method and the CATH sequence domain search facility has been extended to provide FunFam annotations for each domain. The CATH website has been redesigned. We have improved the display of functional data and of conserved sequence features associated with FunFams within each CATH superfamily.

Crossref

LSHTM Research Online

PubMed Central

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Author: Addou Sarah
Cuff Alison
Dallman Tim
Dibley Mark
Greene Lesley H.
Lewis Tony E.
Nambudiry Rekha
Orengo Christine A.
Pearl Frances
Redfern Oliver
Reid Adam
Sillitoe Ian
Thornton Janet M.
Yeats Corin
Publication venue: Oxford University Press
Publication date: 29/11/2006

We report the latest release (version 3.0) of the CATH protein domain database. There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS) which now contains multiple structural alignments, consensus information and functional annotations for 1459 well-populated superfamilies in CATH. CATH is directly linked to the Gene3D database which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.

Sussex Research Online

Microreact: visualizing and sharing data for genomic epidemiology and phylogeography

Author: Aanensen David M.
Abudahab Khalil
Argimón Silvia
Bhai Jyothish
Fedosejev Artemij
Feil Edward J.
Glasner Corinna
Goater Richard J.
Grundmann Hajo
Holden Matthew Thomas Geoffrey
Spratt Brian G.
Yeats Corin A.
Publication venue: Microbiology Society
Publication date: 19/10/2016

Visualization is frequently used to aid our interpretation of complex datasets. Within microbial genomics, visualizing the relationships between multiple genomes as a tree provides a framework onto which associated data (geographical, temporal, phenotypic and epidemiological) are added to generate hypotheses and to explore the dynamics of the system under investigation. Selected static images are then used within publications to highlight the key findings to a wider audience. However, these images are a very inadequate way of exploring and interpreting the richness of the data. There is, therefore, a need for flexible, interactive software that presents the population genomic outputs and associated data in a user-friendly manner for a wide range of end users, from trained bioinformaticians to front-line epidemiologists and health workers. Here, we present Microreact, a web application for the easy visualization of datasets consisting of any combination of trees, geographical, temporal and associated metadata. Data files can be uploaded to Microreact directly via the web browser or by linking to their location (e.g. from Google Drive/Dropbox or via API), and an integrated visualization via trees, maps, timelines and tables provides interactive querying of the data. The visualization can be shared as a permanent web link among collaborators, or embedded within publications to enable readers to explore and download the data. Microreact can act as an end point for any tool or bioinformatic pipeline that ultimately generates a tree, and provides a simple, yet powerful, visualization method that will aid research and discovery and the open sharing of datasets.

Crossref

PubMed Central

Oxford University Research Archive

Spiral - Imperial College Digital Repository

University of St. Andrews - Pure

St Andrews Research Repository

A global resource for genomic predictions of antimicrobial resistance and surveillance of Salmonella Typhi at pathogenwatch.

Author: Aanensen David M
Abudahab Khalil
Argimón Silvia
Baker Stephen
Dougan Gordon
Dyson Zoe A
Goater Richard J
Holt Kathryn E
Keane Jacqueline A
Marks Florian
Nair Satheesh
Page Andrew J
Park Se Eun
Sánchez-Busó Leonor
Taylor Benjamin
Underwood Anthony
Wong Vanessa K
Yeats Corin A
Publication venue: Nature Communications
Publication date: 19/06/2021

As whole-genome sequencing capacity becomes increasingly decentralized, there is a growing opportunity for collaboration and the sharing of surveillance data within and between countries to inform typhoid control policies. This vision requires free, community-driven tools that facilitate access to genomic data for public health on a global scale. Here we present the Pathogenwatch scheme for Salmonella enterica serovar Typhi (S. Typhi), a web application enabling the rapid identification of genomic markers of antimicrobial resistance (AMR) and contextualization with public genomic data. We show that the clustering of S. Typhi genomes in Pathogenwatch is comparable to established bioinformatics methods, and that genomic predictions of AMR are highly concordant with phenotypic susceptibility data. We demonstrate the public health utility of Pathogenwatch with examples selected from >4,300 public genomes available in the application. Pathogenwatch provides an intuitive entry point to monitor the emergence and spread of S. Typhi high-risk clones.

Apollo (Cambridge)

Unlocking the Potential of Genomic Data to Inform Typhoid Fever Control Policy: Supportive Resources for Genomic Data Generation, Analysis, and Visualization

Author: Aanensen David
Argimón Silvia
Baker Stephen
Carey Megan E
Cerdeira Louise
Dyson Zoe A
Holt Kathryn E
Mboowa Gerald
Okeke Iruka N
Smith Anthony M
Tessema Sofonias K
Yeats Corin
Publication venue: Oxford University Press (OUP)
Publication date: 02/06/2023

The global response to the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic demonstrated the value of timely and open sharing of genomic data with standardized metadata to facilitate monitoring of the emergence and spread of new variants. Here, we make the case for the value of Salmonella Typhi (S. Typhi) genomic data and demonstrate the utility of freely available platforms and services that support the generation, analysis, and visualization of S. Typhi genomic data on the African continent and more broadly by introducing the Africa Centres for Disease Control and Prevention's Pathogen Genomics Initiative, SEQAFRICA, Typhi Pathogenwatch, TyphiNET, and the Global Typhoid Genomics Consortium.

LSTM Online Archive

LSHTM Research Online

Europe-wide expansion and eradication of multidrug-resistant Neisseria gonorrhoeae lineages: a genomic surveillance study

Author: Aanensen David M
Abad Raquel
Abudahab Khalil
Bluemel Benjamin
Centre for Genomic Pathogen Surveillance
Cole Michelle J
Day Michaela
Diaz Franco Asuncion
Euro-GASP study group
Golparian Daniel
Jacobsson Susanne
Sajedi Noshin
Spiteri Gianfranco
Sánchez-Busó Leonor
Underwood Anthony
Unemo Magnus
Vazquez-Moreno Julio Alberto
Yeats Corin A
Publication venue: Elsevier
Publication date: 01/06/2022

Background: Genomic surveillance using quality-assured whole-genome sequencing (WGS) together with epidemiological and antimicrobial resistance (AMR) data is essential to characterise the circulating Neisseria gonorrhoeae lineages and their association to patient groups (defined by demographic and epidemiological factors). In 2013, the European gonococcal population was characterised genomically for the first time. We describe the European gonococcal population in 2018 and identify emerging or vanishing lineages associated with AMR and epidemiological characteristics of patients, to elucidate recent changes in AMR and gonorrhoea epidemiology in Europe. Methods: We did WGS on 2375 gonococcal isolates from 2018 (mainly Sept 1-Nov 30) in 26 EU and EEA countries. Molecular typing and AMR determinants were extracted from quality-checked genomic data. Association analyses identified links between genomic lineages, AMR, and epidemiological data. Findings: Azithromycin-resistant N gonorrhoeae (8·0% [191/2375] in 2018) is rising in Europe due to the introduction or emergence and subsequent expansion of a novel N gonorrhoeae multi-antigen sequence typing (NG-MAST) genogroup, G12302 (132 [5·6%] of 2375; N gonorrhoeae sequence typing for antimicrobial resistance [NG-STAR] clonal complex [CC]168/63), carrying a mosaic mtrR promoter and mtrD sequence and found in 24 countries in 2018. CC63 was associated with pharyngeal infections in men who have sex with men. Susceptibility to ceftriaxone and cefixime is increasing, as the resistance-associated lineage, NG-MAST G1407 (51 [2·1%] of 2375), is progressively vanishing since 2009-10. Interpretation: Enhanced gonococcal AMR surveillance is imperative worldwide. WGS, linked to epidemiological and AMR data, is essential to elucidate the dynamics in gonorrhoea epidemiology and gonococcal populations as well as to predict AMR. 
When feasible, WGS should supplement the national and international AMR surveillance programmes to elucidate AMR changes over time. In the EU and EEA, increasing low-level azithromycin resistance could threaten the recommended ceftriaxone-azithromycin dual therapy, and an evidence-based clinical azithromycin resistance breakpoint is needed. Nevertheless, increasing ceftriaxone susceptibility, declining cefixime resistance, and absence of known resistance mutations for new treatments (zoliflodacin, gepotidacin) are promising. This study was supported by the European Centre for Disease Prevention and Control, the Centre for Genomic Pathogen Surveillance, the Li Ka Shing Foundation (Big Data Institute, University of Oxford), the Wellcome Genome Campus, the Foundation for Medical Research at Örebro University Hospital, and grants from Wellcome (098051 and 099202). LSB was funded by Conselleria de Sanitat Universal i Salut Pública, Generalitat Valenciana (Plan GenT CDEI-06/20-B), Valencia, Spain, and Ministry of Science, Innovation and Universities (PID2020–120113RA-I00), Spain, at the time of analysing and writing this manuscript.

REPISALUD

An integrated approach to the interpretation of Single Amino Acid Polymorphisms within the framework of CATH and Gene3D

Author: A Petitjean
A Torkamani
A Uzun
Alfonso Valencia
Andrew B Clegg
Andrew CR Martin
Anja Baresic
BL Loeys
C Ferrer-Costa
C Yeats
Christine A Orengo
CJ Kwok
Consortium H
Corin Yeats
EWW Sayers
FS Collins
G Kemball-Cook
H Piirilä
HM Berman
JM Hurst
JMG Izarzugaza
Jose MG Izarzugaza
LH Greene
Lisa EM McMillan
M Claustres
M Mort
M Tuchman
P Schattner
P Taillon-Miller
P Yue
PC Ng
R Wroe
RA Laskowski
RC Edgar
RR Gabdoulline
SEA Leigh
SF Altschul
ST Sherry
T Rattei
TJ Hubbard
U Consortium
Y Bromberg
Z Wang
ZE Sauna
Publication venue: BioMed Central
Publication date: 22/09/2008

Background: The phenotypic effects of sequence variations in protein-coding regions come about primarily via their effects on the resulting structures, for example by disrupting active sites or affecting structural stability. To better understand the mechanisms behind known mutant phenotypes, and predict the effects of novel variations, biologists need tools to gauge the impacts of DNA mutations in terms of their structural manifestation. Although many mutations occur within domains whose structure has been solved, many more occur within genes whose protein products have not been structurally characterized. Results: Here we present 3DSim (3D Structural Implication of Mutations), a database and web application facilitating the localization and visualization of single amino acid polymorphisms (SAAPs) mapped to protein structures even where the structure of the protein of interest is unknown. The server displays information on 6514 point mutations, 4865 of them known to be associated with disease. These polymorphisms are drawn from SAAPdb, which aggregates data from various sources including dbSNP and several pathogenic mutation databases. While the SAAPdb interface displays mutations on known structures, 3DSim projects mutations onto known sequence domains in Gene3D. This resource contains sequences annotated with domains predicted to belong to structural families in the CATH database. Mappings between domain sequences in Gene3D and known structures in CATH are obtained using a MUSCLE alignment. 1210 three-dimensional structures corresponding to CATH structural domains are currently included in 3DSim; these domains are distributed across 396 CATH superfamilies, and provide a comprehensive overview of the distribution of mutations in structural space. Conclusion: The server is publicly available at http://3DSim.bioinfo.cnio.es/.
In addition, the database containing the mapping between SAAPdb, Gene3D and CATH is available on request and most of the functionality is available through programmatic web service access.

Crossref

Springer - Publisher Connector

PubMed Central

Full-text Institutional Repository of the Ruđer Bošković Institute

Spiral - Imperial College Digital Repository

Enlighten