91 research outputs found

New Knowledge from Old: In silico discovery of novel protein domains in Streptomyces coelicolor

Author: Bateman Alex
Bentley Stephen
Yeats Corin
Publication venue: BioMed Central
Publication date: 01/01/2003

BACKGROUND: Streptomyces coelicolor has long been considered a remarkable bacterium with a complex life-cycle, ubiquitous environmental distribution, linear chromosomes and plasmids, and a huge range of pharmaceutically useful secondary metabolites. Completion of the genome sequence demonstrated that this diversity carried through to the genetic level, with over 7000 genes identified. We sought to expand our understanding of this organism at the molecular level through identification and annotation of novel protein domains. Protein domains are the evolutionary conserved units from which proteins are formed. RESULTS: Two automated methods were employed to rapidly generate an optimised set of targets, which were subsequently analysed manually. A final set of 37 domains or structural repeats, represented 204 times in the genome, was developed. Using these families enabled us to correlate items of information from many different resources. Several immediately enhance our understanding both of S. coelicolor and also general bacterial molecular mechanisms, including cell wall biosynthesis regulation and streptomycete telomere maintenance. DISCUSSION: Delineation of protein domain families enables detailed analysis of protein function, as well as identification of likely regions or residues of particular interest. Hence this kind of prior approach can increase the rate of discovery in the laboratory. Furthermore we demonstrate that using this type of in silico method it is possible to fairly rapidly generate new biological information from previously uncorrelated data

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

UCL Discovery

Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space

Author: Lee David
Maibaum Michael
Marsden Russell L.
Orengo Christine A.
Yeats Corin
Publication venue: Oxford University Press
Publication date: 15/02/2006

We present an analysis of 203 completed genomes in the Gene3D resource (including 17 eukaryotes), which demonstrates that the number of protein families is continually expanding over time and that singleton-sequences appear to be an intrinsic part of the genomes. A significant proportion of the proteomes can be assigned to fewer than 6000 well-characterized domain families with the remaining domain-like regions belonging to a much larger number of small uncharacterized families that are largely species specific. Our comprehensive domain annotation of 203 genomes enables us to provide more accurate estimates of the number of multi-domain proteins found in the three kingdoms of life than previous calculations. We find that 67% of eukaryotic sequences are multi-domain compared with 56% of sequences in prokaryotes. By measuring the domain coverage of genome sequences, we show that the structural genomics initiatives should aim to provide structures for less than a thousand structurally uncharacterized Pfam families to achieve reasonable structural annotation of the genomes. However, in large families, additional structures should be determined as these would reveal more about the evolution of the family and enable a greater understanding of how function evolves

Crossref

PubMed Central

Predicting Protein Function with Hierarchical Phylogenetic Profiles: The Gene3D Phylo-Tuner Method Applied to Eukaryotic Genomes

Author: Alastair Grant
Burkhard Rost
Christine A Orengo
Corin Yeats
Juan A. G Ranea
Publication venue: Public Library of Science
Publication date: 01/01/2007

“Phylogenetic profiling” is based on the hypothesis that during evolution functionally or physically interacting genes are likely to be inherited or eliminated in a codependent manner. Creating presence–absence profiles of orthologous genes is now a common and powerful way of identifying functionally associated genes. In this approach, correctly determining orthology, as a means of identifying functional equivalence between two genes, is a critical and nontrivial step and largely explains why previous work in this area has mainly focused on using presence–absence profiles in prokaryotic species. Here, we demonstrate that eukaryotic genomes have a high proportion of multigene families whose phylogenetic profile distributions are poor in presence–absence information content. This feature makes them prone to orthology mis-assignment and unsuited to standard profile-based prediction methods. Using CATH structural domain assignments from the Gene3D database for 13 complete eukaryotic genomes, we have developed a novel modification of the phylogenetic profiling method that uses genome copy number of each domain superfamily to predict functional relationships. In our approach, superfamilies are subclustered at ten levels of sequence identity—from 30% to 100%—and phylogenetic profiles built at each level. All the profiles are compared using normalised Euclidean distances to identify those with correlated changes in their domain copy number. We demonstrate that two protein families will “auto-tune” with strong co-evolutionary signals when their profiles are compared at the similarity levels that capture their functional relationship. Our method finds functional relationships that are not detectable by the conventional presence–absence profile comparisons, and it does not require a priori any fixed criteria to define orthologous genes
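As a rough illustration of the profile comparison described in this abstract, the sketch below compares domain copy-number profiles with a normalised Euclidean distance. It is not the Phylo-Tuner implementation: the abstract does not say how profiles are normalised, so scaling each profile to unit length is an assumption, and the superfamily names and copy numbers are hypothetical.

```python
import math


def normalise(profile):
    """Scale a copy-number profile to unit Euclidean length (assumed normalisation)."""
    norm = math.sqrt(sum(x * x for x in profile))
    if norm == 0:
        raise ValueError("profile has no counts in any genome")
    return [x / norm for x in profile]


def profile_distance(a, b):
    """Euclidean distance between two unit-normalised copy-number profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    na, nb = normalise(a), normalise(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(na, nb)))


# Hypothetical copy numbers of three superfamilies across five genomes.
profiles = {
    "superfamily_A": [2, 4, 6, 8, 10],
    "superfamily_B": [1, 2, 3, 4, 5],  # copy number changes in step with A
    "superfamily_C": [9, 1, 7, 2, 1],  # unrelated pattern
}

dist_ab = profile_distance(profiles["superfamily_A"], profiles["superfamily_B"])
dist_ac = profile_distance(profiles["superfamily_A"], profiles["superfamily_C"])
```

In this toy data, superfamilies A and B vary together across genomes and so end up closer than A and C. The method in the abstract repeats such comparisons at ten sequence-identity levels (30% to 100%), which this sketch does not attempt.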

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Gene3D: modelling protein structure, function and evolution

Author: Addou Sarah
Dibley Mark
Lee David
Maibaum Michael
Marsden Russell
Orengo Christine A.
Yeats Corin
Publication venue: Oxford University Press
Publication date: 28/12/2005

The Gene3D release 4 database and web portal provide a combined structural, functional and evolutionary view of the protein world. It is focussed on providing structural annotation for protein sequences without structural representatives—including the complete proteome sets of over 240 different species. The protein sequences have also been clustered into whole-chain families so as to aid functional prediction. The structural annotation is generated using HMM models based on the CATH domain families; CATH is a repository for manually deduced protein domains. Amongst the changes from the last publication are: the addition of over 100 genomes and the UniProt sequence database, domain data from Pfam, metabolic pathway and functional data from COGs, KEGG and GO, and protein–protein interaction data from MINT and BIND. The website has been rebuilt to allow more sophisticated querying and the data returned is presented in a clearer format with greater functionality. Furthermore, all data can be downloaded in a simple XML format, allowing users to carry out complex investigations at their own computers

CiteSeerX

Crossref

PubMed Central

Gene3D: a domain-based resource for comparative genomics, functional annotation and protein network analysis

Author: Dessailly Benoit H
Lees Jonathan
Orengo Christine A.
Perkins James Richard
Rentzsch Robert
Sillitoe Ian
Yeats Corin
Publication venue: Oxford University Press
Publication date: 01/12/2011

Gene3D http://gene3d.biochem.ucl.ac.uk is a comprehensive database of protein domain assignments for sequences from the major sequence databases. Domains are directly mapped from structures in the CATH database or predicted using a library of representative profile HMMs derived from CATH superfamilies. As previously described, Gene3D integrates many other protein family and function databases. These facilitate complex associations of molecular function, structure and evolution. Gene3D now includes a domain functional family (FunFam) level below the homologous superfamily level assignments. Additions have also been made to the interaction data. More significantly, to help with the visualization and interpretation of multi-genome scale data sets, we have developed a new, revamped website. Searching has been simplified with more sophisticated filtering of results, along with new tools based on Cytoscape Web, for visualizing protein-protein interaction networks, differences in domain composition between genomes and the taxonomic distribution of individual superfamilies

PubMed Central

UCL Discovery

Repositorio Institucional Universidad de Málaga

The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution

Author: Addou Sarah
Cuff Alison
Dallman Tim
Dibley Mark
Greene Lesley H.
Lewis Tony E.
Nambudiry Rekha
Orengo Christine A.
Pearl Frances
Redfern Oliver
Reid Adam
Sillitoe Ian
Thornton Janet M.
Yeats Corin
Publication venue: Oxford University Press
Publication date: 29/11/2006

We report the latest release (version 3.0) of the CATH protein domain database. There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps, that have been made easier by the provision of information rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS) which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt

Sussex Research Online

New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures.

Author: Cuff Alison L
Dawson Natalie L
Dessailly Benoit H
Furnham Nicholas
Lee David
Lees Jonathan G
Lewis Tony E
Orengo Christine A
Rentzsch Robert
Sillitoe Ian
Studer Romain A
Thornton Janet M
Yeats Corin
Publication venue: Oxford University Press (OUP)
Publication date: 29/11/2012

CATH version 3.5 (Class, Architecture, Topology, Homology, available at http://www.cathdb.info/) contains 173 536 domains, 2626 homologous superfamilies and 1313 fold groups. When focusing on structural genomics (SG) structures, we observe that the number of new folds for CATH v3.5 is slightly less than for previous releases, and this observation suggests that we may now know the majority of folds that are easily accessible to structure determination. We have improved the accuracy of our functional family (FunFams) sub-classification method and the CATH sequence domain search facility has been extended to provide FunFam annotations for each domain. The CATH website has been redesigned. We have improved the display of functional data and of conserved sequence features associated with FunFams within each CATH superfamily

Crossref

LSHTM Research Online

PubMed Central

Microreact: visualizing and sharing data for genomic epidemiology and phylogeography

Author: Aanensen David M.
Abudahab Khalil
Argimón Silvia
Bhai Jyothish
Fedosejev Artemij
Feil Edward J.
Glasner Corinna
Goater Richard J.
Grundmann Hajo
Holden Matthew Thomas Geoffrey
Spratt Brian G.
Yeats Corin A.
Publication venue: Microbiology Society
Publication date: 19/10/2016

Visualization is frequently used to aid our interpretation of complex datasets. Within microbial genomics, visualizing the relationships between multiple genomes as a tree provides a framework onto which associated data (geographical, temporal, phenotypic and epidemiological) are added to generate hypotheses and to explore the dynamics of the system under investigation. Selected static images are then used within publications to highlight the key findings to a wider audience. However, these images are a very inadequate way of exploring and interpreting the richness of the data. There is, therefore, a need for flexible, interactive software that presents the population genomic outputs and associated data in a user-friendly manner for a wide range of end users, from trained bioinformaticians to front-line epidemiologists and health workers. Here, we present Microreact, a web application for the easy visualization of datasets consisting of any combination of trees, geographical, temporal and associated metadata. Data files can be uploaded to Microreact directly via the web browser or by linking to their location (e.g. from Google Drive/Dropbox or via API), and an integrated visualization via trees, maps, timelines and tables provides interactive querying of the data. The visualization can be shared as a permanent web link among collaborators, or embedded within publications to enable readers to explore and download the data. Microreact can act as an end point for any tool or bioinformatic pipeline that ultimately generates a tree, and provides a simple, yet powerful, visualization method that will aid research and discovery and the open sharing of datasets

Crossref

PubMed Central

Oxford University Research Archive

Spiral - Imperial College Digital Repository

University of St. Andrews - Pure

St Andrews Research Repository

Gene3D: merging structure and function for a Thousand genomes

Author: Andrew Clegg
Christine Orengo
Corin Yeats
Jonathan Lees
Oliver Redfern
Publication venue: Oxford University Press
Publication date: 01/01/2010

Over the last 2 years the Gene3D resource has been significantly improved, and is now more accurate and with a much richer interactive display via the Gene3D website (http://gene3d.biochem.ucl.ac.uk/). Gene3D provides accurate structural domain family assignments for over 1100 genomes and nearly 10 000 000 proteins. A hidden Markov model library, constructed from the manually curated CATH structural domain hierarchy, is used to search UniProt, RefSeq and Ensembl protein sequences. The resulting matches are refined into simple multi-domain architectures using a recently developed in-house algorithm, DomainFinder 3 (available at: ftp://ftp.biochem.ucl.ac.uk/pub/gene3d_data/DomainFinder3/). The domain assignments are integrated with multiple external protein function descriptions (e.g. Gene Ontology and KEGG), structural annotations (e.g. coiled coils, disordered regions and sequence polymorphisms) and family resources (e.g. Pfam and eggNog) and displayed on the Gene3D website. The website allows users to view descriptions for both single proteins and genes and large protein sets, such as superfamilies or genomes. Subsets can then be selected for detailed investigation or associated functions and interactions can be used to expand explorations to new proteins. Gene3D also provides a set of services, including an interactive genome coverage graph visualizer, DAS annotation resources, sequence search facilities and SOAP services

Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool.

Author: Aanensen David M
Abu-Dahab Khalil
Attwood Stephen W
Colquhoun Rachel
du Plessis Louis
Hill Verity
Holmes Edward C
Jackson Ben
Maloney Daniel
McCrone John T
Medd Nathan
O'Toole Áine
Pybus Oliver G
Rambaut Andrew
Ruis Chris
Scher Emily
Taylor Ben
Underwood Anthony
Yeats Corin
Publication venue: Virus Evol
Publication date: 01/01/2021

Funder: Oxford Martin School, University of Oxford

The response of the global virus genomics community to the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has been unprecedented, with significant advances made towards the 'real-time' generation and sharing of SARS-CoV-2 genomic data. The rapid growth in virus genome data production has necessitated the development of new analytical methods that can deal with orders of magnitude more genomes than previously available. Here, we present and describe Phylogenetic Assignment of Named Global Outbreak Lineages (pangolin), a computational tool that has been developed to assign the most likely lineage to a given SARS-CoV-2 genome sequence according to the Pango dynamic lineage nomenclature scheme. To date, nearly two million virus genomes have been submitted to the web-application implementation of pangolin, which has facilitated SARS-CoV-2 genomic epidemiology and provided researchers with access to actionable information about the pandemic's transmission lineages

Repository for Publications and Research Data

PubMed Central

Edinburgh Research Explorer

Sydney eScholarship

Apollo (Cambridge)