Protein function prediction using domain families

Authors: Orengo Christine A.; Rentzsch Robert
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2013

Here we assessed the use of domain families for predicting the functions of whole proteins. These 'functional families' (FunFams) were derived using a protocol that combines sequence clustering with supervised cluster evaluation, relying on available high-quality Gene Ontology (GO) annotation data in the latter step. In essence, the protocol groups domain sequences belonging to the same superfamily into families based on the GO annotations of their parent proteins. An initial test based on enzyme sequences confirmed that the FunFams resemble enzyme (domain) families much better than do families produced by sequence clustering alone. For the CAFA 2011 experiment, we further associated the FunFams with GO terms probabilistically. All target proteins were first submitted to domain superfamily assignment, followed by FunFam assignment and, finally, function assignment. The latter included an integration step for multi-domain target proteins. The CAFA results put our domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons.
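
The abstract does not say how FunFams were associated with GO terms or how multi-domain proteins were integrated. As a rough illustration only, the sketch below assumes a term's score is the fraction of annotated family members carrying it, and that per-domain scores are merged by keeping the maximum per term. The function names and the GO identifiers in the toy data are hypothetical, not the authors' implementation.

```python
from collections import Counter


def family_term_scores(member_annotations):
    """Score each GO term by the fraction of annotated members that carry it.

    member_annotations: one set of GO term IDs per family member.
    Members without any annotation are ignored (an assumption of this sketch).
    """
    annotated = [terms for terms in member_annotations if terms]
    if not annotated:
        return {}
    counts = Counter(term for terms in annotated for term in set(terms))
    return {term: n / len(annotated) for term, n in counts.items()}


def combine_domains(per_domain_scores):
    """Merge per-domain score dicts for one protein, keeping the max per term."""
    combined = {}
    for scores in per_domain_scores:
        for term, score in scores.items():
            combined[term] = max(combined.get(term, 0.0), score)
    return combined


# Toy data: two families assigned to the two domains of one target protein.
fam_a = family_term_scores([{"GO:0016301", "GO:0005524"}, {"GO:0016301"}, set()])
fam_b = family_term_scores([{"GO:0005524"}, {"GO:0003677"}])
protein = combine_domains([fam_a, fam_b])
```

Taking the maximum is only one possible integration rule; a probabilistic combination across domains would be an equally plausible reading of the abstract.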

Springer - Publisher Connector

UCL Discovery

PubMed Central

Publikationsserver des Robert Koch-Instituts

Functional classification of CATH superfamilies: a domain-based approach for protein function annotation

Authors: Das S; Dawson NL; Lee D; Lees JG; Orengo CA; Sillitoe I
Publication date: 02/07/2015

Computational approaches that can predict protein functions are essential to bridge the widening function annotation gap, especially since <1.0% of all proteins in UniProtKB have been experimentally characterised. We present a domain-based method for protein function classification and prediction of functional sites that exploits functional subclassification of CATH superfamilies. The superfamilies are subclassified into functional families (FunFams) using a hierarchical clustering algorithm supervised by a new classification method, FunFHMMer.

UCL Discovery

PubMed Central

FLORA: a novel method to predict protein function from structure in diverse superfamilies

Predicting protein function from structure remains an active area of interest, particularly for the structural genomics initiatives where a substantial number of structures are initially solved with little or no functional characterisation. Although global structure comparison methods can be used to transfer functional annotations, the relationship between fold and function is complex, particularly in functionally diverse superfamilies that have evolved through different secondary structure embellishments to a common structural core. The majority of prediction algorithms employ local templates built on known or predicted functional residues. Here, we present a novel method (FLORA) that automatically generates structural motifs associated with different functional sub-families (FSGs) within functionally diverse domain superfamilies. Templates are created purely on the basis of their specificity for a given FSG, and the method makes no prior prediction of functional sites, nor assumes specific physico-chemical properties of residues. FLORA is able to accurately discriminate between homologous domains with different functions and substantially outperforms (a 2–3-fold increase in coverage at low error rates) popular structure comparison methods and a leading function prediction method. We benchmark FLORA on a large data set of enzyme superfamilies from all three major protein classes (α, β, αβ) and demonstrate the functional relevance of the motifs it identifies. We also provide novel predictions of enzymatic activity for a large number of structures solved by the Protein Structure Initiative. Overall, we show that FLORA is able to effectively detect functionally similar protein domain structures purely by using patterns of structural conservation of all residues.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

UCL Discovery

PubMed Central

Protein function annotation using protein domain family resources

Authors: Das S; Orengo CA
Publication date: 15/01/2016

As a result of the genome sequencing and structural genomics initiatives, we have a wealth of protein sequence and structural data. However, only about 1% of these proteins have experimental functional annotations. Computational approaches that can predict protein functions are therefore essential in bridging this widening annotation gap. This article reviews the current approaches to protein function prediction using structure- and sequence-based classification of protein domain family resources, with a special focus on functional families in the CATH-Gene3D resource.

UCL Discovery

Functional classification of protein domain superfamilies for protein function annotation

Author: Das S
Publication venue: UCL (University College London)
Publication date: 28/10/2016

Proteins are made up of domains that are generally considered to be independent evolutionary and structural units having distinct functional properties. It is now well established that analysis of domains in proteins provides an effective approach to understand protein function using a 'domain grammar'. Towards this end, evolutionarily related protein domains have been classified into homologous superfamilies in the CATH and SCOP databases. An ideal functional sub-classification of the domain superfamilies into 'functional families' can not only help in function annotation of uncharacterised sequences but also provide a useful framework for understanding the diversity and evolution of function at the domain level. This work describes the development of a new protocol (FunFHMMer) for identifying functional families in CATH superfamilies that makes use of sequence patterns only and hence is unaffected by the incompleteness of function annotations, annotation biases or misannotations existing in the databases. The resulting family classification was validated using known functional information and was found to generate more functionally coherent families than other domain-based protein resources. A protein function prediction pipeline was developed exploiting the functional annotations provided by the domain families, and was validated by a database rollback benchmark set of proteins and an independent assessment by CAFA 2. The functional classification was found to capture the functional diversity of superfamilies well in terms of sequence, structure and the protein context. This aided studies on evolution of protein domain function both at the superfamily level and in specific proteins of interest. The conserved positions in the functional family alignments were found to be enriched in catalytic site residues and ligand-binding site residues, which led to the development of a functional site prediction tool. Lastly, the function prediction tools were assessed for annotation of moonlighting functions of proteins and a classification of moonlighting proteins was proposed based on their structure-function relationships.

UCL Discovery

Berkeley Phylogenomics Group web servers: resources for structural phylogenomic analysis

Authors: Glanville Jake Gunn; Kirshner Dan; Krishnamurthy Nandini; Sjölander Kimmen
Publication venue: Oxford University Press
Publication date: 01/01/2007

Phylogenomic analysis addresses the limitations of function prediction based on annotation transfer, and has been shown to enable the highest accuracy in prediction of protein molecular function. The Berkeley Phylogenomics Group provides a series of web servers for phylogenomic analysis: classification of sequences to pre-computed families and subfamilies using the PhyloFacts Phylogenomic Encyclopedia, FlowerPower clustering of proteins sharing the same domain architecture, MUSCLE multiple sequence alignment, SATCHMO simultaneous alignment and tree construction and SCI-PHY subfamily identification. The PhyloBuilder web server provides an integrated phylogenomic pipeline starting with a user-supplied protein sequence, proceeding to homolog identification, multiple alignment, phylogenetic tree construction, subfamily identification and structure prediction. The Berkeley Phylogenomics Group resources are available at http://phylogenomics.berkeley.edu

CiteSeerX

Crossref

PubMed Central

Beyond the E-value: stratified statistics for protein domain prediction

Authors: Llinás Manuel; Ochoa Alejandro; Singh Mona; Storey John D.
Publication venue: Public Library of Science (PLoS)
Publication date: 23/03/2015

E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for "stratified" multiple hypothesis testing problems, controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which FDRs are greatly underestimated due to weaknesses in random sequence models. Use of lFDR thresholds outperforms q-values for the remaining families, which have as-expected noise, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, motif scanning, and multi-microarray analyses.
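
To make the idea of stratifying by domain family concrete, the sketch below computes q-value-like statistics separately within each family. It is a minimal illustration that assumes Benjamini-Hochberg adjusted p-values as a simple stand-in for q-values (equivalent to assuming every null hypothesis could be true); the paper's own estimators are not given in the abstract, and the family labels and p-values are made up.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def stratified_qvalues(hits):
    """hits: list of (family, pvalue). Adjust p-values within each family."""
    by_family = {}
    for idx, (family, p) in enumerate(hits):
        by_family.setdefault(family, []).append((idx, p))
    result = [None] * len(hits)
    for members in by_family.values():
        qs = bh_qvalues([p for _, p in members])
        for (idx, _), q in zip(members, qs):
            result[idx] = q
    return result


# Hypothetical hits: four to family "PF1", one to family "PF2".
hits = [("PF1", 0.01), ("PF2", 0.01), ("PF1", 0.04), ("PF1", 0.03), ("PF1", 0.5)]
stratified = stratified_qvalues(hits)
pooled = bh_qvalues([p for _, p in hits])
```

In this toy input the two hits with p = 0.01 receive different stratified values because their families contain different numbers of tests, whereas the pooled adjustment treats them identically.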

arXiv.org e-Print Archive

Princeton University Open Access Repository

Directory of Open Access Journals

PubMed Central

FigShare

Identifying driver mutations in cancers

Author: Baeissa Hanadi
Publication date: 09/04/2019

All cancers depend upon mutations in critical genes, which confer a selective advantage to the tumour cell. The key to understanding the contribution of a disease-associated mutation to the development and progression of cancer comes from an understanding of the consequences of that mutation on the function of the affected protein, and the impact on the pathways in which that protein is involved. Using data from over 30 different cancers from whole-exome sequencing cancer genomic projects, I analysed over one million somatic mutations. I identified mutational hotspots within domain families by mapping small mutations to equivalent positions in multiple sequence alignments of protein domains. I found that gain-of-function mutations from oncogenes and loss-of-function mutations from tumour suppressors are normally found in different domain families and, when observed in the same domain families, hotspot mutations are located at different positions within the multiple sequence alignment of the domain. Next, I investigated the ability of seven prediction algorithms to discriminate between driver missense mutations in oncogenes and tumour suppressors. Using 19 features to describe these mutations, I then developed a random forest classifier, MOKCaRF, to distinguish between gain-of-function and loss-of-function missense mutations in cancer. MOKCaRF performs significantly better than existing algorithms. I then evaluated the ability of six existing prediction tools to distinguish between pathogenic and neutral mutations for both in-frame insertion and in-frame deletion mutations. I developed my own classifiers using 11 features that perform better than the current algorithms. Finally, using the algorithms that I developed, as well as changes in copy number and expression data for each gene, I analysed samples from 50 lung cancer patients to identify the actionable targets and potential new drug targets for each tumour.
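
The mapping of mutations to equivalent alignment positions can be sketched as follows. This is an illustrative assumption about the bookkeeping involved, not the thesis's pipeline: residue numbers are converted to alignment columns by skipping gap characters, and mutations are then tallied per column. The sequences, protein names and positions are invented toy data, and no statistical hotspot test is included.

```python
from collections import Counter


def residue_to_column(aligned_seq, start=1):
    """Map residue numbers (start, start + 1, ...) to 0-based alignment columns."""
    mapping = {}
    residue = start
    for column, char in enumerate(aligned_seq):
        if char not in "-.":  # gap characters do not consume a residue number
            mapping[residue] = column
            residue += 1
    return mapping


def column_mutation_counts(alignment, mutations):
    """Count mutations per alignment column.

    alignment: dict of protein name -> aligned (gapped) domain sequence.
    mutations: iterable of (protein name, residue number within the domain).
    Mutations outside the aligned region are skipped.
    """
    counts = Counter()
    for protein, position in mutations:
        column = residue_to_column(alignment[protein]).get(position)
        if column is not None:
            counts[column] += 1
    return counts


alignment = {"P1": "MK-LV", "P2": "M-RLV"}
mutations = [("P1", 3), ("P2", 3), ("P2", 2), ("P1", 9)]
counts = column_mutation_counts(alignment, mutations)
```

Columns with unusually high counts would be the candidates for hotspots; deciding what counts as unusual requires a background model that this sketch leaves out.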

Sussex Research Online