Protein function prediction using domain families
Here we assessed the use of domain families for predicting the functions of whole proteins. These 'functional families' (FunFams) were derived using a protocol that combines sequence clustering with supervised cluster evaluation, relying on available high-quality Gene Ontology (GO) annotation data in the latter step. In essence, the protocol groups domain sequences belonging to the same superfamily into families based on the GO annotations of their parent proteins. An initial test based on enzyme sequences confirmed that the FunFams resemble enzyme (domain) families much better than do families produced by sequence clustering alone. For the CAFA 2011 experiment, we further associated the FunFams with GO terms probabilistically. All target proteins were first submitted to domain superfamily assignment, followed by FunFam assignment and, eventually, function assignment. The latter included an integration step for multi-domain target proteins. The CAFA results put our domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons
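The integration step for multi-domain proteins mentioned above can be illustrated with a minimal sketch. The abstract does not state which combination rule was used, so the noisy-OR rule below is an assumption, as are the function name `integrate_domain_predictions` and the example GO term probabilities.

```python
from collections import defaultdict


def integrate_domain_predictions(domain_predictions):
    """Combine per-domain GO term probabilities into protein-level scores.

    ASSUMPTION: a noisy-OR combination is used purely for illustration;
    the abstract does not specify the actual integration rule.
    Each element of `domain_predictions` maps GO term -> probability
    for one domain (one FunFam assignment) of the target protein.
    """
    # Probability that *no* domain supports the term, per GO term.
    prob_unsupported = defaultdict(lambda: 1.0)
    for go_probs in domain_predictions:
        for go_term, p in go_probs.items():
            if not 0.0 <= p <= 1.0:
                raise ValueError(f"probability out of range for {go_term}: {p}")
            prob_unsupported[go_term] *= 1.0 - p
    return {term: 1.0 - q for term, q in prob_unsupported.items()}


# Hypothetical two-domain protein; the probabilities are made up.
domains = [
    {"GO:0003824": 0.5, "GO:0016301": 0.5},
    {"GO:0003824": 0.5},
]
scores = integrate_domain_predictions(domains)
```

Under this rule a term supported by several domains scores higher than one supported by a single domain, which is one plausible way to reward agreement between domains.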
Functional classification of CATH superfamilies: a domain-based approach for protein function annotation
Computational approaches that can predict protein functions are essential to bridge the widening function annotation gap especially since <1.0% of all proteins in UniProtKB have been experimentally characterised. We present a domain-based method for protein function classification and prediction of functional sites that exploits functional subclassification of CATH superfamilies. The superfamilies are subclassified into functional families (FunFams) using a hierarchical clustering algorithm supervised by a new classification method, FunFHMMer
FLORA: a novel method to predict protein function from structure in diverse superfamilies
Predicting protein function from structure remains an active area of interest, particularly for the structural genomics initiatives where a substantial number of structures are initially solved with little or no functional characterisation. Although global structure comparison methods can be used to transfer functional annotations, the relationship between fold and function is complex, particularly in functionally diverse superfamilies that have evolved through different secondary structure embellishments to a common structural core. The majority of prediction algorithms employ local templates built on known or predicted functional residues. Here, we present a novel method (FLORA) that automatically generates structural motifs associated with different functional sub-families (FSGs) within functionally diverse domain superfamilies. Templates are created purely on the basis of their specificity for a given FSG, and the method makes no prior prediction of functional sites, nor assumes specific physico-chemical properties of residues. FLORA is able to accurately discriminate between homologous domains with different functions and substantially outperforms (a 2–3 fold increase in coverage at low error rates) popular structure comparison methods and a leading function prediction method. We benchmark FLORA on a large data set of enzyme superfamilies from all three major protein classes (α, β, αβ) and demonstrate the functional relevance of the motifs it identifies. We also provide novel predictions of enzymatic activity for a large number of structures solved by the Protein Structure Initiative. Overall, we show that FLORA is able to effectively detect functionally similar protein domain structures by purely using patterns of structural conservation of all residues
Protein function annotation using protein domain family resources
As a result of the genome sequencing and structural genomics initiatives, we have a wealth of protein sequence and structural data. However, only about 1% of these proteins have experimental functional annotations. As a result, computational approaches that can predict protein functions are essential in bridging this widening annotation gap. This article reviews the current approaches of protein function prediction using structure and sequence based classification of protein domain family resources with a special focus on functional families in the CATH-Gene3D resource
Functional classification of protein domain superfamilies for protein function annotation
Proteins are made up of domains that are generally considered to be independent evolutionary and structural units having distinct functional properties. It is now well established that analysis of domains in proteins provides an effective approach to understand protein function using a 'domain grammar'. Towards this end, evolutionarily related protein domains have been classified into homologous superfamilies in the CATH and SCOP databases. An ideal functional sub-classification of the domain superfamilies into 'functional families' can not only help in function annotation of uncharacterised sequences but also provide a useful framework for understanding the diversity and evolution of function at the domain level. This work describes the development of a new protocol (FunFHMMer) for identifying functional families in CATH superfamilies that makes use of sequence patterns only and hence is unaffected by the incompleteness of function annotations, annotation biases or misannotations existing in the databases. The resulting family classification was validated using known functional information and was found to generate more functionally coherent families than other domain-based protein resources. A protein function prediction pipeline was developed exploiting the functional annotations provided by the domain families, which was validated by a database rollback benchmark set of proteins and an independent assessment by CAFA 2. The functional classification was found to capture the functional diversity of superfamilies well in terms of sequence, structure and the protein-context. This aided studies on evolution of protein domain function both at the superfamily level and in specific proteins of interest. The conserved positions in the functional family alignments were found to be enriched in catalytic site residues and ligand-binding site residues, which led to the development of a functional site prediction tool.
Lastly, the function prediction tools were assessed for annotation of moonlighting functions of proteins and a classification of moonlighting proteins was proposed based on their structure-function relationships
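The idea of finding conserved positions in a family alignment can be sketched as follows. This is not the scoring used by FunFHMMer or the functional site prediction tool, which the abstract does not describe; a Shannon-entropy score normalised over 20 amino acids is assumed here, and the function name and toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_conservation(alignment):
    """Return one conservation score in [0, 1] per alignment column.

    ASSUMPTION: conservation = 1 - (Shannon entropy / log2(20)), with gaps
    ('-') ignored. Chosen for illustration only.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    scores = []
    for col in range(length):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        if not residues:
            scores.append(0.0)  # all-gap column: treat as not conserved
            continue
        total = len(residues)
        entropy = 0.0
        for count in Counter(residues).values():
            freq = count / total
            entropy -= freq * math.log2(freq)
        scores.append(1.0 - entropy / math.log2(20))
    return scores


# Toy alignment (made up): column 0 is invariant, column 1 varies.
toy_alignment = ["HDS", "HES", "HD-"]
conservation = column_conservation(toy_alignment)
```

Columns with scores near 1 would be the candidates to inspect for catalytic or ligand-binding residues.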
Berkeley Phylogenomics Group web servers: resources for structural phylogenomic analysis
Phylogenomic analysis addresses the limitations of function prediction based on annotation transfer, and has been shown to enable the highest accuracy in prediction of protein molecular function. The Berkeley Phylogenomics Group provides a series of web servers for phylogenomic analysis: classification of sequences to pre-computed families and subfamilies using the PhyloFacts Phylogenomic Encyclopedia, FlowerPower clustering of proteins sharing the same domain architecture, MUSCLE multiple sequence alignment, SATCHMO simultaneous alignment and tree construction and SCI-PHY subfamily identification. The PhyloBuilder web server provides an integrated phylogenomic pipeline starting with a user-supplied protein sequence, proceeding to homolog identification, multiple alignment, phylogenetic tree construction, subfamily identification and structure prediction. The Berkeley Phylogenomics Group resources are available at http://phylogenomics.berkeley.edu
Beyond the E-value: stratified statistics for protein domain prediction
E-values have been the dominant statistic for protein sequence analysis for
the past two decades: from identifying statistically significant local sequence
alignments to evaluating matches to hidden Markov models describing protein
domain families. Here we formally show that for "stratified" multiple
hypothesis testing problems, controlling the local False Discovery Rate (lFDR)
per stratum, or partition, yields the most predictions across the data at any
given threshold on the FDR or E-value over all strata combined. For the
important problem of protein domain prediction, a key step in characterizing
protein structure, function and evolution, we show that stratifying statistical
tests by domain family yields excellent results. We develop the first
FDR-estimating algorithms for domain prediction, and evaluate how well
thresholds based on q-values, E-values and lFDRs perform in domain prediction
using five complementary approaches for estimating empirical FDRs in this
context. We show that stratified q-value thresholds substantially outperform
E-values. Contradicting our theoretical results, q-values also outperform
lFDRs; however, our tests reveal a small but coherent subset of domain
families, biased towards models for specific repetitive patterns, for which
FDRs are greatly underestimated due to weaknesses in random sequence models.
Use of lFDR thresholds outperforms q-values for the remaining families, which
have as-expected noise, suggesting that further improvements in domain
predictions can be achieved with improved modeling of random sequences.
Overall, our theoretical and empirical findings suggest that the use of
stratified q-values and lFDRs could result in improvements in a host of
structured multiple hypothesis testing problems arising in bioinformatics,
including genome-wide association studies, orthology prediction, motif
scanning, and multi-microarray analyses.
Comment: 31 pages, 8 figures, does not include supplementary file
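As a rough illustration of stratifying by domain family, the sketch below computes q-values separately within each family and then applies one threshold. It assumes per-hit p-values are available and uses the standard Benjamini-Hochberg adjustment, not the FDR-estimating algorithms developed in the paper; the function names and input values are hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values) for one stratum."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


def stratified_calls(pvalues_by_family, q_threshold):
    """Apply the q-value threshold within each family (stratum) separately.

    ASSUMPTION: each family's hits come with p-values; the paper works from
    E-values and its own FDR estimators, which are not reproduced here.
    """
    return {
        family: [q <= q_threshold for q in bh_qvalues(pvals)]
        for family, pvals in pvalues_by_family.items()
    }


# Made-up p-values for two hypothetical domain families.
example = {
    "family_A": [0.01, 0.04, 0.03, 0.5],
    "family_B": [0.02, 0.03],
}
calls = stratified_calls(example, q_threshold=0.05)
```

Because the adjustment is computed per family, a family with few, strong hits is not penalised by the number of tests performed for larger, noisier families.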
Identifying driver mutations in cancers
All cancers depend upon mutations in critical genes, which confer a selective advantage to the tumour cell. The key to understanding the contribution of a disease-associated mutation to the development and progression of cancer comes from an understanding of the consequences of that mutation on the function of the affected protein, and the impact on the pathways in which that protein is involved.
Using data from over 30 different cancers from whole-exome sequencing cancer genomic projects, I analysed over one million somatic mutations. I identified mutational hotspots within domain families by mapping small mutations to equivalent positions in multiple sequence alignments of protein domains. I found that gain of function mutations from oncogenes and loss of function mutations from tumour suppressors are normally found in different domain families and when observed in the same domain families, hotspot mutations are located at different positions within the multiple sequence alignment of the domain.
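Mapping mutations to equivalent positions in a domain alignment can be sketched as below. The thesis abstract does not give its hotspot criteria, so the simple count threshold `min_count`, the function names and the toy alignment are all assumptions made for illustration.

```python
from collections import Counter


def residue_to_column(aligned_seq, start=1):
    """Map residue numbers (1-based by default) to 0-based alignment columns."""
    mapping = {}
    residue_number = start
    for column, char in enumerate(aligned_seq):
        if char not in "-.":  # skip gap characters
            mapping[residue_number] = column
            residue_number += 1
    return mapping


def hotspot_columns(alignment, mutations, min_count=2):
    """Count mutations per alignment column and keep recurrent columns.

    `alignment` maps protein id -> aligned domain sequence.
    `mutations` is an iterable of (protein id, residue number) pairs.
    ASSUMPTION: a column is a 'hotspot' if it collects >= min_count mutations.
    """
    column_maps = {pid: residue_to_column(seq) for pid, seq in alignment.items()}
    counts = Counter()
    for protein_id, position in mutations:
        column = column_maps.get(protein_id, {}).get(position)
        if column is not None:  # ignore mutations outside the aligned domain
            counts[column] += 1
    return {col: n for col, n in counts.items() if n >= min_count}


# Toy alignment and mutations (made up).
toy_alignment = {"P1": "MK-LV", "P2": "M-ALV"}
toy_mutations = [("P1", 3), ("P2", 3), ("P1", 2), ("P3", 1)]
hotspots = hotspot_columns(toy_alignment, toy_mutations)
```

Here residue 3 of both proteins falls in the same alignment column, so mutations from different proteins accumulate at one equivalent position.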
Next, I investigated the ability of seven prediction algorithms to discriminate between driver missense mutations in oncogenes and tumour suppressors. Using 19 features to describe these mutations, I then developed a random forest classifier, MOKCaRF, to distinguish between gain of function and loss of function missense mutations in cancer. MOKCaRF performs significantly better than existing algorithms.
I then evaluated the ability of six existing prediction tools to distinguish between pathogenic and neutral mutations for both inframe insertion and inframe deletion mutations. I developed my own classifiers using 11 features that perform better than the current algorithms.
Finally, using the algorithms that I developed, as well as changes in copy number and expression data for each gene, I analysed samples from 50 lung cancer patients to identify the actionable targets and potential new drug targets for each tumour