SIFTER search: a web server for accurate phylogeny-based protein function prediction.
We are awash in proteins discovered through high-throughput sequencing projects. As only a minuscule fraction of these have been experimentally characterized, computational methods are widely used for automated annotation. Here, we introduce a user-friendly web interface for accurate protein function prediction using the SIFTER algorithm. SIFTER is a state-of-the-art sequence-based gene molecular function prediction algorithm that uses a statistical model of function evolution to incorporate annotations throughout the phylogenetic tree. Due to the resources needed by the SIFTER algorithm, running SIFTER locally is not trivial for most users, especially for large-scale problems. The SIFTER web server thus provides access to precomputed predictions on 16 863 537 proteins from 232 403 species. Users can explore SIFTER predictions with queries for proteins, species, functions, and homologs of sequences not in the precomputed prediction set. The SIFTER web server is accessible at http://sifter.berkeley.edu/ and the source code can be downloaded
ProLanGO: Protein Function Prediction Using Neural Machine Translation Based on a Recurrent Neural Network
With the development of next generation sequencing techniques, it is fast and
cheap to determine protein sequences but relatively slow and expensive to
extract useful information from protein sequences because of limitations of
traditional biological experimental techniques. Protein function prediction has
been a long standing challenge to fill the gap between the huge amount of
protein sequences and the known function. In this paper, we propose a novel
method that converts the protein function prediction problem into a language
translation problem, from the newly proposed protein sequence language "ProLan"
to the protein function language "GOLan", and build a neural machine
translation model based on recurrent neural networks to translate "ProLan"
into "GOLan". We blindly tested our method by participating in the third
Critical Assessment of Function Annotation (CAFA 3) in 2016, and also evaluated
its performance on selected proteins whose function was released after the CAFA
competition. The good performance on the training and testing datasets
demonstrates that the proposed method is a promising direction for protein
function prediction. In summary, we are the first to propose a method that
converts the protein function prediction problem into a language translation
problem and applies a neural machine translation model to protein function
prediction.
Comment: 13 pages, 5 figure
Gene Function Classification Using Bayesian Models with Hierarchy-Based Priors
We investigate the application of hierarchical classification schemes to the
annotation of gene function based on several characteristics of protein
sequences including phylogenic descriptors, sequence based attributes, and
predicted secondary structure. We discuss three Bayesian models and compare
their performance in terms of predictive accuracy. These models are the
ordinary multinomial logit (MNL) model, a hierarchical model based on a set of
nested MNL models, and an MNL model with a prior that introduces correlations
between the parameters for classes that are nearby in the hierarchy. We also
provide a new scheme for combining different sources of information. We use
these models to predict the functional class of Open Reading Frames (ORFs) from
the E. coli genome. The results from all three models show substantial
improvement over previous methods, which were based on the C5 algorithm. The
MNL model using a prior based on the hierarchy outperforms both the
non-hierarchical MNL model and the nested MNL model. In contrast to previous
attempts at combining these sources of information, our approach results in a
higher accuracy rate when compared to models that use each data source alone.
Together, these results show that gene function can be predicted with higher
accuracy than previously achieved, using Bayesian models that incorporate
suitable prior information
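
The difference between the ordinary MNL model and the nested MNL model described in this abstract can be illustrated with a minimal sketch. It assumes a two-level hierarchy and plain linear scores; the function names (`flat_mnl`, `nested_mnl`) and the weight layout are hypothetical choices for illustration, not the authors' implementation, and the hierarchy-based prior is not modelled here.

```python
import math


def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_scores(weights, x):
    """One dot product per class: weights is a list of weight vectors."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]


def flat_mnl(weights, x):
    """Ordinary MNL: a single softmax over all classes."""
    return softmax(linear_scores(weights, x))


def nested_mnl(top_weights, child_weights, x):
    """Nested MNL over a two-level hierarchy (hypothetical layout).

    top_weights[k] is the weight vector of top-level class k, and
    child_weights[k] lists the weight vectors of its subclasses.
    The probability of a subclass is P(parent) * P(child | parent).
    """
    top = softmax(linear_scores(top_weights, x))
    return [
        [p * q for q in softmax(linear_scores(child_weights[k], x))]
        for k, p in enumerate(top)
    ]


# Illustrative values only: two top-level classes with two subclasses each.
x = [1.0, 2.0]
top_w = [[0.5, -0.2], [-0.1, 0.3]]
child_w = [[[0.2, 0.1], [0.0, 0.4]], [[-0.3, 0.2], [0.1, 0.1]]]
leaf_probs = nested_mnl(top_w, child_w, x)
```

In the nested form each level has its own softmax, so the probabilities of all leaf classes still sum to 1.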
Human protein function prediction: application of machine learning for integration of heterogeneous data sources
Experimental characterisation of protein cellular function can be prohibitively expensive and
take years to complete. To address this problem, this thesis focuses on the development of computational
approaches to predict function from sequence. For sequences with well characterised
close relatives, annotation is trivial; orphans or distant homologues present a greater challenge.
The use of a feature based method employing ensemble support vector machines to predict individual
Gene Ontology classes is investigated. It is found that different combinations of feature
inputs are required to recognise different functions. Although the approach is applicable to any
human protein sequence, it is restricted to broadly descriptive functions. The method is well
suited to prioritisation of candidate functions for novel proteins rather than to make highly accurate
class assignments.
Signatures of common function can be derived from different biological characteristics; interactions
and binding events as well as expression behaviour. To investigate the hypothesis that
common function can be derived from expression information, public domain human microarray
datasets are assembled. The questions of how best to integrate these datasets and derive
features that are useful in function prediction are addressed. Both co-expression and abundance
information are represented between and within experiments and investigated for correlation with
function. It is found that features derived from expression data serve as a weak but significant
signal for recognising functions. This signal is stronger for biological processes than molecular
function categories and independent of homology information.
The protein domain has historically been regarded as a modular evolutionary unit of protein function.
The occurrence of domains that can be linked by ancestral fusion events serves as a signal
for domain-domain interactions. To exploit this information for function prediction, novel domain
architecture and fused architecture scores are developed. Architecture scores rather than
single domain scores correlate more strongly with function, and both architecture and fusion
scores correlate more strongly with molecular functions than biological processes.
The final study details the development of a novel heterogeneous function prediction approach
designed to target the annotation of both homologous and non-homologous proteins. Support
vector regression is used to combine pair-wise sequence features with expression scores and
domain architecture scores to rank protein pairs in terms of their functional similarities. The
target of the regression models represents the continuum of protein function space empirically
derived from the Gene Ontology molecular function and biological process graphs. The merit
and performance of the approach are demonstrated using homologous and non-homologous test
datasets, on which it significantly improves upon classical nearest neighbour annotation transfer by sequence
methods. The final model represents a method that achieves a compromise between
high specificity and sensitivity for all human proteins regardless of their homology status. It is
expected that this strategy will allow for more comprehensive and accurate annotations of the
human proteome
Combining Homolog and Motif Similarity Data with Gene Ontology Relationships for Protein Function Prediction
Uncharacterized proteins pose a challenge not just to functional genomics, but also to biology in general. Knowledge of the biochemical functions of such proteins is critical for designing efficient therapeutic techniques. The bottleneck in annotating hypothetical proteins is the difficulty of collecting and aggregating enough biological information about the protein itself. In this paper, we propose and evaluate a protein annotation technique that aggregates different biological information conserved across many hypothetical proteins. To enhance the performance and to increase the prediction accuracy, we incorporate term-specific relationships based on Gene Ontology (GO). Our method combines protein-protein interaction (PPI) data, protein motif information, protein sequence similarity and protein homology data with a context similarity measure based on Gene Ontology to accurately infer functional information for unannotated proteins. We apply our method to Saccharomyces cerevisiae proteins. The aggregation of different sources of evidence with GO relationships increases the precision and accuracy of prediction compared to other methods reported in the literature. We predicted with a precision and accuracy of 100% for more than half of the proteins in the input set and with an overall 81.35% precision and 80.04% accurac
The genome of the protozoan parasite Cystoisospora suis and a reverse vaccinology approach to identify vaccine candidates
Vaccine development targeting protozoan parasites remains challenging, partly due to the complex interactions between these eukaryotes and the host immune system. Reverse vaccinology is a promising approach for direct screening of genome sequence assemblies for new vaccine candidate proteins. Here, we applied this paradigm to Cystoisospora suis, an apicomplexan parasite that causes enteritis and diarrhea in suckling piglets and economic losses in pig production worldwide. Using Next Generation Sequencing we produced an ∼84 Mb sequence assembly for the C. suis genome, making it the first available reference for the genus Cystoisospora. Then, we derived a manually curated annotation of more than 11,000 protein-coding genes and applied the tool Vacceed to identify 1,168 vaccine candidates by screening the predicted C. suis proteome. To refine the set of candidates, we looked at proteins that are highly expressed in merozoites and specific to apicomplexans. The stringent set of candidates included 220 proteins, among which were 152 proteins with unknown function, 17 surface antigens of the SAG and SRS gene families, 12 proteins of the apicomplexan-specific secretory organelles including AMA1, MIC6, MIC13, ROP6, ROP12, ROP27, ROP32 and three proteins related to cell adhesion. Finally, we demonstrated in vitro the immunogenic potential of a C. suis-specific 42 kDa transmembrane protein, which might constitute an attractive candidate for further testing