    Functional Annotation of Proteins using Domain Embedding based Sequence Classification

    Due to recent advances in genomic sequencing technologies, the number of protein sequences in public databases is growing exponentially. The UniProt Knowledgebase (UniProtKB) is currently the largest and most comprehensive resource for protein sequence and annotation data. The May 2019 release of UniProtKB contains around 158 million protein sequences. To fully exploit this huge knowledge base, protein sequences need to be annotated with functional properties such as Enzyme Commission (EC) numbers and Gene Ontology terms. However, only about half a million sequences (UniProtKB/Swiss-Prot) have been reviewed and functionally annotated by expert curators using information extracted from the published literature and computational analyses. Manual annotation by experts is expensive, slow, and insufficient to close the gap between annotated and unannotated protein sequences. In this paper, we present an automatic functional annotation technique using neural-network-based word embeddings that exploit the domain and family information of proteins. Domains are the most conserved regions in protein sequences and constitute the building blocks of 3D protein structures. For the experiments, we used fastText (https://github.com/facebookresearch/fasttex), a library for learning word embeddings and text classification developed by Facebook's AI Research lab. The experimental results show that domain embeddings perform much better than k-mer-based word embeddings.
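As a rough illustration of the contrast drawn in this abstract, the sketch below builds two "sentences" for the same protein: one made of overlapping k-mers and one made of domain identifiers, each prefixed with a `__label__` tag in the style of fastText's supervised input format. The sequence, the domain identifiers (`DOM_A`, `DOM_B`), the choice of k = 3, and the helper names are all hypothetical placeholders; the abstract does not specify how the authors tokenized their data.

```python
def kmer_tokens(sequence, k=3):
    """Split an amino-acid sequence into overlapping k-mers ("words")."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def fasttext_line(tokens, ec_number):
    """Format one labelled example: '__label__<class> tok1 tok2 ...'."""
    return f"__label__{ec_number} " + " ".join(tokens)


# Made-up toy protein; not taken from UniProtKB.
sequence = "MKTAYIAK"
# Placeholder domain identifiers (assumption: one token per annotated domain).
domains = ["DOM_A", "DOM_B"]

kmer_line = fasttext_line(kmer_tokens(sequence), "1.1.1.1")
domain_line = fasttext_line(domains, "1.1.1.1")

print(kmer_line)
print(domain_line)
```

Under this framing, the domain-based sentence is much shorter than the k-mer one and each token corresponds to a conserved region rather than an arbitrary window of residues.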

    GrAPFI: predicting enzymatic function of proteins from domain similarity graphs

    This work is dedicated to the memory of David W. Ritchie, who recently passed away. Background: Thanks to recent developments in genomic sequencing technologies, the number of protein sequences in public databases is growing enormously. To enrich and exploit this immensely valuable data, it is essential to annotate these sequences with functional properties such as Enzyme Commission (EC) numbers. The January 2019 release of the UniProt Knowledgebase (UniProtKB) contains around 140 million protein sequences. However, only about half a million of these (UniProtKB/Swiss-Prot) have been reviewed and functionally annotated by expert curators using data extracted from the literature and computational analyses. To reduce the gap between annotated and unannotated protein sequences, it is essential to develop accurate automatic protein function annotation techniques. Results: In this work, we present GrAPFI (Graph-based Automatic Protein Function Inference) for automatically annotating proteins with EC number functional descriptors from a protein domain similarity graph. We validated the performance of GrAPFI using six reference proteomes in UniProtKB/Swiss-Prot, namely Human, Mouse, Rat, Yeast, E. coli and Arabidopsis thaliana. We also compared GrAPFI with existing EC prediction approaches such as ECPred, DEEPre, and SVMProt. The comparison shows that GrAPFI achieves better accuracy and comparable or better coverage than these earlier approaches. Conclusions: GrAPFI is a novel protein function annotation tool that performs automatic inference on a network of proteins that are related according to their domain composition. Our evaluation of GrAPFI shows that it performs better than other state-of-the-art methods. GrAPFI is available at https://gitlab.inria.fr/bsarker/bmc_grapfi.git as a standalone tool written in Python.
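To make the idea of inference over a domain similarity graph more concrete, here is a minimal sketch. It is not the GrAPFI implementation: the Jaccard similarity measure, the 0.3 edge threshold, the weighted-vote rule, the function names, and the toy domain identifiers are all assumptions chosen for illustration, since the abstract does not describe these details.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity between two domain sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_ec(query_domains, annotated, threshold=0.3):
    """Return the EC number with the highest similarity-weighted vote.

    `annotated` is a list of (domain_set, ec_number) pairs. Only neighbours
    whose similarity to the query reaches `threshold` are linked to it.
    Returns None when the query has no neighbour in the graph.
    """
    scores = defaultdict(float)
    for domains, ec in annotated:
        weight = jaccard(query_domains, domains)
        if weight >= threshold:
            scores[ec] += weight
    return max(scores, key=scores.get) if scores else None


# Made-up annotated proteins with placeholder domain identifiers.
annotated = [
    ({"D1", "D2"}, "1.1.1.1"),
    ({"D1", "D2", "D3"}, "1.1.1.1"),
    ({"D4"}, "2.7.11.1"),
]

prediction = predict_ec({"D1", "D2"}, annotated)
print(prediction)
```

A query that shares no domains with any annotated protein gets no prediction in this sketch, which is one simple way to think about the trade-off between accuracy and coverage mentioned in the abstract.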