19 research outputs found

    MisPred: a resource for identification of erroneous protein sequences in public databases

    Correct prediction of the structure of protein-coding genes of higher eukaryotes is still a difficult task; therefore, public databases are heavily contaminated with mispredicted sequences. The high rate of misprediction has serious consequences because it significantly affects the conclusions that may be drawn from genome-scale sequence analyses of eukaryotic genomes. Here we present the MisPred database and computational pipeline, which provide an efficient means of identifying erroneous sequences in public databases. The MisPred database contains a collection of abnormal, incomplete and mispredicted protein sequences from 19 metazoan species, identified as erroneous by MisPred quality control tools in the UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, NCBI/RefSeq and EnsEMBL databases. Major releases of the database are automatically generated and updated regularly. The database (http://www.mispred.com) is easily accessible through a simple web interface coupled to a powerful query engine and a standard web service. The content is completely or partially downloadable in a variety of formats.

    FixPred: a resource for correction of erroneous protein sequences.

    Protein databases are heavily contaminated with erroneous (mispredicted, abnormal and incomplete) sequences, and these erroneous data significantly distort the conclusions drawn from genome-scale protein sequence analyses. In our earlier work we described the MisPred resource that serves to identify erroneous sequences; here we present the FixPred computational pipeline that automatically corrects sequences identified by MisPred as erroneous. The current version of the associated FixPred database contains corrected UniProtKB/Swiss-Prot and NCBI/RefSeq sequences from Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Danio rerio, Fugu rubripes, Ciona intestinalis, Branchiostoma floridae, Drosophila melanogaster and Caenorhabditis elegans; future releases of the FixPred database will include corrected sequences of additional metazoan species. The FixPred computational pipeline and database (http://www.fixpred.com) are easily accessible through a simple web interface coupled to a powerful query engine and a standard web service. The content is completely or partially downloadable in a variety of formats. Database URL: http://www.fixpred.com

    Identification and structure-function studies on novel medically important modular proteins

    Because the modular proteins characteristic of vertebrates are of outstanding biomedical importance, it is important to identify these proteins and clarify their function. 1.) Of the multidomain proteins WFIKKN1 and WFIKKN2 that we identified earlier, WFIKKN2 is known to bind myostatin. We characterized the interaction of the proteins and their domains with myostatin by SPR measurements. We determined the three-dimensional structure of the second Kunitz-type domain of WFIKKN1 by NMR spectroscopy. Based on the SPR measurements, the structure of the KU2 domain and the results of our earlier enzyme kinetic measurements, it seems likely that the WFIKKN1 protein also plays a role in regulating the activity of a protein belonging to the TGF-beta family. 2.) We determined the three-dimensional structure of the WIF domain of the WIF-1 protein, which participates in Wnt signaling. The domain structure is characterized by a sandwich formed by eight beta strands, reminiscent of the structure of immunoglobulin domains, and by two alpha helices. The NMR structure determinations were carried out in collaboration with the group of Gottfried Otting. 3.) For a significant proportion of the genes identified by bioinformatic methods, the predicted structure turns out to be incorrect. We developed a method for identifying proteins whose structure contradicts basic rules of protein structure. The method is suitable for identifying abnormal protein forms and offers a means of quality control for gene prediction procedures. The analysis workflow is available at http://mispred.enzim.hu. | 1.) Recently we have identified two novel proteins (WFIKKN1 and WFIKKN2), one of which (WFIKKN2) has been shown to bind to myostatin and inhibit myostatin activity. We have shown that, similarly to the homologous protein, WFIKKN1 also binds to myostatin and thus may also participate in the regulation of the activity of TGF-beta family members. 2.) In collaboration with the NMR group of Gottfried Otting we have determined the three-dimensional structure of the WIF domain of Wnt-Inhibitory-Factor-1. The fold consists of an eight-stranded beta-sandwich reminiscent of the immunoglobulin fold. 3.) The predicted structure of a significant proportion of the computationally predicted genes is incorrect; therefore we have developed tools for the identification of mispredicted genes. The rationale of this approach is that a gene is suspected to be mispredicted if some features of the encoded protein conflict with our current knowledge about proteins. With the help of this tool we have shown that a significant proportion of mRNAs produced by alternative splicing encode non-viable, aberrant proteins with no physiological role.

    Identification and correction of abnormal, incomplete and mispredicted proteins in public databases

    Background: Despite significant improvements in computational annotation of genomes, sequences of abnormal, incomplete or incorrectly predicted genes and proteins remain abundant in public databases. Since the majority of incomplete, abnormal or mispredicted entries are not annotated as such, these errors seriously affect the reliability of these databases. Here we describe the MisPred approach that may provide an efficient means for the quality control of databases. The current version of the MisPred approach uses five distinct routines for identifying abnormal, incomplete or mispredicted entries based on the principle that a sequence is likely to be incorrect if some of its features conflict with our current knowledge about protein-coding genes and proteins: (i) conflict between the predicted subcellular localization of proteins and the absence of the corresponding sequence signals; (ii) presence of extracellular and cytoplasmic domains and the absence of transmembrane segments; (iii) co-occurrence of extracellular and nuclear domains; (iv) violation of domain integrity; (v) chimeras encoded by two or more genes located on different chromosomes. Results: Analyses of predicted EnsEMBL protein sequences of nine deuterostome (Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Fugu rubripes, Danio rerio and Ciona intestinalis) and two protostome species (Caenorhabditis elegans and Drosophila melanogaster) have revealed that the absence of expected signal peptides and violation of domain integrity account for the majority of mispredictions. Analyses of sequences predicted by NCBI's GNOMON annotation pipeline show that the rates of mispredictions are comparable to those of EnsEMBL. Interestingly, even the manually curated UniProtKB/Swiss-Prot dataset is contaminated with mispredicted or abnormal proteins, although to a much lesser extent than UniProtKB/TrEMBL or the EnsEMBL or GNOMON-predicted entries. Conclusion: MisPred works efficiently in identifying errors in predictions generated by the most reliable gene prediction tools such as the EnsEMBL and NCBI's GNOMON pipelines and also guides the correction of errors. We suggest that application of the MisPred approach will significantly improve the quality of gene predictions and the associated databases.
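The first three consistency checks listed in this abstract can be illustrated with a minimal sketch. This is not the MisPred implementation: the `ProteinAnnotation` record, its field names, the flag strings and the exact form of rule (i) are all assumptions made for illustration, and the sketch presumes that signal peptides, transmembrane segments and domain localization classes have already been predicted by some upstream tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProteinAnnotation:
    """Hypothetical per-sequence annotation (illustrative field names)."""
    has_signal_peptide: bool = False
    transmembrane_segments: int = 0
    # Localization class of each detected domain:
    # "extracellular", "cytoplasmic" or "nuclear".
    domain_classes: List[str] = field(default_factory=list)


def mispred_flags(p: ProteinAnnotation) -> List[str]:
    """Return the names of the consistency rules that the annotation violates."""
    classes = set(p.domain_classes)
    flags = []
    # Rule (i), simplified assumption: extracellular domains imply export,
    # so a signal peptide or a transmembrane segment is expected.
    if ("extracellular" in classes and not p.has_signal_peptide
            and p.transmembrane_segments == 0):
        flags.append("no_secretory_signal")
    # Rule (ii): extracellular and cytoplasmic domains in one protein
    # require at least one membrane-spanning segment between them.
    if ({"extracellular", "cytoplasmic"} <= classes
            and p.transmembrane_segments == 0):
        flags.append("missing_transmembrane_segment")
    # Rule (iii): extracellular and nuclear domains should not co-occur.
    if {"extracellular", "nuclear"} <= classes:
        flags.append("extracellular_and_nuclear")
    return flags


# A candidate with both domain classes but no membrane-spanning segment.
suspect = ProteinAnnotation(has_signal_peptide=True,
                            transmembrane_segments=0,
                            domain_classes=["extracellular", "cytoplasmic"])
print(mispred_flags(suspect))
```

Rules (iv) and (v) are left out because they need domain-length profiles and chromosomal mapping data, which cannot be represented in a record this small.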

    Reassessing Domain Architecture Evolution of Metazoan Proteins: Major Impact of Errors Caused by Confusing Paralogs and Epaktologs

    In the accompanying paper (Nagy, Szláma, Szarka, Trexler, Bányai, Patthy, Reassessing Domain Architecture Evolution of Metazoan Proteins: Major Impact of Gene Prediction Errors) we showed that, in the case of UniProtKB/TrEMBL, RefSeq, EnsEMBL and NCBI’s GNOMON predicted protein sequences of Metazoan species, the contribution of erroneous (incomplete, abnormal, mispredicted) sequences to domain architecture (DA) differences of orthologous proteins might be greater than that of true gene rearrangements. Based on these findings, we suggest that earlier genome-scale studies based on comparison of predicted (frequently mispredicted) protein sequences may have led to some erroneous conclusions about the evolution of novel domain architectures of multidomain proteins. In this manuscript we examine the impact of confusing paralogous and epaktologous multidomain proteins (i.e., those that are related only through the independent acquisition of the same domain types) on conclusions drawn about DA evolution of multidomain proteins in Metazoa. To estimate the contribution of this type of error we have used as reference UniProtKB/Swiss-Prot sequences from protein families with well-characterized evolutionary histories. We have used two types of paralogy-group construction procedures and monitored the impact of various parameters on the separation of true paralogs from epaktologs on correctly annotated Swiss-Prot entries of multidomain proteins. Our studies have shown that, although public protein family databases are contaminated with epaktologs, analysis of the structure of sequence similarity networks of multidomain proteins provides an efficient means for the separation of epaktologs and paralogs. We have also demonstrated that contamination of protein families with epaktologs increases the apparent rate of DA change and introduces a bias in DA differences inasmuch as it increases the proportion of terminal over internal DA differences. We have shown that confusing paralogous and epaktologous multidomain proteins significantly increases the apparent rate of DA change in Metazoa and introduces a positional bias in favor of terminal over internal DA changes. Our findings caution that earlier studies based on analysis of datasets of protein families that were contaminated with epaktologs may have led to some erroneous conclusions about the evolution of novel domain architectures of multidomain proteins. A reassessment of the DA evolution of multidomain proteins is presented in an accompanying paper [1].