177 research outputs found

    Applying negative rule mining to improve genome annotation

    Background: Unsupervised annotation of proteins by software pipelines suffers from very high error rates. Spurious functional assignments are usually caused by unwarranted homology-based transfer of information from existing database entries to the new target sequences. We have previously demonstrated that data mining in large sequence annotation databanks can help identify annotation items that are strongly associated with each other, and that exceptions from strong positive association rules often point to potential annotation errors. Here we investigate the applicability of negative association rule mining to revealing erroneously assigned annotation items. Results: Almost all exceptions from strong negative association rules are connected to at least one wrong attribute in the feature combination making up the rule. The fraction of annotation features flagged by this approach as suspicious is strongly enriched in errors and constitutes about 0.6% of the whole body of the similarity-transferred annotation in the PEDANT genome database. Positive rule mining does not identify two thirds of these errors. The approach based on exceptions from negative rules is much more specific than positive rule mining, but its coverage is significantly lower. Conclusion: Mining of both negative and positive association rules is a potent tool for finding significant trends in protein annotation and flagging doubtful features for further inspection.
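The idea of flagging exceptions from strong negative rules can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `negative_rule_exceptions`, the thresholds, and the toy feature names are all assumptions made for illustration. A rule "a implies NOT b" is treated as strong when both features are frequent and records carrying `a` almost never carry `b`; the few records that carry both are reported as suspicious.

```python
from itertools import permutations

def negative_rule_exceptions(records, min_support=5, min_confidence=0.9):
    """Return {(a, b): [record indices]} for strong rules 'a implies NOT b'.

    records is a list of sets of annotation features (toy data here).
    The thresholds are illustrative assumptions, not published values.
    """
    features = sorted(set().union(*records))
    # Indices of the records that carry each feature.
    support = {f: [i for i, r in enumerate(records) if f in r] for f in features}
    flagged = {}
    for a, b in permutations(features, 2):
        # Both sides of the rule must be frequent enough to be meaningful.
        if len(support[a]) < min_support or len(support[b]) < min_support:
            continue
        # Exceptions: records that have a AND b despite the negative rule.
        exceptions = [i for i in support[a] if b in records[i]]
        confidence = 1 - len(exceptions) / len(support[a])
        if exceptions and confidence >= min_confidence:
            flagged[(a, b)] = exceptions
    return flagged

# Toy data: two features that almost never co-occur, plus one exception.
records = ([{"cytoplasmic"} for _ in range(19)]
           + [{"cytoplasmic", "transmembrane"}]
           + [{"transmembrane"} for _ in range(19)])
flagged = negative_rule_exceptions(records)
```

In this toy set the single record carrying both features (index 19) is the only exception, and it is reported for both rule directions.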

    Assignment of isochores for all completely sequenced vertebrate genomes using a consensus

    A new consensus isochore assignment method and a database of isochore maps for all completely sequenced vertebrate genomes are presented

    TargetSpy: a supervised machine learning approach for microRNA target prediction

    [Background] Virtually all currently available microRNA target site prediction algorithms require the presence of a (conserved) seed match to the 5' end of the microRNA. Recently, however, it has been shown that this requirement might be too stringent, leading to a substantial number of missed target sites. [Results] We developed TargetSpy, a novel computational approach for predicting target sites regardless of the presence of a seed match. It is based on machine learning and automatic feature selection using a wide spectrum of compositional, structural, and base pairing features covering current biological knowledge. Our model does not rely on evolutionary conservation, which allows the detection of species-specific interactions and makes TargetSpy suitable for analyzing unconserved genomic sequences. In order to allow for an unbiased comparison of TargetSpy to other methods, we classified all algorithms into three groups: I) no seed match requirement, II) seed match requirement, and III) conserved seed match requirement. TargetSpy predictions for classes II and III are generated by appropriate postfiltering. On a human dataset revealing fold-change in protein production for five selected microRNAs, our method shows superior performance in all classes. In Drosophila melanogaster, not only are our class II and III predictions on par with other algorithms, but the class I (no-seed) predictions are only marginally less accurate. We estimate that TargetSpy predicts between 26 and 112 functional target sites without a seed match per microRNA that are missed by all other currently available algorithms. [Conclusion] Only a few algorithms can predict target sites without demanding a seed match, and TargetSpy demonstrates a substantial improvement in prediction accuracy in that class. Furthermore, when conservation and the presence of a seed match are required, the performance is comparable with state-of-the-art algorithms.
TargetSpy was trained on mouse and performs well in human and Drosophila, suggesting that it may be applicable to a broad range of species. Moreover, we have demonstrated that the application of machine learning techniques in combination with upcoming deep sequencing data results in a powerful microRNA target site prediction tool (http://www.targetspy.org). The work of MH was supported by the Spanish Government (Grant number: BIO2008.01353) and by the Junta de Andalucia (Grant number P07-FQM-03613)

    Evolutionary interplay between symbiotic relationships and patterns of signal peptide gain and loss

    Can orthologous proteins differ in terms of their ability to be secreted? To answer this question, we investigated the distribution of signal peptides within the orthologous groups of Enterobacterales. Parsimony analysis and sequence comparisons revealed a large number of signal peptide gain and loss events, in which signal peptides emerge or disappear in the course of evolution. Signal peptide losses prevail over gains, an effect which is especially pronounced in the transition from the free-living or commensal to the endosymbiotic lifestyle. The disproportionate decline in the number of signal peptide-containing proteins in endosymbionts cannot be explained by the overall reduction of their genomes. Signal peptides can be gained and lost either by acquisition/elimination of the corresponding N-terminal regions or by gradual accumulation of mutations. The evolutionary dynamics of signal peptides in bacterial proteins represents a powerful mechanism of functional diversification
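As an illustration of how parsimony analysis can count gain and loss events for a binary trait such as signal peptide presence, here is a minimal sketch of Fitch parsimony on a rooted binary tree. It is not the procedure used in the study: the tree encoding (nested tuples), the function names, the toy species labels, and the tie-break that prefers presence at an ambiguous root are all assumptions.

```python
def fitch_sets(node, states):
    """Bottom-up Fitch pass: set of equally parsimonious states at a node."""
    if isinstance(node, str):          # leaf: observed state (0 or 1)
        return {states[node]}
    left, right = (fitch_sets(child, states) for child in node)
    # Intersection if non-empty, otherwise union (one change is implied).
    return (left & right) or (left | right)

def count_events(node, states, parent_state=None):
    """Top-down pass: return (gains, losses) below and including this node."""
    candidates = fitch_sets(node, states)
    if parent_state in candidates:
        state = parent_state           # no change needed on this branch
    else:
        # Assumption: when ambiguous (e.g. at the root), prefer presence.
        state = max(candidates)
    gains = int(parent_state == 0 and state == 1)
    losses = int(parent_state == 1 and state == 0)
    if not isinstance(node, str):
        for child in node:
            child_gains, child_losses = count_events(child, states, state)
            gains += child_gains
            losses += child_losses
    return gains, losses

# Toy tree with hypothetical species A-E; 1 = signal peptide present.
tree = ((("A", "B"), "C"), ("D", "E"))
states = {"A": 1, "B": 0, "C": 1, "D": 1, "E": 1}
events = count_events(tree, states)
```

For this toy tree the reconstruction places a single loss on the branch leading to B and no gains. The sketch recomputes the Fitch sets at every node for brevity; a real analysis would cache them.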

    No statistical support for correlation between the positions of protein interaction sites and alternatively spliced regions

    BACKGROUND: Alternative splicing is an efficient mechanism for increasing the variety of functions fulfilled by proteins in a living cell. It has been previously demonstrated that alternatively spliced regions often comprise functionally important and conserved sequence motifs. The objective of this work was to test the hypothesis that alternative splicing is correlated with contact regions of protein-protein interactions. RESULTS: Protein sequence spans involved in contacts with an interaction partner were delineated from atomic structures of transient interaction complexes and juxtaposed with the location of alternatively spliced regions detected by comparative genome analysis and spliced alignment. A total of 42 alternatively spliced isoforms were identified in 21 amino acid chains involved in biomolecular interactions. Using this limited dataset and a variety of sophisticated counting procedures, we were not able to establish a statistically significant correlation between the positions of protein interaction sites and alternatively spliced regions. CONCLUSIONS: This finding contradicts a naïve hypothesis that alternatively spliced regions would correlate with points of contact. One possible explanation is that all alternative splicing events change the spatial structure of the interacting domain to a sufficient degree to preclude interaction. This is indirectly supported by the observed lack of difference in behaviour between relatively short regions affected by alternative splicing and cases in which large portions of proteins are removed. More structural data on complexes of interacting proteins, including structures of alternative isoforms, are needed to test this conjecture

    The PEDANT genome database in 2005

    The PEDANT genome database (http://pedant.gsf.de) contains pre-computed bioinformatics analyses of publicly available genomes. Its main mission is to provide robust automatic annotation of the vast majority of amino acid sequences, which have not been subjected to in-depth manual curation by human experts in high-quality protein sequence databases. By design, PEDANT annotation is genome-oriented, making it possible to explore the genomic context of gene products and to evaluate the functional and structural content of genomes using a category-based query mechanism. At present, the PEDANT database contains exhaustive annotation of over 1 240 000 proteins from 270 eubacterial, 23 archaeal and 41 eukaryotic genomes

    Elongation factor P (-like) protein and polyproline motifs

    Two or more consecutive prolines induce ribosome stalling during translation. In bacteria, elongation factor P (EF-P) efficiently rescues the stalled ribosome and allows protein biosynthesis to continue. A seven-amino-acid loop between beta-strands β3 and β4 is crucial for EF-P function. The residue at the tip of the loop is subject to post-translational modification: lysine is lysylated or arginine is rhamnosylated. We have demonstrated that only those enzymes that are needed for the specific post-translational modification of the tip are encoded in the bacterial genome (EpmA, EpmB and EpmC for EF-P with lysine, and EarP for EF-P with arginine). Phylogenetic analysis has also unveiled an invariant proline in the -2 position of the tip of the loop in EF-Ps that utilize lysine modifications, such as that of Escherichia coli. Bacteria with the arginine modification, like Pseudomonas putida, have on the contrary selected against it. Combining these observations with experimental evidence, we conclude that β3/β4 loop composition is important for functionalization of EF-P by chemically distinct modifications. Some bacterial genomes also encode the elongation factor P-like (EfpL) protein, which shares the same domain architecture with EF-P and has an extended loop eight amino acid residues long. The evolution, sequence and structure of the EfpL protein have been extensively characterized. Using an assay based on luminescence emission and ribosomal profiles, we have shown that EfpL can also relieve the arrest of the ribosome induced by polyproline motifs. We have also observed a negative correlation between the occurrence of a motif in the proteome of Escherichia coli and its stalling strength measured in the luminescence assay.
We hypothesize that motifs that cause strong ribosome stalling are disfavored in protein sequences during evolution due to their impact on the dynamics of translation. Book of abstracts: 4th Belgrade Bioinformatics Conference, June 19-23, 202

    Fold Designability, Distribution, and Disease

    Fold designability has been estimated by the number of families contained in that fold. Here, we show that among orthologous proteins, sequence divergence is higher for folds with greater numbers of families. Folds with greater numbers of families also tend to have families that appear more often in the proteome and to show greater promiscuity (the number of unique “partner” folds that the fold is found with within the same protein). We also find that many disease-related proteins have folds with relatively few families. In particular, a number of these proteins are associated with diseases occurring at high frequency. These results suggest that family counts reflect how certain structures are distributed in nature and are an important characteristic associated with many human diseases

    Cloning and characterization of Enterobacter sakazakii pigment genes and in situ spectroscopic analysis of the pigment

    Enterobacter sakazakii is considered an opportunistic foodborne pathogen that is characterized by the formation of yellow-pigmented colonies. Because of the lack of basic knowledge about Enterobacter sakazakii genetics, the BAC approach and the heterologous expression of the pigment in Escherichia coli were used to elucidate the molecular structure of the genes responsible for pigment production in Enterobacter sakazakii strain ES5. Sequencing and annotation of a 33,025 bp fragment revealed seven ORFs that could be assigned to the carotenoid biosynthesis pathway. The gene cluster had the organization crtE-idi-XYIBZ, with the crtE-idi-XYIB genes putatively transcribed as an operon and the crtZ gene transcribed in the opposite orientation. The carotenogenic nature of the pigment of Enterobacter sakazakii wt was ascertained by in situ analysis using visible microspectroscopy and resonance Raman microspectroscopy