4,213 research outputs found

    REPARATION : ribosome profiling assisted (re-)annotation of bacterial genomes

    Prokaryotic genome annotation is highly dependent on automated methods, as manual curation cannot keep up with the exponential growth of sequenced genomes. Current automated methods depend heavily on sequence composition and often underestimate the complexity of the proteome. We developed RibosomE Profiling Assisted (re-)AnnotaTION (REPARATION), a de novo machine learning algorithm that takes advantage of experimental protein synthesis evidence from ribosome profiling (Ribo-seq) to delineate translated open reading frames (ORFs) in bacteria, independent of genome annotation (https://github.com/Biobix/REPARATION). REPARATION evaluates all possible ORFs in the genome and estimates minimum thresholds based on a growth curve model to screen for spurious ORFs. We applied REPARATION to three annotated bacterial species to obtain a more comprehensive mapping of their translation landscape in support of experimental data. In all cases, we identified hundreds of novel (small) ORFs, including variants of previously annotated ORFs, and >70% of all (variants of) annotated protein-coding ORFs were predicted by REPARATION to be translated. Our predictions are supported by matching mass spectrometry proteomics data, sequence composition and conservation analysis. REPARATION is unique in that it makes use of experimental translation evidence to intrinsically perform a de novo ORF delineation in bacterial genomes irrespective of the sequence features linked to open reading frames.

    EVOLUTION AND DYNAMICS OF TRANSCRIPTIONAL REGULATION IN BACTERIA

    Although transcription is one of the most important biological functions of cells, our understanding of its regulation is still limited. In this dissertation, we studied three aspects of transcriptional regulation in prokaryotes. First, we investigated the extent to which cis-regulatory elements are conserved during the course of evolution, using the LexA regulons in cyanobacteria as an example. We found that in most cyanobacterial genomes analyzed, LexA appears to function as the transcriptional regulator of the key SOS response genes. The loss of lexA in some genomes might lead to the degradation of its binding sites. Second, directional RNA-seq techniques have recently become the workhorse for transcriptome profiling in prokaryotes; however, it is challenging to accurately assemble highly labile prokaryotic transcriptomes for further analyses. To fill this gap, we developed a hidden Markov model-based transcriptome assembler which outperforms the state-of-the-art assemblers. Using our tool, we characterized alternative operon structures in E. coli K12 under various growth conditions and growth phases, and found that they are more complex and dynamic than previously anticipated. Lastly, we determined antisense and non-coding transcription patterns in E. coli K12 under various growth conditions and time points. We found that a large portion of genes have antisense transcription in a condition-dependent manner. Most antisense transcripts are initiated at and restricted to the 5′-end of the gene on the sense strand, and their expression levels are correlated with those of the genes on the sense strand, suggesting that these antisense transcripts might play an important role in transcriptional regulation.

    Expansion of the BioCyc collection of pathway/genome databases to 160 genomes

    The BioCyc database collection is a set of 160 pathway/genome databases (PGDBs) for most eukaryotic and prokaryotic species whose genomes have been completely sequenced to date. Each PGDB in the BioCyc collection describes the genome and predicted metabolic network of a single organism, inferred from the MetaCyc database, which is a reference source on metabolic pathways from multiple organisms. In addition, each bacterial PGDB includes predicted operons for the corresponding species. The BioCyc collection provides a unique resource for computational systems biology, namely global and comparative analyses of genomes and metabolic networks, and a supplement to the BioCyc resource of curated PGDBs. The Omics viewer available through the BioCyc website allows scientists to visualize combinations of gene expression, proteomics and metabolomics data on the metabolic maps of these organisms. This paper discusses the computational methodology by which the BioCyc collection has been expanded, and presents an aggregate analysis of the collection that includes the range of the number of pathways present in these organisms and the most frequently observed pathways. We seek scientists to adopt and curate individual PGDBs within the BioCyc collection. Only by harnessing the expertise of many scientists can we hope to produce biological databases that accurately reflect the depth and breadth of knowledge that the biomedical research community is producing.

    Genome sequence of the Lebeckia ambigua-nodulating 'Burkholderia sprentiae' strain WSM5005T

    "Burkholderia sprentiae" strain WSM5005T is an aerobic, motile, Gram-negative, non-spore-forming rod that was isolated in Australia from an effective N2-fixing root nodule of Lebeckia ambigua collected in Klawer, Western Cape of South Africa, in October 2007. Here we describe the features of "Burkholderia sprentiae" strain WSM5005T, together with the genome sequence and its annotation. The 7,761,063 bp high-quality-draft genome is arranged in 8 scaffolds of 236 contigs, contains 7,147 protein-coding genes and 76 RNA-only encoding genes, and is one of 20 rhizobial genomes sequenced as part of the DOE Joint Genome Institute 2010 Community Sequencing Program.

    Genomic data mining for the computational prediction of small non-coding RNA genes

    The objective of this research is to develop a novel computational prediction algorithm for non-coding RNA (ncRNA) genes using features computable for any genomic sequence without the need for comparative analysis. Existing comparative-based methods require knowledge of closely related organisms in order to search for sequence and structural similarities. This approach imposes constraints on the type of ncRNAs, the organism, and the regions where the ncRNAs can be found. We have developed a novel approach for ncRNA gene prediction without the limitations of current comparative-based methods. Our work has established an ncRNA database required for subsequent feature and genomic analysis. Furthermore, we have identified significant features from folding-, structural-, and ensemble-based statistics for use in ncRNA prediction. We have also examined higher-order gene structures, namely operons, to discover potential insights into how ncRNAs are transcribed. Automatic identification of ncRNAs on a genome-wide scale is immensely valuable for incorporation into pipelines for large-scale genome annotation. This work will contribute to a more comprehensive annotation of ncRNA genes in microbial genomes to meet the demands of functional and regulatory genomic studies. Ph.D. Committee Chair: Dr. G. Tong Zhou; Committee Member: Dr. Arthur Koblasz; Committee Member: Dr. Eberhard Voit; Committee Member: Dr. Xiaoli Ma; Committee Member: Dr. Ying X

    A computational genomics pipeline for prokaryotic sequencing projects

    Motivation: New sequencing technologies have accelerated research on prokaryotic genomes and have made genome sequencing operations outside major genome sequencing centers routine. However, no off-the-shelf solution exists for the combined assembly, gene prediction, genome annotation and data presentation necessary to interpret sequencing data. The resulting requirement to invest significant resources into custom informatics support for genome sequencing projects remains a major impediment to the accessibility of high-throughput sequence data

    Genome sequence of the Ornithopus/Lupinus-nodulating Bradyrhizobium sp. strain WSM471

    Bradyrhizobium sp. strain WSM471 is an aerobic, motile, Gram-negative, non-spore-forming rod that was isolated from an effective nitrogen (N2)-fixing root nodule formed on the annual legume Ornithopus pinnatus (Miller) Druce growing at Oyster Harbour, Albany district, Western Australia, in 1982. This strain is in commercial production as an inoculant for Lupinus and Ornithopus. Here we describe the features of Bradyrhizobium sp. strain WSM471, together with genome sequence information and annotation. The 7,784,016 bp high-quality-draft genome is arranged in 1 scaffold of 2 contigs, contains 7,372 protein-coding genes and 58 RNA-only encoding genes, and is one of 20 rhizobial genomes sequenced as part of the DOE Joint Genome Institute 2010 Community Sequencing Program.

    Experimental annotation of post-translational features and translated coding regions in the pathogen Salmonella Typhimurium

    Background: Complete and accurate genome annotation is crucial for comprehensive and systematic studies of biological systems. However, determining protein-coding genes for most new genomes is almost completely performed by inference using computational predictions with significant documented error rates (>15%). Furthermore, gene prediction programs provide no information on biologically important post-translational processing events critical for protein function. Results: We experimentally annotated the bacterial pathogen Salmonella Typhimurium 14028, using "shotgun" proteomics to accurately uncover the translational landscape and post-translational features. The data provide protein-level experimental validation for approximately half of the predicted protein-coding genes in Salmonella and suggest revisions to several genes that appear to have incorrectly assigned translational start sites, including a potential novel alternate start codon. Additionally, we uncovered 12 non-annotated genes missed by gene prediction programs, as well as evidence suggesting a role for one of these novel ORFs in Salmonella pathogenesis. We also characterized post-translational features in the Salmonella genome, including chemical modifications and proteolytic cleavages. We find that bacteria have a much larger and more complex repertoire of chemical modifications than previously thought, including several novel modifications. Our in vivo proteolysis data identified more than 130 signal peptide and N-terminal methionine cleavage events critical for protein function. Conclusion: This work highlights several ways in which application of proteomics data can improve the quality of genome annotations to facilitate novel biological insights and provides a comprehensive proteome map of Salmonella as a resource for systems analysis.