
    A universally applicable method of operon map prediction on minimally annotated genomes using conserved genomic context.

    An important step in understanding the regulation of a prokaryotic genome is the generation of its transcription unit map. The current strongest operon predictor depends on the distributions of intergenic distances (IGD) separating adjacent genes within and between operons. Unfortunately, experimental data on these distance distributions are limited to Escherichia coli and Bacillus subtilis. We suggest a new graph algorithmic approach based on comparative genomics to identify clusters of conserved genes independent of IGD and conservation of gene order. As a consequence, distance distributions of operon pairs for any arbitrary prokaryotic genome can be inferred. For E. coli, the algorithm predicts 854 conserved adjacent pairs with a precision of 85%. The IGD distribution for these pairs is virtually identical to the E. coli operon pair distribution. Statistical analysis of the predicted pair IGD distribution allows estimation of a genome-specific operon IGD cut-off, obviating the requirement for a training set in IGD-based operon prediction. We apply the method to a representative set of eight genomes, and show that these genome-specific IGD distributions differ considerably from each other and from the distribution in E. coli.
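
To make the idea of an IGD cut-off concrete, here is a minimal Python sketch of threshold-based operon pair calling. It is not the graph algorithm from the abstract above: the `Gene` record, the function names, the toy coordinates and the cut-off of 50 bp are all hypothetical, chosen only to illustrate how a genome-specific cut-off might be applied once it has been estimated.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    """Hypothetical gene record; coordinates are 1-based and inclusive."""
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def intergenic_distance(a: Gene, b: Gene) -> int:
    """Bases between the end of `a` and the start of `b` (negative if they overlap)."""
    return b.start - a.end - 1


def predict_operon_pairs(genes: List[Gene], cutoff: int) -> List[Tuple[str, str]]:
    """Call adjacent same-strand genes an operon pair when their IGD is <= cutoff."""
    ordered = sorted(genes, key=lambda g: g.start)
    pairs = []
    for a, b in zip(ordered, ordered[1:]):
        if a.strand == b.strand and intergenic_distance(a, b) <= cutoff:
            pairs.append((a.name, b.name))
    return pairs


# Toy data (invented for illustration only).
genes = [
    Gene("geneA", 100, 400, "+"),
    Gene("geneB", 420, 900, "+"),
    Gene("geneC", 1300, 1800, "+"),
    Gene("geneD", 1810, 2200, "-"),
]
pairs = predict_operon_pairs(genes, cutoff=50)
print(pairs)
```

In this toy example only geneA and geneB are close enough and on the same strand; geneC and geneD are adjacent but on opposite strands, so they are never paired regardless of distance.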

    Expansion of the BioCyc collection of pathway/genome databases to 160 genomes

    The BioCyc database collection is a set of 160 pathway/genome databases (PGDBs) for most eukaryotic and prokaryotic species whose genomes have been completely sequenced to date. Each PGDB in the BioCyc collection describes the genome and predicted metabolic network of a single organism, inferred from the MetaCyc database, which is a reference source on metabolic pathways from multiple organisms. In addition, each bacterial PGDB includes predicted operons for the corresponding species. The BioCyc collection provides a unique resource for computational systems biology, namely global and comparative analyses of genomes and metabolic networks, and a supplement to the BioCyc resource of curated PGDBs. The Omics viewer available through the BioCyc website allows scientists to visualize combinations of gene expression, proteomics and metabolomics data on the metabolic maps of these organisms. This paper discusses the computational methodology by which the BioCyc collection has been expanded, and presents an aggregate analysis of the collection that includes the range of number of pathways present in these organisms, and the most frequently observed pathways. We seek scientists to adopt and curate individual PGDBs within the BioCyc collection. Only by harnessing the expertise of many scientists can we hope to produce biological databases that accurately reflect the depth and breadth of knowledge that the biomedical research community is producing.

    Leaderless genes in bacteria: clue to the evolution of translation initiation mechanisms in prokaryotes

    Background: Shine-Dalgarno (SD) signal has long been viewed as the dominant translation initiation signal in prokaryotes. Recently, leaderless genes, which lack 5'-untranslated regions (5'-UTR) on their mRNAs, have been shown abundant in archaea. However, current large-scale in silico analyses on initiation mechanisms in bacteria are mainly based on the SD-led initiation way, other than the leaderless one. The study of leaderless genes in bacteria remains open, which causes uncertain understanding of translation initiation mechanisms for prokaryotes. Results: Here, we study signals in translation initiation regions of all genes over 953 bacterial and 72 archaeal genomes, then make an effort to construct an evolutionary scenario in view of leaderless genes in bacteria. With an algorithm designed to identify multi-signal in upstream regions of genes for a genome, we classify all genes into SD-led, TA-led and atypical genes according to the category of the most probable signal in their upstream sequences. Particularly, occurrence of TA-like signals about 10 bp upstream to translation initiation site (TIS) in bacteria most probably means leaderless genes. Conclusions: Our analysis reveals that leaderless genes are totally widespread, although not dominant, in a variety of bacteria. Especially for Actinobacteria and Deinococcus-Thermus, more than twenty percent of genes are leaderless. Analyzed in closely related bacterial genomes, our results imply that the change of translation initiation mechanisms, which happens between the genes deriving from a common ancestor, is linearly dependent on the phylogenetic relationship. Analysis on the macroevolution of leaderless genes further shows that the proportion of leaderless genes in bacteria has a decreasing trend in evolution.

    Selenocysteine, pyrrolysine and the unique energy metabolism of methanogenic archaea

    Methanogenic archaea are a group of strictly anaerobic microorganisms characterized by their strict dependence on the process of methanogenesis for energy conservation. Among the archaea, they are also the only known group synthesizing proteins containing selenocysteine or pyrrolysine. All but one of the known archaeal pyrrolysine-containing and all but two of the confirmed archaeal selenocysteine-containing proteins are involved in methanogenesis. Synthesis of these proteins proceeds through suppression of translational stop codons but otherwise the two systems are fundamentally different. This paper highlights these differences and summarizes the recent developments in selenocysteine- and pyrrolysine-related research on archaea and aims to put this knowledge into the context of their unique energy metabolism.

    Predicting the outer membrane proteome of Pasteurella multocida based on consensus prediction enhanced by results integration and manual confirmation

    Background: Outer membrane proteins (OMPs) of Pasteurella multocida have various functions related to virulence and pathogenesis and represent important targets for vaccine development. Various bioinformatic algorithms can predict outer membrane localization and discriminate OMPs by structure or function. The designation of a confident prediction framework by integrating different predictors followed by consensus prediction, results integration and manual confirmation will improve the prediction of the outer membrane proteome. Results: In the present study, we used 10 different predictors classified into three groups (subcellular localization, transmembrane β-barrel protein and lipoprotein predictors) to identify putative OMPs from two available P. multocida genomes: those of avian strain Pm70 and porcine non-toxigenic strain 3480. Predicted proteins in each group were filtered by optimized criteria for consensus prediction: at least two positive predictions for the subcellular localization predictors, three for the transmembrane β-barrel protein predictors and one for the lipoprotein predictors. The consensus predicted proteins were integrated from each group into a single list of proteins. We further incorporated a manual confirmation step including a public database search against PubMed and sequence analyses, e.g. sequence and structural homology, conserved motifs/domains, functional prediction, and protein-protein interactions to enhance the confidence of prediction. As a result, we were able to confidently predict 98 putative OMPs from the avian strain genome and 107 OMPs from the porcine strain genome with 83% overlap between the two genomes. Conclusions: The bioinformatic framework developed in this study has increased the number of putative OMPs identified in P. multocida and allowed these OMPs to be identified with a higher degree of confidence. Our approach can be applied to investigate the outer membrane proteomes of other Gram-negative bacteria.
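
The consensus criteria quoted in this abstract (at least two positive calls among the subcellular localization predictors, three among the β-barrel predictors, one among the lipoprotein predictors, then merging the groups into one list) can be sketched in Python as follows. This is only an illustration under assumptions: the data layout, the group keys, the protein names and the number of predictors per group are hypothetical, and the sketch reads "integrated into a single list" as a union over the three groups.

```python
from typing import Dict, List

# Minimum number of positive predictions per predictor group, as stated in the abstract.
THRESHOLDS = {"localization": 2, "beta_barrel": 3, "lipoprotein": 1}


def consensus_omps(votes: Dict[str, Dict[str, List[bool]]]) -> List[str]:
    """Return proteins that pass the consensus threshold in at least one group.

    `votes` maps protein -> group -> list of per-predictor calls (True = predicted OMP).
    Passing any single group is enough, i.e. the groups are merged as a union
    (an assumption about how the lists were integrated).
    """
    selected = []
    for protein, groups in votes.items():
        if any(sum(groups.get(g, [])) >= t for g, t in THRESHOLDS.items()):
            selected.append(protein)
    return selected


# Invented example calls; the per-group predictor counts are placeholders.
votes = {
    "protein_1": {"localization": [True, True, False],
                  "beta_barrel": [False, False, False, False],
                  "lipoprotein": [False]},
    "protein_2": {"localization": [True, False, False],
                  "beta_barrel": [True, True, False, False],
                  "lipoprotein": [False]},
    "protein_3": {"localization": [False, False, False],
                  "beta_barrel": [False, False, False, False],
                  "lipoprotein": [True]},
}
putative_omps = consensus_omps(votes)
print(putative_omps)
```

Here protein_2 is rejected because it falls one vote short in both the localization and β-barrel groups; in the study, proteins that pass would still go on to the manual confirmation step.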

    Binary particle swarm optimization for operon prediction

    An operon is a fundamental unit of transcription and contains specific functional genes for the construction and regulation of networks at the entire genome level. The correct prediction of operons is vital for understanding gene regulations and functions in newly sequenced genomes. As experimental methods for operon detection tend to be nontrivial and time consuming, various methods for operon prediction have been proposed in the literature. In this study, a binary particle swarm optimization is used for operon prediction in bacterial genomes. The intergenic distance, participation in the same metabolic pathway, the cluster of orthologous groups, the gene length ratio and the operon length are used to design a fitness function. We trained the proper values on the Escherichia coli genome, and used the above five properties to implement feature selection. Finally, our study used the intergenic distance, metabolic pathway and the gene length ratio property to predict operons. Experimental results show that the prediction accuracy of this method reached 92.1%, 93.3% and 95.9% on the Bacillus subtilis genome, the Pseudomonas aeruginosa PAO1 genome and the Staphylococcus aureus genome, respectively. This method has enabled us to predict operons with high accuracy for these three genomes, for which only limited data on the properties of the operon structure exists.
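
For readers unfamiliar with binary particle swarm optimization, the following Python sketch shows one update step of the commonly used sigmoid-based variant: velocities are updated as in ordinary PSO, and each bit is then resampled with probability given by the sigmoid of its velocity. It is a generic illustration, not the paper's implementation; the function name `bpso_step`, the inertia weight `w`, the coefficients `c1` and `c2`, the velocity clamp `v_max`, and the interpretation of each bit (for example, whether an adjacent gene pair is placed in the same operon) are all assumptions.

```python
import math
import random
from typing import List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bpso_step(position: List[int], velocity: List[float],
              personal_best: List[int], global_best: List[int],
              rng: random.Random,
              w: float = 0.7, c1: float = 1.5, c2: float = 1.5,
              v_max: float = 4.0) -> Tuple[List[int], List[float]]:
    """One binary PSO update for a single particle (parameter values are illustrative)."""
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        # Standard PSO velocity update pulling towards personal and global bests.
        v = w * v + c1 * rng.random() * (p - x) + c2 * rng.random() * (g - x)
        # Clamp so the sigmoid never saturates completely.
        v = max(-v_max, min(v_max, v))
        new_velocity.append(v)
        # The bit is set to 1 with probability sigmoid(v).
        new_position.append(1 if rng.random() < sigmoid(v) else 0)
    return new_position, new_velocity


rng = random.Random(0)  # fixed seed so the sketch is reproducible
position = [0, 1, 0, 1, 1]
velocity = [0.0] * 5
new_position, new_velocity = bpso_step(
    position, velocity,
    personal_best=[1, 1, 0, 0, 1],
    global_best=[1, 1, 1, 0, 1],
    rng=rng,
)
print(new_position, new_velocity)
```

A full optimizer would wrap this step in a loop over particles and iterations and score each bit string with a fitness function, which the abstract says was built from properties such as intergenic distance and metabolic pathway membership.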

    Orthologous Transcription Factors in Bacteria Have Different Functions and Regulate Different Genes

    Transcription factors (TFs) form large paralogous gene families and have complex evolutionary histories. Here, we ask whether putative orthologs of TFs, from bidirectional best BLAST hits (BBHs), are evolutionary orthologs with conserved functions. We show that BBHs of TFs from distantly related bacteria are usually not evolutionary orthologs. Furthermore, the false orthologs usually respond to different signals and regulate distinct pathways, while the few BBHs that are evolutionary orthologs do have conserved functions. To test the conservation of regulatory interactions, we analyze expression patterns. We find that regulatory relationships between TFs and their regulated genes are usually not conserved for BBHs in Escherichia coli K12 and Bacillus subtilis. Even in the much more closely related bacteria Vibrio cholerae and Shewanella oneidensis MR-1, predicting regulation from E. coli BBHs has high error rates. Using gene–regulon correlations, we identify genes whose expression pattern differs between E. coli and S. oneidensis. Using literature searches and sequence analysis, we show that these changes in expression patterns reflect changes in gene regulation, even for evolutionary orthologs. We conclude that the evolution of bacterial regulation should be analyzed with phylogenetic trees, rather than BBHs, and that bacterial regulatory networks evolve more rapidly than previously thought.

    Recovery and characterization of viral diversity from aquatic short- and long-read metagenomes

    Viruses are the most abundant biological entities in marine ecosystems and play an essential role in global biogeochemical cycles. They have important ecological functions as drivers of bacterial populations through lytic infections and contribute to bacterial genetic diversification. Unfortunately, their study is severely limited by the difficulty to culture and isolate them in lab conditions. Culture-independent techniques such as metagenomics can complement culture-based approaches to capture more phage diversity. However, the vast majority of viral sequences recovered through these methods are uncharacterized and therefore do not provide any information about their interactions with the bacterial community, a phenomenon that has been named “viral dark matter”. In this thesis, several bioinformatic techniques are applied to both short- and long-read metagenomic datasets to recover biological information from marine viral sequences contained therein. A pipeline for recovering viral sequences based on a reference genome was developed and applied to the study of myophages infecting the alphaproteobacterial SAR11 clade, one of the most abundant bacterioplankton groups in surface marine and freshwater ecosystems. We were able to recover 22 new genomes which include the first genomes of myophages infecting LD12, the SAR11 freshwater clade. These sequences are underrepresented in datasets derived from the viral fraction, suggesting a bias of either technical or biological nature. Surprisingly, this family of phages code for an operon which resembles the secretion system type VIII operon in Escherichia coli. The function of this phage operon is still unknown. Next, a long-read dataset from the Mediterranean Sea was explored for viral contigs to contrast phage recovery between long- and short-read datasets. 
The analysis revealed that while long-read assemblies resulted in viral sequences of better quality, there was a sizable amount of intra-clade viral diversity that was not included in the assemblies. This viral diversity only found in long reads is even greater than previously thought. This untapped diversity could aid biotechnological efforts as evidenced by the discovery of new endolysins. Finally, a tool (Random Forest Assignment of Hosts, or RaFAH) for assigning hosts to phage sequences obtained from metagenomic datasets was created. The tool is based on a machine learning tool trained with phage protein clusters generated de novo. Benchmarking shows that RaFAH is on par with other state-of-the-art classifiers and is able to classify phage contigs at the level of Kingdom, which makes it the first classifier to accurately detect Archaea viruses from metagenomic samples. A feature importance analysis reveals that the protein clusters with the most predictive power are those involved in host recognition.

    Predicting Selective RNA Processing and Stabilization Operons in Clostridium spp.

    In selective RNA processing and stabilization (SRPS) operons, stem–loops (SLs) located at the 3′-UTR region of selected genes can control the stability of the corresponding transcripts and determine the stoichiometry of the operon. Here, for such operons, we developed a computational approach named SLOFE (stem–loop free energy) that identifies the SRPS operons and predicts their transcript- and protein-level stoichiometry at the whole-genome scale using only the genome sequence via the minimum free energy (ΔG) of specific SLs in the intergenic regions within operons. As validated by the experimental approach of differential RNA-Seq, SLOFE identifies genome-wide SRPS operons in Clostridium cellulolyticum with 80% accuracy and reveals that the SRPS mechanism contributes to diverse cellular activities. Moreover, in the identified SRPS operons, SLOFE predicts the transcript- and protein-level stoichiometry, including those encoding cellulosome complexes, ATP synthases, ABC transporter family proteins, and ribosomal proteins. Its accuracy exceeds those of existing in silico approaches in C. cellulolyticum, Clostridium acetobutylicum, Clostridium thermocellum, and Bacillus subtilis. The ability to identify genome-wide SRPS operons and predict their stoichiometry via DNA sequence in silico should facilitate studying the function and evolution of SRPS operons in bacteria.