
    Lost in folding space? Comparing four variants of the thermodynamic model for RNA secondary structure prediction

    Janssen S, Schudoma C, Steger G, Giegerich R. Lost in folding space? Comparing four variants of the thermodynamic model for RNA secondary structure prediction. BMC Bioinformatics. 2011;12(1):429.
    BACKGROUND: Many bioinformatics tools for RNA secondary structure analysis are based on a thermodynamic model of RNA folding. They predict a single, "optimal" structure by free energy minimization, they enumerate near-optimal structures, they compute base pair probabilities and dot plots, representative structures of different abstract shapes, or Boltzmann probabilities of structures and shapes. Although all programs refer to the same physical model, they implement it with considerable variation for different tasks, and little is known about the effects of heuristic assumptions and model simplifications used by the programs on the outcome of the analysis.
    RESULTS: We extract four different models of the thermodynamic folding space which underlie the programs RNAfold, RNAshapes, and RNAsubopt. Their differences lie within the details of the energy model and the granularity of the folding space. We implement probabilistic shape analysis for all models, and introduce the shape probability shift as a robust measure of model similarity. Using four data sets derived from experimentally solved structures, we provide a quantitative evaluation of the model differences.
    CONCLUSIONS: We find that search space granularity affects the computed shape probabilities less than the over- or underapproximation of free energy by a simplified energy model. Still, the approximations perform similarly enough to implementations of the full model to justify their continued use in settings where computational constraints call for simpler algorithms. On the side, we observe that the rarely used level 2 shapes, which predict the complete arrangement of helices, multiloops, internal loops and bulges, include the "true" shape in a rather small number of predicted high probability shapes. This calls for an investigation of new strategies to extract high probability members from the (very large) level 2 shape space of an RNA sequence. We provide implementations of all four models, written in a declarative style that makes them easy to modify. Based on our study, future work on thermodynamic RNA folding may make a choice of model based on our empirical data. It can take our implementations as a starting point for further program development.
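The Boltzmann probabilities of shapes mentioned in this abstract can be illustrated with a minimal sketch. It assumes a small, already enumerated set of structures, each tagged with an abstract shape string and a free energy in kcal/mol. The function name `shape_probabilities` and the toy input are hypothetical and are not taken from the paper or from RNAshapes, which work over the full folding space rather than a short list.

```python
import math
from collections import defaultdict

# Gas constant in kcal/(mol*K) times 37 degrees C in kelvin.
RT = 0.0019872 * 310.15


def shape_probabilities(structures):
    """Sum Boltzmann weights per shape and normalise.

    `structures` is an iterable of (shape_label, free_energy) pairs,
    with free energies in kcal/mol. The input format is an assumption
    made for this illustration only.
    """
    structures = list(structures)
    if not structures:
        raise ValueError("need at least one structure")
    # Shift by the minimum energy so the exponentials stay well scaled;
    # the shift cancels out in the normalisation.
    e_min = min(energy for _, energy in structures)
    weights = defaultdict(float)
    for shape, energy in structures:
        weights[shape] += math.exp(-(energy - e_min) / RT)
    z = sum(weights.values())  # partition function over the listed structures
    return {shape: w / z for shape, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical toy data: two hairpin-like structures and one two-hairpin structure.
    toy = [("[]", -5.0), ("[]", -4.5), ("[][]", -3.0)]
    for shape, p in shape_probabilities(toy).items():
        print(shape, round(p, 4))
```

The probability of a shape is the sum of the Boltzmann weights of all structures that map to it, divided by the total weight, so a shape can be highly probable even when no single member structure is.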

    proGenomes3: approaching one million accurately and consistently annotated high-quality prokaryotic genomes

    The interpretation of genomic, transcriptomic and other microbial 'omics data is highly dependent on the availability of well-annotated genomes. As the number of publicly available microbial genomes continues to increase exponentially, the need for quality control and consistent annotation is becoming critical. We present proGenomes3, a database of 907 388 high-quality genomes containing 4 billion genes that passed stringent criteria and have been consistently annotated using multiple functional and taxonomic databases including mobile genetic elements and biosynthetic gene clusters. proGenomes3 encompasses 41 171 species-level clusters, defined based on universal single copy marker genes, for which pan-genomes and contextual habitat annotations are provided. The database is available at http://progenomes.embl.de/

    Sequence–structure relationships in RNA loops: establishing the basis for loop homology modeling

    The specific function of RNA molecules frequently resides in their seemingly unstructured loop regions. We performed a systematic analysis of RNA loops extracted from experimentally determined three-dimensional structures of RNA molecules. A comprehensive loop-structure data set was created and organized into distinct clusters based on structural and sequence similarity. We detected clear evidence of the hallmark of homology present in the sequence–structure relationships in loops. Loops differing by <25% in sequence identity fold into very similar structures. Thus, our results support the application of homology modeling for RNA loop model building. We established a threshold that may guide the sequence divergence-based selection of template structures for RNA loop homology modeling. Of all possible sequences that are, under the assumption of isosteric relationships, theoretically compatible with actual sequences observed in RNA structures, only a small fraction is contained in the Rfam database of RNA sequences and classes, implying that the actual RNA loop space may consist of a limited number of unique loop structures and conserved sequences. The loop-structure data sets are made available via an online database, RLooM. RLooM also offers functionalities for the modeling of RNA loop structures in support of RNA engineering and design efforts.

    Bioaccumulation in aquatic systems: methodological approaches, monitoring and assessment

    Bioaccumulation, the accumulation of a chemical in an organism relative to its level in the ambient medium, is of major environmental concern. Thus, monitoring of chemical concentrations in biota is widely and increasingly used for assessing the chemical status of aquatic ecosystems. In this paper, various scientific and regulatory aspects of bioaccumulation in aquatic systems and the relevant critical issues are discussed. Monitoring chemical concentrations in biota can be used for compliance checking with regulatory directives, for identification of chemical sources or for event-related environmental risk assessment. Assessing bioaccumulation in the field is challenging since many factors have to be considered that can affect the accumulation of a chemical in an organism. Passive sampling can complement biota monitoring since samplers with standardised partition properties can be used over a wide temporal and geographical range. Bioaccumulation is also assessed for regulation of chemicals of environmental concern, whereby mainly data from laboratory studies on fish bioaccumulation are used. Field data can, however, provide additional important information for regulators. Strategies for bioaccumulation assessment still need to be harmonised for different regulations and groups of chemicals. To create awareness for critical issues and to mutually benefit from technical expertise and scientific findings, communication between risk assessment and monitoring communities needs to be improved. Scientists can support the establishment of new monitoring programs for bioaccumulation, e.g. in the frame of the amended European Environmental Quality Standard Directive.

    How to sequence 10,000 bacterial genomes and retain your sanity: an accessible, efficient and global approach

    Non-typhoidal Salmonella (NTS) are typically associated with enterocolitis and linked to the industrialisation of food production. In recent years, NTS has been associated with invasive disease (iNTS disease) causing an estimated 77,000 deaths each year worldwide; 80% of mortality occurs in sub-Saharan Africa. New clades of S. Typhimurium and S. Enteritidis have been identified, which are characterised by genomic degradation, altered prophage repertoires and novel multidrug-resistant plasmids. To understand how these clades are contributing to the burden and severity of iNTS disease, it is crucial to expand genome-based surveillance to cover more countries, and incorporate historical isolates to generate an evolutionary timeline of the development of iNTS. We developed and validated a robust and inexpensive method for large-scale collection and sequencing of bacterial genomes. The “10,000 Salmonella genomes” project established a worldwide research collaboration to generate information relevant to the epidemiology, drug resistance and virulence factors of Salmonellae using a whole-genome sequencing approach. By streamlining collection of isolates and developing an efficient logistics pipeline, we gathered 10,419 clinical and environmental isolates from collections in low and middle-income countries within six months. Genome sequences are now available for isolates from 51 countries/territories dating from 1949 to 2017, with ~80% representing African and Latin-American datasets. Our method can be applied to other large sample collections that require maximisation of resources within a limited timeframe. Detailed genome analyses are in progress and it is hoped that the resulting data will contribute to public health control strategies in low and middle-income countries.

    An accessible, efficient and global approach for the large-scale sequencing of bacterial genomes

    We have developed an efficient and inexpensive pipeline for streamlining large-scale collection and genome sequencing of bacterial isolates. Evaluation of this method involved a worldwide research collaboration, the 10KSG consortium, focused on the model organism Salmonella enterica. Following the optimization of a logistics pipeline that involved shipping isolates as thermolysates in ambient conditions, the project assembled a diverse collection of 10,419 isolates from low- and middle-income countries. The genomes were sequenced using the LITE pipeline for library construction, with a total reagent cost of less than USD$10 per genome. Our method can be applied to other large bacterial collections to underpin global collaborations.

    An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations

    Advances in genome sequencing and assembly technologies are generating many high-quality genome sequences, but assemblies of large, repeat-rich polyploid genomes, such as that of bread wheat, remain fragmented and incomplete. We have generated a new wheat whole-genome shotgun sequence assembly using a combination of optimized data types and an assembly algorithm designed to deal with large and complex genomes. The new assembly represents >78% of the genome with a scaffold N50 of 88.8 kb that has a high fidelity to the input data. Our new annotation combines strand-specific Illumina RNA-seq and Pacific Biosciences (PacBio) full-length cDNAs to identify 104,091 high-confidence protein-coding genes and 10,156 noncoding RNA genes. We confirmed three known genome rearrangements and identified one novel rearrangement. Our approach enables the rapid and scalable assembly of wheat genomes, the identification of structural variants, and the definition of complete gene models, all powerful resources for trait analysis and breeding of this key global crop.
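For readers unfamiliar with the scaffold N50 statistic quoted in this abstract, the following is a minimal sketch of the usual definition: the length L such that scaffolds of length L or longer together contain at least half of the assembled bases. The function name `n50` and the example lengths are hypothetical and are not data from the wheat assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length
    of the scaffold at which the running total first reaches half of the
    total assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one scaffold length")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:  # integer-safe check for "at least half"
            return length


if __name__ == "__main__":
    # Hypothetical toy assembly, lengths in kb.
    print(n50([80, 70, 50, 40, 30, 20]))
```

Because the statistic depends only on the length distribution, it says nothing about correctness; a high N50 can result from misjoined scaffolds, which is why the abstract also stresses fidelity to the input data.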

    Detection and characterization of 3D-signature phosphorylation site motifs and their contribution towards improved phosphorylation site prediction in proteins

    Background: Phosphorylation of proteins plays a crucial role in the regulation and activation of metabolic and signaling pathways and constitutes an important target for pharmaceutical intervention. Central to the phosphorylation process is the recognition of specific target sites by protein kinases followed by the covalent attachment of phosphate groups to the amino acids serine, threonine, or tyrosine. The experimental identification as well as computational prediction of phosphorylation sites (P-sites) has proved to be a challenging problem. Computational methods have focused primarily on extracting predictive features from the local, one-dimensional sequence information surrounding phosphorylation sites.
    Results: We characterized the spatial context of phosphorylation sites and assessed its usability for improved phosphorylation site predictions. We identified 750 non-redundant, experimentally verified sites with three-dimensional (3D) structural information available in the protein data bank (PDB) and grouped them according to their respective kinase family. We studied the spatial distribution of amino acids around phosphorserines, phosphothreonines, and phosphotyrosines to extract signature 3D-profiles. Characteristic spatial distributions of amino acid residue types around phosphorylation sites were indeed discernable, especially when kinase-family-specific target sites were analyzed. To test the added value of using spatial information for the computational prediction of phosphorylation sites, Support Vector Machines were applied using both sequence as well as structural information. When compared to sequence-only based prediction methods, a small but consistent performance improvement was obtained when the prediction was informed by 3D-context information.
    Conclusion: While local one-dimensional amino acid sequence information was observed to harbor most of the discriminatory power, spatial context information was identified as relevant for the recognition of kinases and their cognate target sites and can be used for an improved prediction of phosphorylation sites. A web-based service (Phos3D) implementing the developed structure-based P-site prediction method has been made available at http://phos3d.mpimp-golm.mpg.de.