    Who Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment

    Multiple sequence alignment (MSA) is a fundamental and ubiquitous technique in bioinformatics used to infer related residues among biological sequences. Thus alignment accuracy is crucial to a vast range of analyses, often in ways difficult to assess in those analyses. To compare the performance of different aligners and help detect systematic errors in alignments, a number of benchmarking strategies have been pursued. Here we present an overview of the main strategies--based on simulation, consistency, protein structure, and phylogeny--and discuss their different advantages and associated risks. We outline a set of desirable characteristics for effective benchmarking, and evaluate each strategy in light of them. We conclude that there is currently no universally applicable means of benchmarking MSA, and that developers and users of alignment tools should choose their benchmark according to the context of application--with a keen awareness of the assumptions underlying each benchmarking strategy.

    DotAligner: Identification and clustering of RNA structure motifs

    The diversity of processed transcripts in eukaryotic genomes poses a challenge for the classification of their biological functions. Sparse sequence conservation in non-coding sequences and the unreliable nature of RNA structure predictions further exacerbate this conundrum. Here, we describe a computational method, DotAligner, for the unsupervised discovery and classification of homologous RNA structure motifs from a set of sequences of interest. Our approach outperforms comparable algorithms at clustering known RNA structure families, both in speed and accuracy. It identifies clusters of known and novel structure motifs from ENCODE immunoprecipitation data for 44 RNA-binding proteins.

    Transcriptome Analysis for Non-Model Organism: Current Status and Best-Practices

    Since transcriptome analysis provides genome-wide sequence and gene expression information, transcript reconstruction using RNA-Seq sequence reads has become popular in recent years. For non-model organisms, in contrast to reference genome-based mapping, sequence reads are processed via de novo transcriptome assembly approaches to produce large numbers of contigs corresponding to the coding or non-coding, but expressed, parts of the genome. Despite the immense potential of RNA-Seq-based methods, particularly in recovering full-length transcripts and spliced isoforms from short reads, accurate results can only be obtained when the procedures are carried out step by step. In this chapter, we aim to provide an overview of the state-of-the-art methods including (i) quality check and pre-processing of raw reads, (ii) the pros and cons of de novo transcriptome assemblers, (iii) generating non-redundant transcript data, (iv) current quality assessment tools for de novo transcriptome assemblies, (v) approaches for transcript abundance and differential expression estimations and finally (vi) further mining of transcriptomic data for particular biological questions. Our intention is to provide an overview and practical guidance for choosing the approaches that best meet the needs of researchers in this area, and to outline strategies for improving ongoing projects.

    Transcriptome Profiling and Long Non-Coding RNA Identification in Grapevine

    Next-generation sequencing technologies have provided access to vast quantities of nucleic acid sequence data. The resulting wealth of information enables biologists to address complex biological questions in species for which a high-quality, well-annotated reference genome sequence has yet to be generated. The cultivated grapevine, Vitis vinifera, has a relatively poorly annotated reference genome. In addition, it is a highly heterozygous species, which further hinders the annotation of its genome and the characterization of its transcriptome. Here, I annotated Version 2 of the 12X V. vinifera genome using RNA-seq data derived from the variety ‘Riesling’ by employing the most up-to-date computational methods. The results provide the first annotation of ‘Riesling’ and the first profile of its transcriptome in relation to the reference transcriptome of the model grape variety ‘Pinot Noir’. In addition, I developed a computational pipeline for the identification of long non-coding RNAs (lncRNAs) in non-model plant species that lack well-sequenced reference genomes. This pipeline was then applied to ‘Riesling’ RNA-seq data for the first analysis of lncRNAs in that variety.

    Phylogenetic Position of the Acariform Mites: Sensitivity to Homology Assessment under Total Evidence

    Background: Mites (Acari) have traditionally been treated as monophyletic, albeit composed of two major lineages: Acariformes and Parasitiformes. Yet recent studies based on morphology, molecular data, or combinations thereof, have increasingly drawn their monophyly into question. Furthermore, the usually basal (molecular) position of one or both mite lineages among the chelicerates is in conflict with their morphology, and with the widely accepted view that mites are close relatives of Ricinulei. Results: The phylogenetic position of the acariform mites is examined using SSU, partial LSU sequences, and morphology from 91 extant chelicerate terminals (forty Acariformes). In a static homology framework, molecular sequences were aligned using their secondary structure as guide, whereby regions of ambiguous alignment were discarded, and pre-aligned sequences analyzed under parsimony and different mixed models in a Bayesian inference. Parsimony and Bayesian analyses led to trees largely congruent concerning infraordinal, well-supported branches, but with low support for inter-ordinal relationships. An exception is Solifugae + Acariformes (P. P = 100%, J. = 0.91). In a dynamic homology framework, two analyses were run: a standard POY analysis and an analysis constrained by secondary structure. Both analyses led to largely congruent trees, supporting a (Palpigradi (Solifugae Acariformes)) clade and Ricinulei as sister group of Tetrapulmonata with the topology (Ricinulei (Amblypygi (Uropygi Araneae))). Combined analyses with two different morphological data matrices were run in order to evaluate the impact of constraining the analysis on the recovered topology when employing secondary structure as a guide for homology establishment. The constrained combined analysis yielded two topologies similar to the exclusively molecular analysis for both morphological matrices, except for the recovery of Pedipalpi instead of the (Uropygi Araneae) clade. The standard (direct optimization) POY analysis, however, led to the recovery of trees differing in the absence of the otherwise well-supported group Solifugae + Acariformes. Conclusions: Previous studies combining ribosomal sequences and morphology often recovered topologies similar to purely morphological analyses of Chelicerata. The apparent stability of certain clades not recovered here, like Haplocnemata and Acari, is regarded as a byproduct of the way the molecular homology was previously established using the instrumentalist approach implemented in POY. Constraining the analysis by a priori homology assessment is defended here as a way of maintaining the severity of the test when adding new data to the analysis. Although the strength of the method advocated here is that it keeps phylogenetic information from regions usually discarded in an exclusively static homology framework, it still has the inconvenience of being uninformative on the effect of alignment ambiguity on resampling methods of clade support estimation. Finally, putative morphological apomorphies of Solifugae + Acariformes are the reduction of the proximal cheliceral podomere, medial abutting of the leg coxae, loss of sperm nuclear membrane, and presence of differentiated germinative and secretory regions in the testis delivering their products into a common lumen.

    Strategies for measuring evolutionary conservation of RNA secondary structures

    Background: Evolutionary conservation of RNA secondary structure is a typical feature of many functional non-coding RNAs. Since almost all of the available methods used for prediction and annotation of non-coding RNA genes rely on this evolutionary signature, accurate measures for structural conservation are essential. Results: We systematically assessed the ability of various measures to detect conserved RNA structures in multiple sequence alignments. We tested three existing and eight novel strategies that are based on metrics of folding energies, metrics of single optimal structure predictions, and metrics of structure ensembles. We find that the folding-energy-based SCI score used in the RNAz program and a simple base-pair distance metric are by far the most accurate. The use of more complex metrics, such as tree editing, does not improve performance. A variant of the SCI performed particularly well on highly conserved alignments and is thus a viable alternative when only little evolutionary information is available. Surprisingly, ensemble-based methods that, in principle, could benefit from the additional information contained in sub-optimal structures perform particularly poorly. As a general trend, we observed that methods that include a consensus structure prediction outperformed equivalent methods that only consider pairwise comparisons. Conclusion: Structural conservation can be measured accurately with relatively simple and intuitive metrics. They have the potential to form the basis of future RNA gene finders that face new challenges like finding lineage-specific structures or detecting mis-aligned sequences.

    Uncovering structural genomic contents of wheat

    The production rate of wheat, an important food source worldwide, is significantly limited by both biotic and abiotic stress factors. Development of stress-resistant cultivars is highly dependent on the understanding of the molecular mechanisms and structural elements in wheat and/or wheat-interacting species. The huge and complex genome of bread wheat (BBAADD genome) stood as a major obstacle to understanding the molecular mechanisms until the recent availability of the wheat reference genome. In this study, we provided improved and/or novel methodologies to reveal structural elements in plants. These methodologies include miRNA identification, manual curation of lncRNAs, identification of lncRNAs using wheat-specific prediction models and a comparative analysis of WES data analysis tools. Using these techniques, we focused on uncovering the structural genomic contents of wheat. With improved identification methodologies and manual annotation of lncRNAs, we revealed several miRNAs and lncRNAs in Triticum turgidum species and wheat stem sawfly (WSS), a major pest of wheat. We provided a comprehensive transcriptome analysis of tetraploid wheat varieties and revealed drought-responsive transcripts. Additionally, we presented the first clues of miRNA mobility between WSS larva and hexaploid wheat. Thereby, besides enriching the genetic information available for wheat species, this study provides important elements driving both abiotic and biotic stress responses in wheat. In this study, we also applied machine learning approaches for the fast and accurate prediction of lncRNAs in wheat species. With annotated genomes of hexaploid and tetraploid wheats, we provided better accuracy scores (99.81%) than the most popular tools available. Finally, we conducted a comparative analysis of the tools used for variant discovery. Among eight aligners and three callers, we chose the best combination for variant calling in wheat. Later, we performed variant calling in 48 lines of elite wheat cultivars using the best tool sets. Overall, this study focused on improving the identification of miRNAs, lncRNAs and structural variations in wheat.

    Linking gene expression and orthology in mammals

    The overall aim of biomedical research is to understand disease mechanisms and to provide a drug to eventually cure the disease. This challenging endeavour requires an early research phase that deals with identifying target genes or proteins playing an important role in the disease. At this stage one uses animal models mimicking human disease to determine differences between healthy and diseased animals. Once potential drug targets have been found, compounds are screened and promising compounds go into the preclinical phase where their efficacy and, most importantly, safety are assessed. Those having proven to be efficacious and safe proceed to toxicology where the maximum tolerable dosage is assessed in, mainly, non-rodent species. According to the Bundesministerium für Ernährung und Landwirtschaft, more than 2 million animals were used for animal testing in German laboratories in 2017. The majority of these animals were mice and rats, but dogs, cats and monkeys are also model organisms used for testing. While it is commonly accepted that other mammalian species resemble human biology to a great extent, one has to bear in mind that there are species-specific differences. One of the aims of this thesis was to investigate how similar widely used model species are to human and to each other on a molecular level. For this purpose we assessed the relationship between protein sequence identity and gene expression correlation with an emphasis on mouse and rat. We found that the majority of genes are highly similar, at both the sequence and gene expression levels. There were, however, cases with low sequence identity but high expression correlation. These cases were investigated in greater detail and the hypothesis that sequences annotated in widely used databases like Ensembl, UniProt, or RefSeq may contain errors or be incomplete was confirmed. Therefore, we investigated whether sequence information from related species can be used to derive a target’s sequence in a species with poor annotation. The a&o-tool was developed to exploit sequence similarity between related species and short-read RNA-Seq data to refine or validate target sequences. Since long-read RNA-Seq data would greatly improve the results as entire transcripts are sequenced as a whole, we conducted a pilot study comparing short- and long-read sequencing data. Even though PacBio’s SMRT sequencing technology still shows some issues with respect to data quality, it is a very promising approach that is going to prove valuable for sequence refinement. Another important goal of this thesis was to develop a score to assess a human target’s conservation across several model species. Publicly available data on the homology relationships between genes and RNA-Seq data form the basis for this score. Using a set of presumably highly conserved genes in human and mouse, we found that the proposed score yields reasonable results. An enrichment of Gene Ontology terms further strengthened our confidence in the conservation score.

    Islands of linkage in an ocean of pervasive recombination reveals two-speed evolution of human cytomegalovirus genomes

    Human cytomegalovirus (HCMV) infects most of the population worldwide, persisting throughout the host's life in a latent state with periodic episodes of reactivation. While typically asymptomatic, HCMV can cause fatal disease among congenitally infected infants and immunocompromised patients. These clinical issues are compounded by the emergence of antiviral resistance and the absence of an effective vaccine, the development of which is likely complicated by the numerous immune evasins encoded by HCMV to counter the host's adaptive immune responses, a feature that facilitates frequent super-infections. Understanding the evolutionary dynamics of HCMV is essential for the development of effective new drugs and vaccines. By comparing viral genomes from uncultivated or low-passaged clinical samples of diverse origins, we observe evidence of frequent homologous recombination events, both recent and ancient, and no structure of HCMV genetic diversity at the whole-genome scale. Analysis of individual gene-scale loci reveals a striking dichotomy: while most of the genome is highly conserved, recombines essentially freely and has evolved under purifying selection, 21 genes display extreme diversity, structured into distinct genotypes that do not recombine with each other. Most of these hyper-variable genes encode glycoproteins involved in cell entry or escape of host immunity. Evidence that half of them have diverged through episodes of intense positive selection suggests that rapid evolution of hyper-variable loci is likely driven by interactions with host immunity. It appears that this process is enabled by recombination unlinking hyper-variable loci from strongly constrained neighboring sites. It is conceivable that viral mechanisms facilitating super-infection have evolved to promote recombination between diverged genotypes, allowing the virus to continuously diversify at key loci to escape immune detection, while maintaining a genome optimally adapted to its asymptomatic infectious lifecycle.