184 research outputs found

    No full text
    Motivation: De novo transcriptome assembly is an integral part of many RNA-seq workflows. Common applications include sequencing of non-model organisms, cancer transcriptomes, or metatranscriptomes. Most de novo transcriptome assemblers use the de Bruijn graph (DBG) as the underlying data structure, and the quality of the assemblies they produce is highly influenced by the exact word length k. Since no single k-mer value leads to optimal results, DBGs over different k-mer values are built and the assemblies are merged to improve sensitivity. However, the problem of automatically learning at which k-mer value to stop the assembly has not been investigated thoroughly; instead, a suboptimal selection of k-mer values is often used in practice. Results: Here we investigate the contribution of a single k-mer value in a multi-k-mer assembly approach. We find that comparative clustering of related assemblies can be used to estimate the importance of an additional k-mer assembly. Using a model-fit-based algorithm, we predict the k-mer value at which no further assemblies are necessary. Our approach is tested with different de novo assemblers on datasets with different coverage values and read lengths. Further, we suggest a simple post-processing step that significantly improves the quality of multi-k-mer assemblies. Conclusion: We provide an automatic method for limiting the number of k-mer values without a significant loss in assembly quality, but with savings in assembly time. This is a step toward making multi-k-mer methods more reliable and easier to use. Availability and Implementation: A general implementation of our approach can be found at https://github.com/SchulzLab/KREATION. Supplementary information: Supplementary data are available at Bioinformatics online. Contact: [email protected]
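    As a toy illustration of the stopping idea (not the published KREATION algorithm), the number of novel transcript clusters contributed by each successive k-mer assembly can be modelled as a geometric decay, extrapolating the k at which the gain becomes negligible; the function name, spacing assumption, and threshold below are all hypothetical:

    ```python
    def predict_stopping_k(ks, novel_clusters, min_gain=10):
        """ks: ascending k-mer sizes already assembled (assumed evenly spaced).
        novel_clusters: new clusters each successive assembly contributed.
        Returns the first k whose predicted gain drops below min_gain."""
        # Estimate the average decay ratio between consecutive gains.
        ratios = [b / a for a, b in zip(novel_clusters, novel_clusters[1:]) if a > 0]
        decay = sum(ratios) / len(ratios)
        step = ks[1] - ks[0]
        k, gain = ks[-1], novel_clusters[-1]
        # Extrapolate until the predicted gain is negligible.
        while gain >= min_gain:
            k += step
            gain *= decay
        return k

    # Example: gains halve with each k-mer step.
    print(predict_stopping_k([25, 35, 45, 55], [800, 400, 200, 100]))  # → 95
    ```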

    A consensus‑based ensemble approach to improve transcriptome assembly

    Background: Systems-level analyses, such as differential gene expression analysis, co-expression analysis, and metabolic pathway reconstruction, depend on the accuracy of the transcriptome. Multiple tools exist to perform transcriptome assembly from RNA-seq data. However, assembling high-quality transcriptomes is still not a trivial problem, especially for non-model organisms where adequate reference genomes are often not available. Different methods produce different transcriptome models, and there is no easy way to determine which are more accurate. Furthermore, the presence of alternative-splicing events exacerbates such difficult assembly problems. While benchmarking transcriptome assemblies is critical, this is also not trivial due to the general lack of true reference transcriptomes. Results: In this study, we first provide a pipeline to generate a set of simulated benchmark transcriptomes and corresponding RNA-seq data. Using the simulated benchmarking datasets, we compared the performance of various transcriptome assembly approaches, including both de novo and genome-guided methods. The results showed that assembly performance deteriorates significantly when alternative transcripts (isoforms) exist, or for genome-guided methods when the reference is not available from the same genome. To improve transcriptome assembly performance by leveraging the overlapping predictions between different assemblies, we present a new consensus-based ensemble transcriptome assembly approach, ConSemble. Conclusions: Without using a reference genome, ConSemble using four de novo assemblers achieved an accuracy up to twice as high as any de novo assembler we compared. When a reference genome is available, ConSemble using four genome-guided assemblies removed many incorrectly assembled contigs with minimal impact on correctly assembled contigs, achieving higher precision and accuracy than individual genome-guided methods. Furthermore, ConSemble using de novo assemblers matched or exceeded the best-performing genome-guided assemblers even when the transcriptomes included isoforms. We thus demonstrate that the ConSemble consensus strategy can improve transcriptome assembly for both de novo and genome-guided assemblers. The RNA-seq simulation pipeline, the benchmark transcriptome datasets, and the script to perform the ConSemble assembly are all freely available from http://bioinfolab.unl.edu/emlab/consemble/
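    The consensus idea can be sketched in a few lines (illustrative only; the published method's contig-matching criteria and support threshold differ in detail): retain contigs independently recovered by at least `min_support` of the input assemblies.

    ```python
    from collections import Counter

    def consensus_assembly(assemblies, min_support=3):
        """assemblies: list of sets of contig sequences, one set per assembler.
        Returns the consensus set supported by >= min_support assemblers."""
        support = Counter()
        for contigs in assemblies:
            support.update(set(contigs))   # count each assembler at most once
        return {seq for seq, n in support.items() if n >= min_support}

    a = {"ATGAAA", "ATGCCC", "ATGTTT"}
    b = {"ATGAAA", "ATGCCC"}
    c = {"ATGAAA", "ATGGGG"}
    d = {"ATGAAA", "ATGCCC"}
    print(sorted(consensus_assembly([a, b, c, d])))  # → ['ATGAAA', 'ATGCCC']
    ```

    Contigs seen by only one assembler (likely misassemblies) are dropped, which is the source of the precision gain the abstract describes.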

    Comparative Genomic Characterization of the Multimammate Mouse Mastomys coucha.

    Mastomys are the most widespread African rodents and carriers of various pathogens such as the plague bacterium or Lassa virus. In addition, mastomys have rapidly gained a large number of mammary glands. Here, we generated a genome, variome, and transcriptomes for Mastomys coucha. As mastomys diverged at similar times from mouse and rat, we demonstrate their utility as a comparative genomic tool for these commonly used animal models. Furthermore, we identified over 500 mastomys accelerated regions, often residing near important mammary developmental genes or within their exons, leading to protein sequence changes. Functional characterization of a noncoding mastomys accelerated region, located in the HoxD locus, showed enhancer activity in developing mouse mammary glands. Combined, our results provide genomic resources for mastomys and highlight their potential both as a comparative genomic tool and for the identification of factors determining mammary gland number.

    PARSES: A Pipeline for Analysis of RNA-Sequencing Exogenous Sequences

    RNA-Sequencing (RNA-Seq) has become one of the most widely used techniques to interrogate the transcriptome of an organism since the advent of next-generation sequencing technologies [1]. A plethora of tools have been developed to analyze and visualize transcriptome data from RNA-Seq experiments, solving the problem of mapping reads back to the host organism's genome [2, 3]. This allows for analysis of most reads produced by the experiments, but these tools typically discard reads that do not match the reference genome well. This additional information could reveal important insight into the experiment and possible contributing factors to the condition under consideration. We introduce PARSES, a pipeline constructed from existing sequence analysis tools, which allows the user to interrogate RNA-Sequencing experiments for possible biological contamination or the presence of exogenous sequences that may shed light on other factors influencing an organism's condition.
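    The first step of such a pipeline can be sketched as a filter over alignment records (a hypothetical helper, not the pipeline's actual code): collect reads that failed to map to the host genome, i.e. SAM records with the unmapped flag (0x4) set, as candidates for exogenous origin.

    ```python
    def unmapped_reads(sam_lines):
        """Yield (read_id, sequence) for SAM records whose FLAG field has
        the 'read unmapped' bit (0x4) set, skipping header lines."""
        for line in sam_lines:
            if line.startswith("@"):          # header line
                continue
            fields = line.rstrip("\n").split("\t")
            flag = int(fields[1])             # FLAG is the second column
            if flag & 0x4:                    # read itself is unmapped
                yield fields[0], fields[9]    # QNAME, SEQ

    sam = [
        "@HD\tVN:1.6",
        "r1\t0\tchr1\t100\t60\t6M\t*\t0\t0\tATGCAT\tFFFFFF",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tGGGTTT\tFFFFFF",
    ]
    print(list(unmapped_reads(sam)))  # → [('r2', 'GGGTTT')]
    ```

    The surviving reads would then be passed to taxonomic classification against non-host databases.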

    The Oyster River Protocol: a multi-assembler and kmer approach for de novo transcriptome assembly

    Characterizing transcriptomes in non-model organisms has resulted in a massive increase in our understanding of biological phenomena. This boon, largely made possible via high-throughput sequencing, means that studies of functional, evolutionary, and population genomics are now being done by hundreds or even thousands of labs around the world. For many, these studies begin with a de novo transcriptome assembly, which is a technically complicated process involving several discrete steps. The Oyster River Protocol (ORP), described here, implements a standardized and benchmarked set of bioinformatic processes, resulting in an assembly with enhanced qualities over other standard assembly methods. Specifically, ORP-produced assemblies have higher Detonate and TransRate scores and mapping rates, largely because the protocol leverages a multi-assembler, multi-k-mer assembly process, thereby bypassing the shortcomings of any one approach. These improvements are important, as previously unassembled transcripts are included in ORP assemblies, resulting in a significant enhancement of the power of downstream analysis. Further, as part of this study, I show that above 30 million reads, assembly quality is unrelated to the number of reads generated. Code Availability: The version-controlled open-source code is available at https://github.com/macmanes-lab/Oyster_River_Protocol. Instructions for software installation and use, and other details, are available at http://oyster-river-protocol.rtfd.org/
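    A minimal sketch of the merge step behind multi-assembler, multi-k-mer strategies like the one described above (illustrative only; the actual protocol uses orthology-based clustering and evaluates candidates with TransRate and Detonate scores): pool contigs from several assemblies and drop any contig fully contained in a longer one.

    ```python
    def merge_assemblies(*assemblies):
        """Pool contigs from all input assemblies, then discard any contig
        that is a substring of a longer, already-kept contig."""
        pooled = sorted({c for a in assemblies for c in a}, key=len, reverse=True)
        kept = []
        for contig in pooled:
            if not any(contig in longer for longer in kept):
                kept.append(contig)
        return kept

    k25 = ["ATGCATGC", "TTTT"]        # shorter k-mer recovers one extra contig
    k55 = ["ATGCATGCAA", "GGGG"]      # longer k-mer extends a shared contig
    print(sorted(merge_assemblies(k25, k55)))  # → ['ATGCATGCAA', 'GGGG', 'TTTT']
    ```

    The point of the merge is exactly what the abstract claims: transcripts missed by one assembler or k-mer setting survive via another, at the cost of having to deduplicate the pooled result.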

    Transcriptome Analysis for Non-Model Organism: Current Status and Best-Practices

    Since transcriptome analysis provides genome-wide sequence and gene expression information, transcript reconstruction using RNA-Seq sequence reads has become popular in recent years. For non-model organisms, as distinct from reference-genome-based mapping, sequence reads are processed via de novo transcriptome assembly approaches to produce large numbers of contigs corresponding to the coding or non-coding, but expressed, part of the genome. In spite of the immense potential of RNA-Seq-based methods, particularly in recovering full-length transcripts and spliced isoforms from short reads, accurate results can only be obtained when each procedure is carried out carefully, step by step. In this chapter, we aim to provide an overview of the state-of-the-art methods, including (i) quality check and pre-processing of raw reads, (ii) the pros and cons of de novo transcriptome assemblers, (iii) generating non-redundant transcript data, (iv) current quality assessment tools for de novo transcriptome assemblies, (v) approaches for transcript abundance and differential expression estimation, and finally (vi) further mining of transcriptomic data for particular biological questions. Our intention is to provide an overview and practical guidance for choosing the appropriate approaches to best meet the needs of researchers in this area, and also to outline strategies to improve ongoing projects.
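    Step (i) above, pre-processing of raw reads, can be illustrated with a simple 3'-end quality trimmer (thresholds are illustrative; real pipelines use dedicated tools such as Trimmomatic or fastp):

    ```python
    def quality_trim(seq, quals, min_q=20):
        """Trim a read from the 3' end until the last base meets min_q.
        quals: per-base Phred scores as integers, same length as seq."""
        end = len(seq)
        while end > 0 and quals[end - 1] < min_q:
            end -= 1
        return seq[:end]

    # Low-quality tail (Q10, Q5) is removed; Q25 base is kept.
    print(quality_trim("ATGCATGC", [30, 30, 30, 30, 28, 25, 10, 5]))  # → ATGCAT
    ```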

    Investigating the Evolutionary Dynamics of Traits in Metazoa

    Over the last 800 million years, animals have evolved an incredible array of diverse forms, life histories, ecologies, and traits. In the age of genome-scale resources for many animal taxa, researchers have a unique opportunity to investigate animal diversity and evolution through comparative genomic methods. These methods allow for studies not only of current diversity and evolutionary relationships, but also of ancient evolutionary dynamics and genomic repertoires. In order to study the evolution of diverse animal traits in a rigorous way, however, researchers must not neglect the fundamental components of a robust comparative genomics study: well-supported phylogenies, high-quality genomic resources, and ways of applying comparative genomic methods to a phylogenetic tree. Here, I present three studies of animal trait evolution that address each of the three components above. First, I leveraged current bioinformatic technologies to identify biases in phylogenomic studies stemming from transcriptome assembly errors, and determined the best practices for processing transcriptomic data for these studies (Chapter 1). I found that high-quality transcriptome assemblies yield richer datasets that are less prone to bias and ambiguity when used to create phylogenetic trees. Second, I sequenced and assembled a new genomic dataset from a unique marine organism which occupies a crucial position in cnidarian phylogeny (Chapter 2). This new genomic resource is an important contribution to studies of the evolution of novel cell types and mitochondrial structure. Third, I investigated the patterns of gene gain and loss that characterize the evolution of one of the earliest-branching metazoan lineages in a well-supported phylogenomic context (Chapter 3). I established that animals in the phylum Porifera have lost traits associated with most other animal lineages, resulting in a derived form in extant sponges. The findings I lay out in this dissertation add to the growing body of knowledge concerning the evolution of non-bilaterian and early-branching metazoan lineages, while also providing the scientific community with best practices for the accurate study of diverse traits in Metazoa.

    From Pieces To Paths: Combining Disparate Information in Computational Analysis of RNA-Seq.

    As high-throughput sequencing technology has advanced in recent decades, large-scale genomic data with high resolution have been generated for solving various problems in many fields. One of the state-of-the-art sequencing techniques is RNA sequencing, which has been widely used to study the transcriptomes of biological systems through millions of reads. The ultimate goal of RNA sequencing bioinformatics algorithms is to maximally utilize the information stored in a large amount of pieced-together reads to unveil the whole landscape of biological function at the transcriptome level. Many bioinformatics methods and pipelines have been developed for better achieving this goal. However, one central question of RNA sequencing is the prediction uncertainty due to the short read length and the low sampling rate of underexpressed transcripts. Both conditions raise ambiguities in read mapping, transcript assembly, transcript quantification, and even the downstream analysis. This dissertation focuses on approaches to reducing the above uncertainty by incorporating additional information, of disparate kinds, into bioinformatics models and modeling assessments. I addressed three critical issues in RNA sequencing data analysis. (1) We evaluated the performance of current de novo assembly methods and their evaluation methods using the transcript information from a third-generation sequencing platform, which provides a longer sequence length but with a higher error rate than next-generation sequencing; (2) we built a Bayesian graphical model for improving transcript quantification and differentially expressed isoform identification by utilizing the shared information from biological replicates; (3) we built a joint pathway and gene selection model by incorporating pathway structures from an expert database. We conclude that the incorporation of appropriate information from extra resources enables a more reliable assessment and a higher prediction performance in RNA sequencing data analysis.

    Hybrid genome assembly and annotation of Danionella translucida

    Studying neuronal circuits at cellular resolution is very challenging in vertebrates due to the size and optical turbidity of their brains. Danionella translucida, a close relative of zebrafish, was recently introduced as a model organism for investigating neural network interactions in adult individuals. Danionella remains transparent throughout its life, has the smallest known vertebrate brain, and possesses a rich repertoire of complex behaviours. Here we sequenced, assembled, and annotated the Danionella translucida genome, employing a hybrid Illumina/Nanopore read library as well as RNA-seq of embryonic, larval, and adult mRNA. We achieved high assembly contiguity using low-coverage long-read data and annotated a large fraction of the transcriptome. This dataset will pave the way for molecular research and targeted genetic manipulation of this novel model organism.