259 research outputs found

    A memory-efficient algorithm to obtain splicing graphs and de novo expression estimates from de Bruijn graphs of RNA-Seq data

    The recent advance of high-throughput sequencing makes it feasible to study entire transcriptomes through the application of de novo sequence assembly algorithms. While a popular strategy is to first construct an intermediate de Bruijn graph structure to represent the transcriptome, an additional step is needed to construct predicted transcripts from the graph. Since the de Bruijn graph contains all branching possibilities, we develop a memory-efficient algorithm to recover alternative splicing information and library-specific expression information directly from the graph without prior genomic knowledge. We implement the algorithm as a postprocessing module of the Velvet assembler. We validate our algorithm by simulating the transcriptome assembly of Drosophila using its known genome, and by performing Drosophila transcriptome assembly using publicly available RNA-Seq libraries. Under a range of conditions, our algorithm recovers sequences and alternative splicing junctions with higher specificity than Oases or Trans-ABySS. Since our postprocessing algorithm does not consume as much memory as Velvet and is less memory-intensive than Oases, it allows biologists to assemble large libraries with limited computational resources. Our algorithm has been applied to perform transcriptome assembly of the non-model blow fly Lucilia sericata that was reported in a previous article, which shows that the assembly is of high quality and that it facilitates comparison of the Lucilia sericata transcriptome to Drosophila and two mosquitoes, prediction and experimental validation of alternative splicing, investigation of differential expression among various developmental stages, and identification of transposable elements. The open access fee for this work was funded through the Texas A&M University Open Access to Knowledge (OAK) Fund
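To illustrate why branching in a de Bruijn graph can carry alternative-splicing information, the following is a minimal sketch and not the algorithm from the abstract above. It assumes error-free reads given as plain strings; the function names `build_de_bruijn` and `branching_nodes` are hypothetical. Nodes are (k-1)-mers, each k-mer contributes one edge, and a node with more than one successor marks a point where two sequences diverge.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer adds one edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branching_nodes(graph):
    """Return nodes with more than one successor (candidate divergence points)."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)

# Two toy "isoforms" that share the prefix ACG and then diverge.
reads = ["ACGTC", "ACGAC"]
graph = build_de_bruijn(reads, k=3)
print(branching_nodes(graph))
```

In real data, sequencing errors and repeats also create branches, so a branch alone is only a candidate for a splicing junction.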

    Novel graph based algorithms for transcriptome sequence analysis

    RNA-sequencing (RNA-seq) is one of the most widely used techniques in molecular biology. A key bioinformatics task in any RNA-seq workflow is assembling the reads. As the size of transcriptomics data sets is constantly increasing, scalable and accurate assembly approaches have to be developed. Here, we propose several approaches to improve the assembly of RNA-seq data generated by second-generation sequencing technologies. We demonstrated that the systematic removal of irrelevant reads from a high-coverage dataset prior to assembly reduces runtime and improves the quality of the assembly. Further, we propose a novel RNA-seq assembly workflow comprising read error correction, normalization, assembly with informed parameter selection, and transcript-level expression computation. In recent years, the popularity of third-generation sequencing technologies has increased, as long reads allow for accurate isoform quantification and gene-fusion detection, which is essential for biomedical research. We present a sequence-to-graph alignment method to detect and quantify transcripts in third-generation sequencing data. We also propose the first gene-fusion prediction tool that is specifically tailored to long-read data and hence achieves accurate expression estimation even on complex data sets. Moreover, our method predicted experimentally verified fusion events along with some novel events, which can be validated in the future

    A scalable and memory-efficient algorithm for de novo transcriptome assembly of non-model organisms

    Background: With increased availability of de novo assembly algorithms, it is feasible to study entire transcriptomes of non-model organisms. While algorithms are available that are specifically designed for performing transcriptome assembly from high-throughput sequencing data, they are very memory-intensive, limiting their applications to small data sets with few libraries. Results: We develop a transcriptome assembly algorithm that recovers alternatively spliced isoforms and expression levels while utilizing as many RNA-Seq libraries as possible that contain hundreds of gigabases of data. New techniques are developed so that computations can be performed on a computing cluster with a moderate amount of physical memory. Conclusions: Our strategy minimizes memory consumption while simultaneously obtaining comparable or improved accuracy over existing algorithms. It provides support for incremental updates of assemblies when new libraries become available

    Data structures and algorithms for analysis of alternative splicing with RNA-Seq data

    SUFFIX TREE, MINWISE HASHING AND STREAMING ALGORITHMS FOR BIG DATA ANALYSIS IN BIOINFORMATICS

    In this dissertation, we worked on several algorithmic problems in bioinformatics using mainly three approaches: (a) a streaming model, (b) suffix-tree based indexing, and (c) minwise-hashing (minhash) and locality-sensitive hashing (LSH). The streaming models are useful for large data problems where a good approximation needs to be achieved with limited space usage. We developed an approximation algorithm (Kmer-Estimate) using the streaming approach to obtain a better estimation of the frequency of k-mer counts. A k-mer, a subsequence of length k, plays an important role in many bioinformatics analyses such as genome distance estimation. We also developed new methods that use the suffix tree, a trie data structure, for alignment-free, non-pairwise algorithms for a conserved non-coding sequence (CNS) identification problem. We provided two different algorithms: STAG-CNS to identify exact-matched CNSs and DiCE to identify CNSs with mismatches. Using our algorithms, CNSs among various grass species were identified. A different approach was employed for identification of longer CNSs ( 100 bp, mostly found in animals). In our new method (MinCNE), the minhash approach was used to estimate the Jaccard similarity. Also using LSH, k-mers extracted from genomic sequences were clustered and CNSs were identified. Another new algorithm (MinIsoClust) that also uses minhash and LSH techniques was developed for an isoform clustering problem. Isoforms are generated from the same gene but by alternative splicing. As the isoform sequences share some exons but in different combinations, regular sequencing clustering methods do not work well. Our algorithm generates clusters for isoform sequences based on their shared minhash signatures. Finally, we discuss de novo transcriptome assembly algorithms and how to improve the assembly accuracy using ensemble approaches. First, we did a comprehensive performance analysis on different transcriptome assemblers using simulated benchmark datasets. 
Then, we developed a new ensemble approach (Minsemble) for the de novo transcriptome assembly problem that integrates isoform-clustering using minhash technique to identify potentially correct transcripts from various de novo transcriptome assemblers. Minsemble identified more correctly assembled transcripts as well as genes compared to other de novo and ensemble methods. Adviser: Jitender S. Deogu
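The minhash idea mentioned in this abstract can be sketched as follows. This is an illustrative toy and not the MinCNE or MinIsoClust implementation. The helper names (`kmers`, `minhash_signature`, `estimate_jaccard`), the use of seeded BLAKE2b hashes, and the signature length of 128 are all assumptions made for the example. The fraction of signature positions on which two k-mer sets agree estimates their Jaccard similarity.

```python
import hashlib

def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def _hash(kmer, seed):
    """Seeded 64-bit hash; each seed acts as one independent hash function."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(kmer_set, num_hashes=128):
    """Keep the minimum hash value per hash function (kmer_set must be non-empty)."""
    return [min(_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)

a = kmers("ACGTACGTGACCTGA", 4)
b = kmers("ACGTACGTGACGTGA", 4)
print(exact_jaccard(a, b), estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The estimate fluctuates around the exact value. Longer signatures reduce the variance at the cost of more hashing.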

    New Algorithms for Fast and Economic Assembly: Advances in Transcriptome and Genome Assembly

    Great efforts have been devoted to decipher the sequence composition of the genomes and transcriptomes of diverse organisms. Continuing advances in high-throughput sequencing technologies have led to a decline in associated costs, facilitating a rapid increase in the amount of available genetic data. In particular genome studies have undergone a fundamental paradigm shift where genome projects are no longer limited by sequencing costs, but rather by computational problems associated with assembly. There is an urgent demand for more efficient and more accurate methods. Most recently, “hybrid” methods that integrate short- and long-read data have been devised to address this need. LazyB is a new, low-cost hybrid genome assembler. It starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs. This graph is translated into a long-read overlap graph. By design, unitigs are both unique and almost free of assembly errors. As a consequence, only few spurious overlaps are introduced into the graph. Instead of the more conventional approach of removing tips, bubbles, and other local features, LazyB extracts subgraphs whose global properties approach a disjoint union of paths in multiple steps, utilizing properties of proper interval graphs. A prototype implementation of LazyB, entirely written in Python, not only yields significantly more accurate assemblies of the yeast, fruit fly, and human genomes compared to state-of-the-art pipelines, but also requires much less computational effort. An optimized C++ implementation dubbed MuCHSALSA further significantly reduces resource demands. Advances in RNA-seq have facilitated tremendous insights into the role of both coding and non-coding transcripts. Yet, the complete and accurate annotation of the transcriptomes of even model organisms has remained elusive. 
RNA-seq produces reads significantly shorter than the average distance between related splice events and presents high noise levels and other biases. The computational reconstruction remains a critical bottleneck. Ryūtō implements an extension of common splice graphs facilitating the integration of reads spanning multiple splice sites and paired-end reads bridging distant transcript parts. The decomposition of read coverage patterns is modeled as a minimum-cost flow problem. Using phasing information from multi-splice and paired-end reads, nodes with uncertain connections are decomposed step-wise via Linear Programming. Ryūtō's performance compares favorably with state-of-the-art methods on both simulated and real-life datasets. Despite ongoing research and our own contributions, progress on traditional single sample assembly has brought no major breakthrough. Multi-sample RNA-Seq experiments provide more information which, however, is challenging to utilize due to the large amount of accumulating errors. An extension to Ryūtō enables the reconstruction of consensus transcriptomes from multiple RNA-seq data sets, incorporating consensus calling at low level features. Benchmarks show stable improvements already at 3 replicates. Ryūtō outperforms competing approaches, providing a better and user-adjustable sensitivity-precision trade-off. Ryūtō consistently improves assembly on replicates, demonstrable also when mixing conditions or time series and for differential expression analysis. Ryūtō's approach towards guided assembly is equally unique. 
It allows users to adjust results based on the quality of the guide, even for multi-sample assembly.
Table of contents: 1 Preface 1.1 Assembly: A vast and fast evolving field 1.2 Structure of this Work 1.3 Available 2 Introduction 2.1 Mathematical Background 2.2 High-Throughput Sequencing 2.3 Assembly 2.4 Transcriptome Expression 3 From LazyB to MuCHSALSA - Fast and Cheap Genome Assembly 3.1 Background 3.2 Strategy 3.3 Data preprocessing 3.4 Processing of the overlap graph 3.5 Post Processing of the Path Decomposition 3.6 Benchmarking 3.7 MuCHSALSA – Moving towards the future 4 Ryūtō - Versatile, Fast, and Effective Transcript Assembly 4.1 Background 4.2 Strategy 4.3 The Ryūtō core algorithm 4.4 Improved Multi-sample transcript assembly with Ryūtō 5 Conclusion & Future Work 5.1 Discussion and Outlook 5.2 Summary and Conclusio

    Compressed weighted de Bruijn graphs

    We propose a new compressed representation for weighted de Bruijn graphs, which is based on the idea of delta-encoding the variations of k-mer abundances on a spanning branching of the graph. Our new data structure is likely to be of practical value: to give an idea, when combined with the compressed BOSS de Bruijn graph representation, it encodes the weighted de Bruijn graph of a 16x-covered DNA read-set (60M distinct k-mers, k = 28) within 4.15 bits per distinct k-mer and can answer abundance queries in about 60 microseconds on a standard machine. In contrast, state of the art tools declare a space usage of at least 30 bits per distinct k-mer for the same task, which is confirmed by our experiments. As a by-product of our new data structure, we exhibit efficient compressed data structures for answering partial sums on edge-weighted trees, which might be of independent interest
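The core idea of delta-encoding abundances along a spanning structure can be sketched as below. This is a deliberately uncompressed toy and not the data structure from the abstract: it assumes the spanning branching is given as a plain parent map, and the names `delta_encode` and `query_abundance` are hypothetical. Each node stores only the difference from its parent's abundance, and a query sums the deltas on the path to the root. The actual structure presumably avoids walking the whole path by using compressed partial sums.

```python
def delta_encode(parent, abundance):
    """Store each node's abundance as a difference from its parent's abundance.

    parent maps node -> parent node (None for a root).
    """
    deltas = {}
    for node, count in abundance.items():
        p = parent.get(node)
        deltas[node] = count if p is None else count - abundance[p]
    return deltas

def query_abundance(node, parent, deltas):
    """Recover an abundance as the partial sum of deltas up to the root."""
    total = 0
    while node is not None:
        total += deltas[node]
        node = parent.get(node)
    return total

# Toy spanning tree over four k-mers; neighbouring k-mers have similar counts.
parent = {"ACG": None, "CGT": "ACG", "GTA": "CGT", "CGA": "ACG"}
abundance = {"ACG": 16, "CGT": 16, "GTA": 15, "CGA": 2}
deltas = delta_encode(parent, abundance)
print(deltas)
print(query_abundance("GTA", parent, deltas))
```

Because adjacent k-mers in a read set usually have nearly equal abundances, most deltas are close to zero, which is what makes them cheap to encode.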

    IMPROVING GENOME ANNOTATION WITH RNA-SEQ DATA

    With the advent of next-generation sequencing, researchers can now investigate the genomes of species and individuals in unprecedented detail. Each part of the genome has its own function. Annotation is the process of identifying the parts and their functions. Deep RNA sequencing (RNA-seq) emerged as a revolutionary technology for transcriptome analysis, now widely used to annotate genes. Our transcript assemblers, CLASS and CLASS2, were designed to better detect alternative splicing events and to find new transcripts from RNA-seq data. With sequencing costs dropping, experiments now routinely include multiple RNA-seq samples, to improve the power of statistical analyses. We took advantage of the power of multiple samples in the software PsiCLASS. PsiCLASS simultaneously assembles multiple RNA-seq samples, which significantly improves performance over the traditional ‘assemble-and-merge’ model. For many alignment and assembly applications, sequencing errors can confound downstream analyses. We implemented two k-mer-based error correctors, Lighter and Rcorrector, for whole genome sequencing data and for RNA-seq data, respectively. Lighter was the first k-mer-based error corrector without counting and is much faster and more memory-efficient than other error correctors while having comparable accuracy. Rcorrector searches for a path in the de Bruijn graph that is closest to the current read, using local k-mer thresholds to determine trusted k-mers. Rcorrector measurably improves de novo assembled transcripts, which is critical in annotating species without a high-quality reference genome. A newly assembled genome is typically highly fragmented, which makes it difficult to annotate. Contiguity information from paired-end RNA-seq reads can be used to connect multiple disparate pieces of the gene. We implemented this principle in Rascaf, a tool for assembly scaffolding with RNA-seq read alignments. 
Rascaf is highly practical, and has improved sensitivity and precision compared to traditional approaches using de novo assembled transcripts. Overall, the collection of algorithms, methods and tools represents a powerful and valuable resource that can be readily and effectively used in any genome sequencing and annotation project and for a vast array of transcriptomic analyses. Thesis committee members: Dr. Liliana Florea, Johns Hopkins University School of Medicine; Dr. Ben Langmead, Johns Hopkins University; Dr. Sarven Sabunciyan, Johns Hopkins University School of Medicin

    Genomic and Transcriptomic Studies on Non-Model Organisms

    As advances in high-throughput sequencing enable the generation of large volumes of genomic information, researchers gain the opportunity to study non-model organisms even in the absence of a fully sequenced genome. This progress calls for powerful sequence assembly algorithms, as these technologies also raise challenging assembly problems: (1) some RNA products are highly expressed while others have much lower expression levels; (2) data cannot easily be represented as a linear structure, owing to post-transcriptional modifications such as alternative splicing; (3) conserved sequences in domains of gene families can result in assembly errors; and (4) sequencing errors arise from technical limitations. Useful assembly algorithms are required to overcome these difficulties. In these studies, there is often a need to identify transcripts in non-model organisms that are similar to transcripts found in related organisms. The traditional approach to this problem is to perform de novo transcriptome assemblies to obtain predicted transcripts for these organisms and then employ similarity comparison algorithms to identify them. I observe that it is possible to obtain a more complete set of similar transcripts from transcriptome assembly by making use of evolutionary information. I apply new algorithms to study non-model organisms which play an important role in applied biology. Moreover, improvements in sequencing technologies and the application of current algorithms also help to study interkingdom signals between blow flies and the bacterial community. With current computational tools, I annotate the genomes of Proteus mirabilis and Providencia stuartii, which play an important role in bacteria-insect interaction. The study shows significant features of these isolated strains, which provides useful information for developing and testing hypotheses about related interactions between insects and bacteria

    An island-based approach for RNA-SEQ differential expression analysis.

    High-throughput mRNA sequencing (also known as RNA-Seq) promises to be the technique of choice for studying transcriptome profiles, offering several advantages over old techniques such as microarrays. This technique provides the ability to develop precise methodologies for a variety of RNA-Seq applications including gene expression quantification, novel transcript and exon discovery, differential expression (DE) and splice variant detection. The detection of significantly changing features (e.g. genes, transcript isoforms, exons) in expression across biological samples is a primary application of RNA-Seq. Uncovering which features are significantly differentially expressed between samples can provide insight into their functions. One major limitation with the majority of recently developed methods for RNA-Seq differential expression is the dependency on annotated biological features to detect expression differences across samples. This forces the identification of expression levels and the detection of significant changes to known genomic regions. Thus, any significant changes occurring in unannotated regions will not be captured. To overcome this limitation, we developed a novel segmentation approach, Island-Based (IBSeq), for analyzing differential expression in RNA-Seq and targeted sequencing (exome capture) data without specific knowledge of an isoform. IBSeq segmentation determines individual islands of expression based on windowed read counts that can be compared across experimental conditions to determine differential island expression. In order to detect differentially expressed features, the significance of DE islands corresponding to each feature are combined using combined p-value methods. We evaluated the performance of our approach by comparing it to a number of existing gene DE methods using several benchmark MAQC RNA-Seq datasets. 
Using the area under ROC curve (auROC) as a performance metric, results show that IBSeq clearly outperforms all other methods compared
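The two steps described in this abstract, finding islands from windowed read counts and combining island-level p-values per feature, might be sketched as follows. This is not the IBSeq implementation. The `min_count` threshold, the function names, and the choice of Fisher's method as the combined p-value technique are assumptions for illustration; the abstract does not say which combination method is used.

```python
import math

def find_islands(window_counts, min_count=1):
    """Return (start, end) index pairs of maximal runs of windows with
    count >= min_count; end is exclusive."""
    islands, start = [], None
    for i, count in enumerate(window_counts):
        if count >= min_count:
            if start is None:
                start = i
        elif start is not None:
            islands.append((start, i))
            start = None
    if start is not None:
        islands.append((start, len(window_counts)))
    return islands

def fisher_combine(p_values):
    """Combine independent p-values (each in (0, 1]) with Fisher's method.

    The statistic x = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom, whose survival function has the closed form
    exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    half_x = -sum(math.log(p) for p in p_values)
    term, total = 1.0, 0.0
    for i in range(len(p_values)):
        total += term
        term *= half_x / (i + 1)
    return math.exp(-half_x) * total

print(find_islands([0, 3, 5, 0, 0, 2, 0]))
print(fisher_combine([0.05, 0.05]))
```

Fisher's method assumes the island-level tests are independent, which may not hold for adjacent islands of the same gene.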