1,588 research outputs found

    NOVEL COMPUTATIONAL METHODS FOR SEQUENCING DATA ANALYSIS: MAPPING, QUERY, AND CLASSIFICATION

    Over the past decade, the evolution of next-generation sequencing technology has considerably advanced genomics research. As a consequence, fast and accurate computational methods are needed for analyzing the large volumes of data in different applications. The research presented in this dissertation focuses on three areas: RNA-seq read mapping, large-scale data query, and metagenomics sequence classification. A critical step of RNA-seq data analysis is to map the RNA-seq reads onto a reference genome. This dissertation presents a novel splice alignment tool, MapSplice3. It achieves high read alignment and base mapping yields and is able to detect splice junctions, gene fusions, and circular RNAs comprehensively at the same time. Based on MapSplice3, we further develop a novel lightweight approach called iMapSplice that enables personalized mRNA transcriptional profiling. As a huge amount of RNA-seq data has been shared through public datasets, it provides an invaluable resource for researchers to test hypotheses by reusing existing datasets. To meet the need for efficiently querying large-scale sequencing data, a novel method called SeqOthello has been developed. It efficiently queries sequence k-mers against large-scale datasets and then determines whether the given sequence is present. Metagenomics studies often generate tens of millions of reads to capture the presence of microbial organisms. Thus, efficient and accurate algorithms are in high demand. In this dissertation, we introduce MetaOthello, a probabilistic hashing classifier for metagenomic sequences. It supports efficient query of a taxon using its k-mer signatures
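
The k-mer query idea in the abstract above can be illustrated with a minimal sketch. It assumes a plain Python dictionary of sets in place of the Othello hashing structure the dissertation describes, and the names `kmers`, `build_index`, `query`, and the 0.8 threshold are hypothetical choices for illustration only. A query sequence is reported as present in a sample when a large enough fraction of its k-mers occurs in that sample.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(samples, k):
    """Map each k-mer to the set of sample names whose reads contain it.

    samples: dict of sample name -> list of read strings (hypothetical input).
    """
    index = defaultdict(set)
    for name, reads in samples.items():
        for read in reads:
            for kmer in kmers(read, k):
                index[kmer].add(name)
    return index


def query(index, sequence, k, threshold=0.8):
    """Return {sample: fraction of query k-mers found} for samples at or above threshold."""
    query_kmers = kmers(sequence, k)
    if not query_kmers:
        return {}
    counts = defaultdict(int)
    for kmer in query_kmers:
        # .get avoids creating empty entries in the defaultdict index
        for name in index.get(kmer, ()):
            counts[name] += 1
    total = len(query_kmers)
    return {name: c / total for name, c in counts.items() if c / total >= threshold}


samples = {"A": ["ACGTACGTGA"], "B": ["TTTTTTTTTT"]}
index = build_index(samples, 4)
print(query(index, "CGTACGTG", 4))
```

An exact set-based index like this grows with the number of distinct k-mers, which is why compact hashing schemes are attractive at the scale the abstract targets.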

    Accurate spliced alignment of long RNA sequencing reads

    Motivation: Long-read RNA sequencing technologies are establishing themselves as the primary techniques to detect novel isoforms, and many such analyses are dependent on read alignments. However, the error rate and sequencing length of the reads create new challenges for accurately aligning them, particularly around small exons. Results: We present an alignment method, uLTRA, for long RNA sequencing reads based on a novel two-pass collinear chaining algorithm. We show that uLTRA achieves higher accuracy than state-of-the-art aligners, with substantially higher accuracy for small exons on simulated and synthetic data. On simulated data, uLTRA achieves an accuracy of about 60% for exons of length 10 nucleotides or smaller and close to 90% accuracy for exons of length between 11 and 20 nucleotides. On biological data where the true read location is unknown, we show several examples where uLTRA aligns to known and novel isoforms containing small exons that are not detected with other aligners. While uLTRA obtains its accuracy using annotations, it can also be used as a wrapper around minimap2 to align reads outside annotated regions. Peer reviewed
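
As a rough illustration of collinear chaining in general, the sketch below selects the highest-scoring set of anchors that increase in both read and reference coordinates. It is a generic single-pass quadratic dynamic program without gap costs, not the two-pass algorithm of uLTRA, and the anchor format `(read_pos, ref_pos, length)` and the function name `chain_anchors` are assumptions made for this example.

```python
def chain_anchors(anchors):
    """Return (score, chain) for the best collinear chain of anchors.

    anchors: list of (read_pos, ref_pos, length) exact matches (assumed format).
    The score of a chain is the total matched length of its anchors.
    """
    anchors = sorted(anchors)  # sort by read position first
    n = len(anchors)
    if n == 0:
        return 0, []
    score = [a[2] for a in anchors]  # best chain score ending at each anchor
    prev = [-1] * n                  # back-pointers for chain recovery
    for j in range(n):
        qj, rj, lj = anchors[j]
        for i in range(j):
            qi, ri, li = anchors[i]
            # anchor i may precede j only if it ends before j starts
            # in both the read and the reference (no overlap)
            if qi + li <= qj and ri + li <= rj and score[i] + lj > score[j]:
                score[j] = score[i] + lj
                prev[j] = i
    best = max(range(n), key=score.__getitem__)
    total = score[best]
    chain = []
    while best != -1:
        chain.append(anchors[best])
        best = prev[best]
    chain.reverse()
    return total, chain


anchors = [(0, 100, 10), (12, 500, 8), (12, 115, 8), (25, 130, 5)]
print(chain_anchors(anchors))
```

In this toy input the anchor at reference position 500 is not collinear with the anchor that follows it, so it is left out of the chain. Short anchors such as those from small exons contribute little score, which hints at why they are hard to place reliably.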

    Context-based RNA-seq mapping

    In recent years, the sequencing of RNA (RNA-seq) using next generation sequencing (NGS) technology has become a powerful tool for analyzing the transcriptomic state of a cell. Modern NGS platforms allow for performing RNA-seq experiments in a few days, resulting in millions of short sequencing reads. A crucial step in analyzing RNA-seq data is determining the transcriptomic origin of the sequencing reads (= read mapping). In principle, read mapping is a sequence alignment problem, in which the short sequencing reads (30 - 500 nucleotides) are aligned to much larger reference sequences such as the human genome (3 billion nucleotides). In this thesis, we present ContextMap, an RNA-seq mapping approach that evaluates the context of the sequencing reads for determining the most likely origin of every read. The context of a sequencing read is defined by all other reads aligned to the same genomic region. The ContextMap project started with a proof of concept study, in which we showed that our approach is able to improve already existing read mapping results provided by other mapping programs. Subsequently, we developed a standalone version of ContextMap. This implementation no longer relied on mapping results of other programs, but determined initial alignments itself using a modification of the Bowtie short read alignment program. However, the original ContextMap implementation had several drawbacks. In particular, it was not able to predict reads spanning more than two exons or to detect insertions or deletions (indels). Furthermore, ContextMap depended on a modification of a specific Bowtie version. Thus, it could benefit neither from Bowtie updates nor from novel developments (e.g. improved running times) in the area of short read alignment software. To address these problems, we developed ContextMap 2, an extension of the original ContextMap algorithm. 
The key features of ContextMap 2 are the context-based resolution of ambiguous read alignments and the accurate detection of reads crossing an arbitrary number of exon-exon junctions or containing indels. Furthermore, a plug-in interface is provided that allows for the easy integration of alternative short read alignment programs (e.g. Bowtie 2 or BWA) into the mapping workflow. The performance of ContextMap 2 was evaluated on real-life as well as synthetic data and compared to other state-of-the-art mapping programs. We found that ContextMap 2 had very low rates of misplaced reads and incorrectly predicted junctions or indels. Additionally, recall values were as high as for the top competing methods. Moreover, the runtime of ContextMap 2 was at least twofold lower than for the best competitors. In addition to the mapping of sequencing reads to a single reference, the ContextMap approach allows the investigation of several potential read sources (e.g. the human host and infecting pathogens) in parallel. Thus, ContextMap can be applied to mine for infections or contaminations or to map data from meta-transcriptomic studies. Furthermore, we developed methods based on mapping-derived statistics that make it possible to assess the confidence of mappings to identified species and to detect false positive hits. ContextMap was evaluated on three real-life data sets and results were compared to metagenomics tools. Here, we showed that ContextMap can successfully identify the species contained in a sample. Moreover, in contrast to most other metagenomics approaches, ContextMap also provides read mapping results to individual species. As a consequence, read mapping results determined by ContextMap can be used to study the gene expression of all species contained in a sample at the same time. Thus, ContextMap might be applied in clinical studies, in which the influence of infecting agents on host organisms is investigated. 
The methods presented in this thesis allow for accurate and fast mapping of RNA-seq data. As the amount of available sequencing data increases constantly, these methods will likely become an important part of many RNA-seq data analyses and thus contribute valuably to research in the field of transcriptomics

    Post-Transcriptional Regulation In The Drosophila Sex Determination Pathway

    Sexually reproducing organisms produce two very different phenotypes (males and females) by differential deployment of essentially the same gene content. This dimorphism provides an excellent model to study how transcriptomes are differentially regulated, which is one of the central problems of biology. The core sex determination pathway of Drosophila is a well described cascade of transcriptional and post-transcriptional regulation, but knowledge of the downstream components is largely incomplete. High throughput technologies have provided great advances in understanding transcriptome regulation, but limits of the technology have led to a focus on whole gene expression measurements, rather than post-transcriptional regulation. RNA-Seq experiments, in which transcripts are converted to cDNA and sequenced, allow the resolution and quantification of alternative transcript isoforms, potentially elucidating the post-transcriptional network. However, methods to analyze splicing are underdeveloped, and challenges in transcript assembly and quantification remain unresolved. This work describes the development of the Splicing Analysis Kit (Spanki) as a fast, open-source suite of tools that uses simulations based on real RNA-Seq data to characterize errors in a given dataset, and user-tunable filters that minimize those errors. Spanki quantifies splicing differences in transcripts from the same loci within a sample, as well as between samples, by using only those reads that directly assay splicing events (junction-spanning reads). Despite the reliance on a fraction of the total data, the sequencing depth typically generated in an RNA-Seq experiment is sufficient to identify differentially regulated splicing, and error profiles are superior. I demonstrate that this computational approach outperforms several commonly used approaches in an analysis of sex-differential splicing in Drosophila heads. 
Next, I examine the effects of disrupting post-transcriptional regulation in Drosophila heads. I apply the Spanki software to analyze RNA-Seq data for mutant lines of two post-transcriptional regulators: Darkener of apricot (Doa) and found in neurons (fne). Doa, a serine-threonine kinase, regulates splicing by phosphorylating SR proteins, vital components of the splicing machinery. Found in neurons (fne) binds to transcripts and is involved in RNA metabolism. I demonstrate sex differences in response to disruption of post-transcriptional regulation, and hypothesize that they are informative of sex-differentiation pathways. Finally, I examine the conservation of splicing regulation within the Drosophila lineage. I show that junction-based splicing analysis is effective in making interspecific comparisons without the need for complete transcript models. I use these results to demonstrate the conservation of sex-differential splicing across 40 million years of evolution in 15 species in the Drosophila genus
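
To make the idea of quantifying splicing from junction-spanning reads concrete, here is a minimal sketch of a percent-spliced-in (PSI) estimate for a cassette exon. PSI is a widely used junction-count metric, but the abstract does not say that Spanki computes exactly this quantity, so the formula, the function name `percent_spliced_in`, and the read counts are assumptions for illustration.

```python
def percent_spliced_in(upstream_inclusion, downstream_inclusion, skipping):
    """Estimate PSI for a cassette exon from junction-spanning read counts.

    upstream_inclusion / downstream_inclusion: reads spanning the two
    junctions that include the exon; skipping: reads spanning the junction
    that skips it. Returns None when no junction reads are available.
    """
    # average the two inclusion junctions so inclusion and skipping
    # are each represented by one junction's worth of evidence
    inclusion = (upstream_inclusion + downstream_inclusion) / 2
    total = inclusion + skipping
    if total == 0:
        return None
    return inclusion / total


# hypothetical counts for the same exon in two samples
print(percent_spliced_in(30, 50, 10))
print(percent_spliced_in(5, 5, 40))
```

Comparing such estimates between samples requires only the junction coordinates, not complete transcript models, which fits the interspecific comparison described above.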

    Data structures and algorithms for analysis of alternative splicing with RNA-Seq data
