
    Methods to study splicing from high-throughput RNA Sequencing data

    The development of novel high-throughput sequencing (HTS) methods for RNA (RNA-Seq) has provided a very powerful means to study splicing under multiple conditions at unprecedented depth. However, the complexity of the information to be analyzed has turned this into a challenging task. In the last few years, a plethora of tools have been developed, allowing researchers to process RNA-Seq data to study the expression of isoforms and splicing events, and their relative changes under different conditions. We provide an overview of the methods available to study splicing from short RNA-Seq data. We group the methods according to the different questions they address: 1) Assignment of the sequencing reads to their likely gene of origin. This is addressed by methods that map reads to the genome and/or to the available gene annotations. 2) Recovering the sequence of splicing events and isoforms. This is addressed by transcript reconstruction and de novo assembly methods. 3) Quantification of events and isoforms. Either after reconstructing transcripts or using an annotation, many methods estimate the expression level or the relative usage of isoforms and/or events. 4) Providing an isoform or event view of differential splicing or expression. These include methods that compare relative event/isoform abundance or isoform expression across two or more conditions. 5) Visualizing splicing regulation. Various tools facilitate the visualization of the RNA-Seq data in the context of alternative splicing. In this review, we do not describe the specific mathematical models behind each method. Our aim is rather to provide an overview that could serve as an entry point for users who need to decide on a suitable tool for a specific analysis. We also attempt to propose a classification of the tools according to the operations they perform, to facilitate the comparison and choice of methods. Comment: 31 pages, 1 figure, 9 tables. Small corrections adde

    NOVEL COMPUTATIONAL METHODS FOR SEQUENCING DATA ANALYSIS: MAPPING, QUERY, AND CLASSIFICATION

    Over the past decade, the evolution of next-generation sequencing technology has considerably advanced genomics research. As a consequence, fast and accurate computational methods are needed for analyzing large volumes of data in different applications. The research presented in this dissertation focuses on three areas: RNA-seq read mapping, large-scale data query, and metagenomics sequence classification. A critical step of RNA-seq data analysis is to map the RNA-seq reads onto a reference genome. This dissertation presents a novel splice alignment tool, MapSplice3. It achieves high read alignment and base mapping yields and is able to detect splice junctions, gene fusions, and circular RNAs comprehensively at the same time. Building on MapSplice3, we further develop a novel lightweight approach called iMapSplice that enables personalized mRNA transcriptional profiling. As a huge amount of RNA-seq data has been shared through public datasets, it provides an invaluable resource for researchers to test hypotheses by reusing existing datasets. To meet the need for efficiently querying large-scale sequencing data, a novel method, called SeqOthello, has been developed. It is able to efficiently query sequence k-mers against large-scale datasets and ultimately determine whether the given sequence is present. Metagenomics studies often generate tens of millions of reads to capture the presence of microbial organisms. Thus efficient and accurate algorithms are in high demand. In this dissertation, we introduce MetaOthello, a probabilistic hashing classifier for metagenomic sequences. It supports efficient query of a taxon using its k-mer signatures
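
The idea of querying a sequence by its k-mers can be illustrated with a minimal Python sketch. It assumes a plain in-memory set as the index, which is not the Othello hashing structure that SeqOthello itself uses; the function names `kmers`, `build_index` and `kmer_hit_fraction` and the toy sequences are hypothetical and chosen only for illustration.

```python
def kmers(seq, k):
    """Yield every overlapping substring of length k from seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(sequences, k):
    """Collect all k-mers seen in a collection of sequences into a set.

    A plain set is an assumption made for this sketch; a real large-scale
    index would use a far more compact structure.
    """
    index = set()
    for seq in sequences:
        index.update(kmers(seq, k))
    return index


def kmer_hit_fraction(query, index, k):
    """Return the fraction of the query's k-mers that occur in the index."""
    query_kmers = list(kmers(query, k))
    if not query_kmers:
        # The query is shorter than k, so there is nothing to look up.
        return 0.0
    hits = sum(1 for km in query_kmers if km in index)
    return hits / len(query_kmers)


# Toy data, invented for the example.
index = build_index(["ACGTACGTGA", "TTGACCA"], k=4)
print(kmer_hit_fraction("ACGTACG", index, k=4))
print(kmer_hit_fraction("GGGGGGG", index, k=4))
```

A query whose k-mers are all found in the index is likely to be present in the indexed data, while a low hit fraction suggests absence; where to place the threshold is a design choice that this sketch leaves open.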

    RNA-Seq analysis of splicing in Plasmodium falciparum uncovers new splice junctions, alternative splicing and splicing of antisense transcripts.

    Over 50% of genes in Plasmodium falciparum, the deadliest human malaria parasite, contain predicted introns, yet experimental characterization of splicing in this organism remains incomplete. We present here a transcriptome-wide characterization of intraerythrocytic splicing events, as captured by RNA-Seq data from four timepoints of a single highly synchronous culture. Gene model-independent analysis of these data in conjunction with publicly available RNA-Seq data with HMMSplicer, an in-house developed splice site detection algorithm, revealed a total of 977 new 5' GU-AG 3' and 5 new 5' GC-AG 3' junctions absent from gene models and ESTs (11% increase to the current annotation). In addition, 310 alternative splicing events were detected in 254 (4.5%) genes, most of which truncate open reading frames. Splicing events antisense to gene models were also detected, revealing complex transcriptional arrangements within the parasite's transcriptome. Interestingly, antisense introns overlap sense introns more than would be expected by chance, perhaps indicating a functional relationship between overlapping transcripts or an inherent organizational property of the transcriptome. Independent experimental validation confirmed over 30 new antisense and alternative junctions. Thus, this largest assemblage of new and alternative splicing events to date in Plasmodium falciparum provides a more precise, dynamic view of the parasite's transcriptome
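
The GU-AG and GC-AG labels refer to the first and last two bases of an intron. As a rough illustration, the Python sketch below classifies a candidate intron by those terminal dinucleotides. It is not HMMSplicer's algorithm; the function name `classify_intron`, the coordinate convention (0-based, end-exclusive, on the transcribed strand) and the toy sequences are assumptions made for the example. In DNA the RNA motif GU appears as GT.

```python
def classify_intron(genome, start, end):
    """Classify an intron by its terminal dinucleotides.

    genome: DNA string read along the transcribed strand (assumption).
    start, end: 0-based, end-exclusive coordinates of the intron (assumption).
    """
    intron = genome[start:end].upper()
    if len(intron) < 4:
        # Too short to carry distinct donor and acceptor dinucleotides.
        return "too short"
    donor, acceptor = intron[:2], intron[-2:]
    if (donor, acceptor) == ("GT", "AG"):
        return "GU-AG"
    if (donor, acceptor) == ("GC", "AG"):
        return "GC-AG"
    return "other"


# Toy sequences, invented for the example; the intron spans positions 3..14.
print(classify_intron("ATGGTAAGTTTTCAGGCC", 3, 15))
print(classify_intron("ATGGCAAGTTTTCAGGCC", 3, 15))
```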

    Global and unbiased detection of splice junctions from RNA-seq data

    SplitSeek can be used to detect novel splicing events in SOLiD RNA-seq data without the need for a pre-defined library

    RNA-Sequencing analysis from the triceps muscle of normal and myostatin-deficient mice using various tools

    RNA-Sequencing technologies are being used to determine single nucleotide polymorphisms, insertions, deletions and gene expression. The purpose of this study was to analyze the effect of myostatin in the triceps muscles of mice using 65-base single-end RNA-Sequencing data from the Illumina platform. Another aim was to analyze alternative splicing events for differentially expressed genes in the above data. Finally, commercially available and open source software packages were compared for their splice junction detection abilities. CASAVA was used for determining the exon, gene and splice junction counts. Partek Genomic Suite was used to perform a two-way analysis of variance followed by the identification of differentially expressed genes. The splicing events were identified using the software packages CASAVA, TopHat, MapSplice and SpliceMap. The results of splice junction detection were viewed in the UCSC genome browser. The performance and features of the above software were compared. The results revealed that myostatin deficiency significantly alters gene expression. This study provides an unbiased view of commercial and open source RNA-Sequencing software using a very significant dataset. The results show that a preliminary inspection for alternative splicing can be performed; however, currently no single software package can fully analyze the RNA-Seq data, and complementary software is needed to complete the analysis. The results of this study would benefit researchers in choosing the right software for their purposes considering available resources such as time, manpower and money

    iMapSplice: Alleviating Reference Bias Through Personalized RNA-seq Alignment

    Genomic variants in both coding and non-coding sequences can have functionally important and sometimes deleterious effects on exon splicing of gene transcripts. For transcriptome profiling using RNA-seq, the accurate alignment of reads across exon junctions is a critical step. Existing algorithms that utilize a standard reference genome as a template sometimes have difficulty in mapping reads that carry genomic variants. These problems can lead to allelic ratio biases and the failure to detect splice variants created by splice site polymorphisms. To improve RNA-seq read alignment, we have developed a novel approach called iMapSplice that enables personalized mRNA transcriptome profiling. The algorithm makes use of personal genomic information and performs an unbiased alignment towards genome indices carrying both reference and alternative bases. Importantly, this breaks the dependency on reference genome splice site dinucleotide motifs and enables iMapSplice to discover personal splice junctions created through splice site polymorphisms. We report comparative analyses using a number of simulated and real datasets. Besides general improvements in read alignment and splice junction discovery, iMapSplice greatly alleviates allelic ratio biases and unravels many previously uncharacterized splice junctions created by splice site polymorphisms, with minimal overhead in computation time and storage. Software download URL: https://github.com/LiuBioinfo/iMapSplice

    From RNA-seq reads to differential expression results

    Many methods and tools are available for preprocessing high-throughput RNA sequencing data and detecting differential expression

    A comprehensive evaluation of alignment algorithms in the context of RNA-seq.

    Transcriptome sequencing (RNA-Seq) overcomes limitations of previously used RNA quantification methods and provides one experimental framework for both high-throughput characterization and quantification of transcripts at the nucleotide level. The first step and a major challenge in the analysis of such experiments is the mapping of sequencing reads to a transcriptomic origin including the identification of splicing events. In recent years, a large number of such mapping algorithms have been developed, all of which have in common that they require algorithms for aligning a vast number of reads to genomic or transcriptomic sequences. Although the FM-index based aligner Bowtie has become a de facto standard within mapping pipelines, a much larger number of possible alignment algorithms have been developed also including other variants of FM-index based aligners. Accordingly, developers and users of RNA-seq mapping pipelines have the choice among a large number of available alignment algorithms. To provide guidance in the choice of alignment algorithms for these purposes, we evaluated the performance of 14 widely used alignment programs from three different algorithmic classes: algorithms using either hashing of the reference transcriptome, hashing of reads, or a compressed FM-index representation of the genome. Here, special emphasis was placed on both precision and recall and the performance for different read lengths and numbers of mismatches and indels in a read. Our results clearly showed the significant reduction in memory footprint and runtime provided by FM-index based aligners at a precision and recall comparable to the best hash table based aligners. Furthermore, the recently developed Bowtie 2 alignment algorithm shows a remarkable tolerance to both sequencing errors and indels, thus, essentially making hash-based aligners obsolete
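
Precision and recall for read alignment can be computed in several ways, and the abstract does not state the exact definitions used in the evaluation. The Python sketch below shows one common convention for simulated reads with known origins: precision is the share of reported alignments that are correct, and recall is the share of all reads that were aligned correctly. The function name `precision_recall`, the exact-position matching rule and the toy data are assumptions made for illustration.

```python
def precision_recall(true_positions, reported_positions):
    """Compute precision and recall of read alignments.

    true_positions: dict mapping read id -> true genomic position.
    reported_positions: dict mapping read id -> position reported by the
        aligner; reads missing from this dict are treated as unmapped.
    An alignment counts as correct only on an exact position match
    (an assumption; evaluations often allow a small tolerance).
    """
    correct = sum(
        1
        for read_id, pos in reported_positions.items()
        if true_positions.get(read_id) == pos
    )
    precision = correct / len(reported_positions) if reported_positions else 0.0
    recall = correct / len(true_positions) if true_positions else 0.0
    return precision, recall


# Toy data, invented for the example: four simulated reads, three aligned.
truth = {"r1": 100, "r2": 250, "r3": 400, "r4": 900}
reported = {"r1": 100, "r2": 250, "r3": 410}
precision, recall = precision_recall(truth, reported)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Here two of the three reported alignments are correct, and two of the four reads are recovered, so an aligner that leaves difficult reads unmapped can score high precision while losing recall.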

    Acfs: accurate circRNA identification and quantification from RNA-Seq data

    Circular RNAs (circRNAs) are a group of single-stranded RNAs in closed circular form. They are splicing-generated, widely expressed in various tissues and have functional implications in development and diseases. To facilitate genome-wide characterization of circRNAs using RNA-Seq data, we present a freely available software package named acfs. Acfs allows de novo, accurate and fast identification and abundance quantification of circRNAs from single- and paired-ended RNA-Seq data. On simulated datasets, acfs achieved the highest F1 accuracy and lowest false discovery rate among current state-of-the-art tools. On real-world datasets, acfs efficiently identified more bona fide circRNAs. Furthermore, we demonstrated the power of circRNA analysis on two leukemia datasets. We identified a set of circRNAs that are differentially expressed between AML and APL samples, which might shed light on the potential molecular classification of complex diseases using circRNA profiles. Moreover, chromosomal translocation, as manifested in numerous diseases, could produce not only fusion transcripts but also fusion circRNAs of clinical relevance. Featured with high accuracy, low FDR and the ability to identify fusion circRNAs, we believe that acfs is well suited for a wide spectrum of applications in characterizing the landscape of circRNAs from non-model organisms to cancer biology