
    An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs.

    Reconstructing full-length transcript isoforms from sequence fragments (such as ESTs) is a major interest and challenge for bioinformatic analysis of pre-mRNA alternative splicing. This problem has been formulated as finding traversals across the splice graph, which is a directed acyclic graph (DAG) representation of gene structure and alternative splicing. In this manuscript we introduce a probabilistic formulation of the isoform reconstruction problem, and provide an expectation-maximization (EM) algorithm for its maximum likelihood solution. Using a series of simulated data and expressed sequences from real human genes, we demonstrate that our EM algorithm can correctly handle various situations of fragmentation and coupling in the input data. Our work establishes a general probabilistic framework for splice graph-based reconstructions of full-length isoforms
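The idea of estimating isoform probabilities from fragments on a splice graph can be illustrated with a minimal sketch. It is not the authors' model: it assumes every source-to-sink path of the DAG is a candidate isoform, that a fragment is compatible with an isoform when it is a contiguous sub-path, and that fragment length and position effects can be ignored. The function names (`enumerate_paths`, `is_subpath`, `em_isoform_probs`) and the toy graph are hypothetical.

```python
from collections import Counter


def enumerate_paths(graph, source, sink):
    """Return all source-to-sink paths of a DAG given as {node: [successors]}."""
    paths = []
    stack = [(source, (source,))]
    while stack:
        node, path = stack.pop()
        if node == sink:
            paths.append(path)
            continue
        for nxt in graph.get(node, []):
            stack.append((nxt, path + (nxt,)))
    return sorted(paths)


def is_subpath(fragment, isoform):
    """True if the fragment's exons appear contiguously in the isoform."""
    n = len(fragment)
    return any(isoform[i:i + n] == fragment
               for i in range(len(isoform) - n + 1))


def em_isoform_probs(isoforms, fragments, n_iter=200):
    """Maximum-likelihood isoform probabilities under the simplified model."""
    counts = Counter(fragments)
    theta = {iso: 1.0 / len(isoforms) for iso in isoforms}
    for _ in range(n_iter):
        expected = dict.fromkeys(isoforms, 0.0)
        # E-step: split each fragment among its compatible isoforms
        # in proportion to the current probability estimates.
        for frag, c in counts.items():
            compat = [iso for iso in isoforms if is_subpath(frag, iso)]
            norm = sum(theta[iso] for iso in compat)
            if norm == 0:
                continue  # fragment explained by no candidate isoform
            for iso in compat:
                expected[iso] += c * theta[iso] / norm
        total = sum(expected.values())
        if total == 0:
            raise ValueError("no fragment is compatible with any isoform")
        # M-step: renormalize the expected counts into probabilities.
        theta = {iso: expected[iso] / total for iso in isoforms}
    return theta


# Toy splice graph: exon A, then either B or C, then exon D.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
isoforms = enumerate_paths(graph, "A", "D")
fragments = [("A", "B")] * 3 + [("C", "D")] + [("A",)] * 4
theta = em_isoform_probs(isoforms, fragments)
for iso, p in theta.items():
    print("-".join(iso), round(p, 3))
```

In this toy input the four fragments covering only exon A are ambiguous, and the EM iterations distribute them according to the evidence from the unambiguous fragments, converging to 0.75 for A-B-D and 0.25 for A-C-D.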

    Methods to study splicing from high-throughput RNA Sequencing data

    The development of novel high-throughput sequencing (HTS) methods for RNA (RNA-Seq) has provided a very powerful means to study splicing under multiple conditions at unprecedented depth. However, the complexity of the information to be analyzed has turned this into a challenging task. In the last few years, a plethora of tools have been developed, allowing researchers to process RNA-Seq data to study the expression of isoforms and splicing events, and their relative changes under different conditions. We provide an overview of the methods available to study splicing from short RNA-Seq data. We group the methods according to the different questions they address: 1) Assignment of the sequencing reads to their likely gene of origin. This is addressed by methods that map reads to the genome and/or to the available gene annotations. 2) Recovering the sequence of splicing events and isoforms. This is addressed by transcript reconstruction and de novo assembly methods. 3) Quantification of events and isoforms. Either after reconstructing transcripts or using an annotation, many methods estimate the expression level or the relative usage of isoforms and/or events. 4) Providing an isoform or event view of differential splicing or expression. These include methods that compare relative event/isoform abundance or isoform expression across two or more conditions. 5) Visualizing splicing regulation. Various tools facilitate the visualization of the RNA-Seq data in the context of alternative splicing. In this review, we do not describe the specific mathematical models behind each method. Our aim is rather to provide an overview that could serve as an entry point for users who need to decide on a suitable tool for a specific analysis. We also attempt to propose a classification of the tools according to the operations they perform, to facilitate the comparison and choice of methods.
    Comment: 31 pages, 1 figure, 9 tables. Small corrections adde

    Data structures and algorithms for analysis of alternative splicing with RNA-Seq data


    ASPIC: a novel method to predict the exon-intron structure of a gene that is optimally compatible to a set of transcript sequences

    BACKGROUND: Currently available methods to predict splice sites are mainly based on the independent and progressive alignment of transcript data (mostly ESTs) to the genomic sequence. Apart from often being computationally expensive, this approach is vulnerable to several problems; hence the need to develop novel strategies. RESULTS: We propose a method, based on a novel multiple genome-EST alignment algorithm, for the detection of splice sites. To avoid the limitations of splice site prediction (mainly over-prediction) that arise from independent single EST alignments to the genomic sequence, our approach performs a multiple alignment of transcript data to the genomic sequence based on the combined analysis of all available data. We recast the problem of predicting constitutive and alternative splicing as an optimization problem, where the optimal multiple transcript alignment minimizes the number of exons and hence of splice site observations. We have implemented a splice site predictor based on this algorithm in the software tool ASPIC (Alternative Splicing PredICtion). It is distinguished from other methods based on BLAST-like tools by the incorporation of entirely new ad hoc procedures for accurate and computationally efficient transcript alignment, and it adopts dynamic programming for the refinement of intron boundaries. ASPIC also provides the minimal set of non-mergeable transcript isoforms compatible with the detected splicing events. The ASPIC web resource is dynamically interconnected with the Ensembl and Unigene databases and also implements an upload facility. CONCLUSION: Extensive benchmarking shows that ASPIC outperforms other existing methods in the detection of novel splicing isoforms and in the minimization of over-predictions. ASPIC also requires a lower computation time for processing a single gene and an EST cluster. The ASPIC web resource is available at

    ASPIC: a web resource for alternative splicing prediction and transcript isoforms characterization

    Alternative splicing (AS) is now emerging as a major mechanism contributing to the expansion of the transcriptome and proteome complexity of multicellular organisms. The fact that a single gene locus may give rise to multiple mRNAs and protein isoforms, showing both major and subtle structural variations, is an exceptionally versatile tool in the optimization of the coding capacity of the eukaryotic genome. The huge and continuously increasing number of genome and transcript sequences provides an essential information source for the computational detection of gene AS patterns. However, much of this information is not optimally or comprehensively used in gene annotation by current genome annotation pipelines. We present here a web resource implementing the ASPIC algorithm, which we developed previously, for the investigation of AS of user-submitted genes, based on comparative analysis of available transcript and genome data from a variety of species. The ASPIC web resource provides graphical and tabular views of the splicing patterns of all full-length mRNA isoforms compatible with the detected splice sites of genes under investigation, as well as relevant structural and functional annotation. The ASPIC web resource, available at , is dynamically interconnected with the Ensembl and Unigene databases and also implements an upload facility.


    The advance of high-throughput sequencing technologies and their application to mRNA transcriptome sequencing (RNA-seq) have enabled comprehensive and unbiased profiling of the landscape of transcription in a cell. To address current limitations in the accuracy and scalability of transcriptome analysis, a novel computational framework has been developed for large-scale RNA-seq datasets with no dependence on transcript annotations. Directly from raw reads, a probabilistic approach is first applied to infer the best transcript fragment alignments from paired-end reads. Empowered by the identification of alternative splicing modules, this framework then performs precise and efficient differential analysis at automatically detected alternative splicing variants, which circumvents the need for full transcript reconstruction and quantification. Beyond the scope of classical group-wise analysis, a clustering scheme is further described for mining prominent consistency among samples in transcription, removing the restriction of presumed grouping. The performance of the framework has been demonstrated by a series of simulation studies and real datasets, including the Cancer Genome Atlas (TCGA) breast cancer analysis. These successful applications suggest an unprecedented opportunity to use differential transcription analysis to reveal variations in the mRNA transcriptome in response to cellular differentiation or the effects of disease.

    Visualization and analysis of RNA-Seq assembly graphs.

    RNA-Seq is a powerful transcriptome profiling technology enabling transcript discovery and quantification. Whilst most commonly used for gene-level quantification, the data can be used for the analysis of transcript isoforms. However, when the underlying transcript assemblies are complex, current visualization approaches can be limiting, with splicing events a challenge to interpret. Here, we report on the development of a graph-based visualization method as a complementary approach to understanding transcript diversity from short-read RNA-Seq data. Following the mapping of reads to a reference genome, a read-to-read comparison is performed on all reads mapping to a given gene, producing a weighted similarity matrix between reads. This is used to produce an RNA assembly graph, where nodes represent reads and edges similarity scores between them. The resulting graphs are visualized in 3D space to better appreciate their sometimes large and complex topology, with other information being overlaid on to nodes, e.g. transcript models. Here we demonstrate the utility of this approach, including the unusual structure of these graphs and how they can be used to identify issues in assembly, repetitive sequences within transcripts and splice variants. We believe this approach has the potential to significantly improve our understanding of transcript complexity
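The construction described above, where nodes are reads and weighted edges are similarity scores, can be sketched as follows. The abstract does not say how the read-to-read comparison is scored, so this sketch assumes a Jaccard index over shared k-mers as a stand-in; the names `kmers`, `jaccard` and `build_read_graph`, the parameters `k` and `min_score`, and the toy reads are all hypothetical.

```python
from itertools import combinations


def kmers(seq, k):
    """Set of all length-k substrings of a read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_read_graph(reads, k=4, min_score=0.2):
    """Weighted edge list {(read_u, read_v): score} for one gene's reads.

    Assumption: similarity is the Jaccard index of k-mer sets; a real
    pipeline would more likely use an alignment-based score.
    """
    profiles = {name: kmers(seq, k) for name, seq in reads.items()}
    edges = {}
    for u, v in combinations(sorted(reads), 2):
        score = jaccard(profiles[u], profiles[v])
        if score >= min_score:  # keep only sufficiently similar pairs
            edges[(u, v)] = score
    return edges


# Toy reads mapped to one gene: r1 and r2 overlap, r3 is unrelated.
reads = {
    "r1": "ACGTACGTAA",
    "r2": "CGTACGTAAC",
    "r3": "TTTTGGGGCC",
}
edges = build_read_graph(reads)
for (u, v), score in edges.items():
    print(u, v, round(score, 3))
```

The all-against-all comparison is quadratic in the number of reads, which is acceptable for a per-gene sketch but would need indexing or sampling for highly expressed genes.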

    Local assembly and pre-mRNA splicing analyses by high-throughput sequencing data

    Next generation sequencing (NGS) approaches have become one of the most widely used tools in biotechnology. With high-throughput sequencing, researchers can analyze non-model species at an unprecedentedly high resolution. NGS provides fast, deep and cheap sequencing solutions, and it has been used to answer various biological questions. In this thesis, I have developed a set of tools and used them to study several research topics. First, de novo whole-genome assembly is still a very challenging technical task. For eukaryotic genomes, de novo assembly typically requires computational resources with very large memory and fast processors. Instead of trying to assemble the whole genome as done in previous approaches, I focus on efficiently reconstructing the genomic regions related to the homologous protein or cDNA sequences. I have developed SRAssembler, a local assembly program using the iterative chromosome walking strategy to assemble the loci of interest directly. Second, I used high-throughput RNA sequencing (referred to as RNA-Seq) data to analyze different intron splicing models and their relative frequency of occurrence. The first mechanism I explored is the recursive splicing pattern in large introns. I have implemented a pipeline called RSSFinder, which can search for recursive sites confirmed by RNA-Seq data. My study suggests the prevalence of recursive splicing in different species. These predicted recursive sites can also be used to investigate certain diseases associated with abnormal splicing of transcripts. In addition, I have demonstrated the use of RNA-Seq data to decipher the detailed mechanisms involved in splicing and their relationship with transcription. Here I propose mathematical models to estimate the distribution of mRNA splicing intermediates. I evaluated my models with simulated data and an Arabidopsis thaliana dataset. My results indicate that co-transcriptional splicing is widespread in Arabidopsis thaliana.

    SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification

    High-throughput sequencing of full-length transcripts using long reads has paved the way for the discovery of thousands of novel transcripts, even in well-annotated mammalian species. The advances in sequencing technology have created a need for studies and tools that can characterize these novel variants. Here, we present SQANTI, an automated pipeline for the classification of long-read transcripts that can assess the quality of data and the preprocessing pipeline using 47 unique descriptors. We apply SQANTI to a neuronal mouse transcriptome using Pacific Biosciences (PacBio) long reads and illustrate how the tool is effective in characterizing and describing the composition of the full-length transcriptome. We perform extensive evaluation of ToFU PacBio transcripts by PCR to reveal that an important number of the novel transcripts are technical artifacts of the sequencing approach and that SQANTI quality descriptors can be used to engineer a filtering strategy to remove them. Most novel transcripts in this curated transcriptome are novel combinations of existing splice sites, resulting more frequently in novel ORFs than novel UTRs, and are enriched in both general metabolic and neural-specific functions. We show that these new transcripts have a major impact in the correct quantification of transcript levels by state-of-the-art short-read-based quantification algorithms. By comparing our iso-transcriptome with public proteomics databases, we find that alternative isoforms are elusive to proteogenomics detection. SQANTI allows the user to maximize the analytical outcome of long-read technologies by providing the tools to deliver quality-evaluated and curated full-length transcriptomes