    Methods to study splicing from high-throughput RNA Sequencing data

    The development of novel high-throughput sequencing (HTS) methods for RNA (RNA-Seq) has provided a very powerful means to study splicing under multiple conditions at unprecedented depth. However, the complexity of the information to be analyzed has turned this into a challenging task. In the last few years, a plethora of tools have been developed, allowing researchers to process RNA-Seq data to study the expression of isoforms and splicing events, and their relative changes under different conditions. We provide an overview of the methods available to study splicing from short RNA-Seq data. We group the methods according to the different questions they address: 1) Assignment of the sequencing reads to their likely gene of origin. This is addressed by methods that map reads to the genome and/or to the available gene annotations. 2) Recovering the sequence of splicing events and isoforms. This is addressed by transcript reconstruction and de novo assembly methods. 3) Quantification of events and isoforms. Either after reconstructing transcripts or using an annotation, many methods estimate the expression level or the relative usage of isoforms and/or events. 4) Providing an isoform or event view of differential splicing or expression. These include methods that compare relative event/isoform abundance or isoform expression across two or more conditions. 5) Visualizing splicing regulation. Various tools facilitate the visualization of the RNA-Seq data in the context of alternative splicing. In this review, we do not describe the specific mathematical models behind each method. Our aim is rather to provide an overview that could serve as an entry point for users who need to decide on a suitable tool for a specific analysis. We also propose a classification of the tools according to the operations they perform, to facilitate the comparison and choice of methods. Comment: 31 pages, 1 figure, 9 tables. Small corrections adde

    Identification and Functional Annotation of Alternatively Spliced Isoforms

    Alternative splicing is a key mechanism for increasing the complexity of the transcriptome and proteome in eukaryotic cells. A large portion of multi-exon genes in humans undergo alternative splicing, and this can have significant functional consequences, as the proteins translated from alternatively spliced mRNA might have different amino acid sequences and structures. The study of alternative splicing events has been accelerated by next-generation sequencing technology. However, reconstruction of transcripts from short-read RNA sequencing is not sufficiently accurate. Recent progress in single-molecule long-read sequencing has provided researchers with alternative ways to help solve this problem. With the help of both short and long RNA sequencing technologies, tens of thousands of splice isoforms have been catalogued in humans and other species, but relatively few of the protein products of splice isoforms have been characterized functionally, structurally and biochemically. The scope of this dissertation includes using short and long RNA sequencing reads together for the purpose of transcript reconstruction, and using high-throughput RNA-sequencing data and gene-level gene ontology functional annotations to predict functions for alternatively spliced isoforms in mouse and human. In the first chapter, I give an introduction to alternative splicing and discuss the existing studies where next-generation sequencing is used for transcript identification. Then, I define the isoform function prediction problem, and explain how it differs from the better-known gene function prediction problem. In the second chapter of this dissertation, I describe our study where the overall transcriptome of kidney is studied using both long reads from the PacBio platform and RNA-seq short reads from the Illumina platform. 
We used short reads to validate full-length transcripts found by long PacBio reads, and generated two high-quality sets of transcript isoforms that are expressed in glomerular and tubulointerstitial compartments. In the third chapter, I describe our generic framework, where we implemented and evaluated several related algorithms for isoform function prediction for mouse isoforms. We tested these algorithms through both computational evaluation and experimental validation of the predicted ‘responsible’ isoform(s) and the predicted disparate functions of the isoforms of Cdkn2a and of Anxa6. Our algorithm is the first effort to predict and differentiate isoform functions through large-scale genomic data integration. In the fourth chapter, I present the extension of the isoform function prediction study to the protein-coding isoforms in human. We used a similar multiple instance learning (MIL)-based approach for predicting the function of protein-coding splice variants in human. We evaluated our predictions using literature evidence for the ADAM15, LMNA/C, and DMXL2 genes. In the fifth and final chapter, I give a summary of previous chapters and outline the future directions for alternatively spliced isoform reconstruction and function prediction studies. PHD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/144017/1/ridvan_1.pd

    Finding differential splice junctions in RNA-Seq data as transcriptional biomarkers for prostate cancer

    Alternative RNA splicing is a naturally occurring phenomenon that has been associated with different types of cancer. Detecting splice junctions in the genome of an organism is the key to the study of alternative splicing. RNA-Seq, as a high-throughput sequencing technology, has recently opened new horizons for the study of various areas of transcriptomics, such as gene expression, chimeric events and alternative splicing. In this research, we study prostate cancer, the second most common cancer in North America, from the viewpoint of splicing events. We have proposed a method for detecting differential splice junctions, and in a broader sense splice variants, from RNA-Seq data. We have designed a 2-D peak finding algorithm to combine junctions and remove dubious ones across the different samples of our population. A scoring mechanism is used to select junctions as features for distinguishing RNA-Seq data of patients diagnosed with prostate cancer from benign samples. These junctions could be proposed as potential biomarkers for prostate cancer. We have employed support vector machines, which proved to be highly successful in prediction of prostate cancer

    USING MACHINE LEARNING TECHNIQUES FOR FINDING MEANINGFUL TRANSCRIPTS IN PROSTATE CANCER PROGRESSION

    Prostate cancer is one of the most common types of cancer among Canadian men. Next-generation sequencing with RNA-Seq can be valuable in studying cancer, since it provides large amounts of data as a source of information about biomarkers. For these reasons, we have chosen RNA-Seq data for prostate cancer progression in our study. In this research, we propose a new method for finding transcripts that can be used as genomic features. To this end, we have gathered a very large number of transcripts. Many of these transcripts are not relevant, and we filter them out by applying a feature selection algorithm. The results are then processed by a machine learning classification technique, the support vector machine, which is used to classify different stages of prostate cancer. Finally, we have identified potential transcripts associated with prostate cancer progression. Ideally, these transcripts can be used for improving diagnosis, treatment, and drug development

    Identification of novel alternative splicing biomarkers for breast cancer with LC/MS/MS and RNA-Seq

    Background: Alternative splicing isoforms have been reported as a new and robust class of diagnostic biomarkers. Over 95% of human genes are estimated to be alternatively spliced, a powerful means of producing functionally diverse proteins from a single gene. The emergence of next-generation sequencing technologies, especially RNA-seq, provides novel insights into large-scale detection and analysis of alternative splicing at the transcriptional level. Advances in proteomic technologies, such as liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS), have shown tremendous power for the parallel characterization of large numbers of proteins in biological samples. Although previous qualitative comparative analyses have generally found poor correspondence between proteomics and microarray data, significantly higher degrees of correlation have been observed at the exon level. Combining protein and RNA data by searching LC-MS/MS data against a customized protein database from RNA-Seq may produce a subset of alternatively spliced protein isoform candidates that have higher confidence. Results: We developed a bioinformatics workflow to discover alternative splicing biomarkers from LC-MS/MS using RNA-Seq. First, we retrieved high-confidence, novel alternative splicing biomarkers from the breast cancer RNA-Seq database. Then, we translated these sequences into in silico Isoform Junction Peptides, and created a customized alternative splicing database for MS searching. Lastly, we ran the Open Mass spectrometry Search Algorithm against the customized alternative splicing database with the breast cancer plasma proteome. Twenty-six alternative splicing biomarker peptides with one single intron event and one exon skipping event were identified. 
Further interpretation of biological pathways with our Integrated Pathway Analysis Database showed that these 26 peptides are associated with Cancer, Signaling, Metabolism, Regulation, Immune System and Hemostasis pathways, which is consistent with the 256 alternative splicing biomarkers from the RNA-Seq. Conclusions: This paper presents a bioinformatics workflow for using RNA-seq data to discover novel alternative splicing biomarkers from the breast cancer proteome. As a complement to the synthetic alternative splicing database technique for alternative splicing identification, this method combines the advantages of two platforms, mass spectrometry and next-generation sequencing, and can help identify potentially highly sample-specific alternative splicing isoform biomarkers at an early stage of cancer
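
    The translation step in the workflow above (turning a junction-spanning RNA-Seq sequence into candidate in silico junction peptides) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `flank` parameter, and the choice to translate all three forward frames and discard peptides that stop before the junction are assumptions made for illustration only.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate a DNA sequence from its first base, stopping at a stop codon."""
    peptide = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3].upper(), "X")  # X marks an unknown codon
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)


def junction_peptides(upstream_exon: str, downstream_exon: str, flank: int = 30) -> dict:
    """Translate the sequence spanning an exon-exon junction in three frames.

    `flank` (a hypothetical parameter, must be >= 1) is the number of
    nucleotides kept on each side of the junction. Only peptides whose
    translated span reaches past the junction are returned.
    """
    left = upstream_exon[-flank:]
    right = downstream_exon[:flank]
    spanning = left + right
    peptides = {}
    for frame in range(3):
        peptide = translate(spanning[frame:])
        # Keep the peptide only if it covers at least one downstream base.
        if frame + 3 * len(peptide) > len(left):
            peptides[frame] = peptide
    return peptides
```

    The resulting peptides could then be written to a FASTA file and used as a custom search database; that step is omitted here because the abstract does not describe its details.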

    A SVM-based method to classify RBM20 affected and not affected exons

    Mutations of RNA binding motif protein 20 (RBM20) have recently been reported to cause human dilated cardiomyopathy (DCM) (Brauch et al., 2009, Li et al., 2010). DCM is the major cause of heart failure and mortality around the world (Jefferies and Towbin, 2010). Overall, 25–50% of DCM cases are familial, and causative mutations have been described in more than 50 genes, mostly encoding structural components of cardiomyocytes. RBM20 belongs to the family of the SR and SR-related RNA binding proteins, which assemble in the spliceosome and take part in the splicing of pre-mRNA. RBM20 is mainly expressed in striated muscle, with the highest levels in the heart (Guo et al., 2012). Due to its involvement in DCM, RBM20 has been studied extensively to unveil its mechanism of action and its RNA targets (Guo et al., 2012, Li et al., 2013). Guo and colleagues reported a set of 31 genes showing RBM20-dependent splicing from a whole transcriptome analysis in rats and humans (Guo et al., 2012). More recently, Maatz and colleagues reported an additional set of 18 rat genes and observed that RNA sequences recognized by RBM20 are likely to be located in the 400 nucleotides flanking the exons whose alternative splicing is regulated by RBM20 (Maatz et al., 2014). However, both the suggested RNA sequence recognized by RBM20 and its over-representation in the flanking regions of affected exons remain poor predictors of target genes presenting splicing events regulated by RBM20. The aim of this work was, thus, to characterize, through a bioinformatic approach, the sequence motifs of the exons whose alternative splicing was affected by RBM20, in order to improve the prediction of the genes (exons) affected by RBM20. A differential expression analysis was performed to select the dataset of RBM20-affected exons; a further dataset was retrieved from literature data (Maatz et al., 2014). 
A Support Vector Machine (SVM) approach evaluating several kinds of genetic elements binding in the flanking regions of our target exons was used. An SVM method was chosen to classify RBM20-affected and unaffected exons; other machine learning algorithms could have been used as well, but SVM is among the most commonly used ones. In our analyses, the model discriminated well between RBM20-affected and unaffected exons. From a biological and functional point of view, this approach helps us to target novel candidate genes associated with diseases that depend on a dysregulation of RBM20. This study provided additional information about RBM20 regulation of target exons, based not only on the RNA binding site, but also on other genetic elements associated with the binding site. Furthermore, we proposed the first model based on an SVM algorithm for the classification of RBM20-affected and unaffected exons
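
    As a rough illustration of how the flanking regions described above could be turned into numeric inputs for a classifier such as an SVM, the sketch below extracts the sequence on each side of an exon and computes normalized k-mer frequencies. The abstract does not specify the features that were actually used, so the k-mer representation, the function names, and every parameter other than the 400-nucleotide flank are assumptions.

```python
from collections import Counter
from itertools import product


def flanking_regions(seq: str, exon_start: int, exon_end: int, flank: int = 400):
    """Return (upstream, downstream) flanks of an exon.

    Coordinates are 0-based and half-open: the exon is seq[exon_start:exon_end].
    Flanks are truncated at the ends of the sequence.
    """
    upstream = seq[max(0, exon_start - flank):exon_start]
    downstream = seq[exon_end:exon_end + flank]
    return upstream, downstream


def kmer_features(region: str, k: int = 2) -> list:
    """Frequencies of all 4**k DNA k-mers in `region`, in lexicographic order."""
    n_windows = len(region) - k + 1
    counts = Counter(region[i:i + k] for i in range(max(0, n_windows)))
    total = max(1, n_windows)  # avoid division by zero for short regions
    return [counts["".join(kmer)] / total for kmer in product("ACGT", repeat=k)]


def exon_feature_vector(seq: str, exon_start: int, exon_end: int, k: int = 2) -> list:
    """Concatenate k-mer frequencies of the upstream and downstream flanks."""
    upstream, downstream = flanking_regions(seq, exon_start, exon_end)
    return kmer_features(upstream, k) + kmer_features(downstream, k)
```

    Vectors built this way, one per exon and labeled as affected or not affected, are the kind of input that a standard SVM implementation accepts.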

    Learning the Regulatory Code of Gene Expression

    Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology

    Post-Transcriptional Regulation In The Drosophila Sex Determination Pathway

    Sexually reproducing organisms produce two very different phenotypes (males and females) by differential deployment of essentially the same gene content. This dimorphism provides an excellent model to study how transcriptomes are differentially regulated, which is one of the central problems of biology. The core sex determination pathway of Drosophila is a well-described cascade of transcriptional and post-transcriptional regulation, but knowledge of the downstream components is largely incomplete. High-throughput technologies have provided great advances in understanding transcriptome regulation, but limits of the technology have led to a focus on whole-gene expression measurements, rather than post-transcriptional regulation. RNA-Seq experiments, in which transcripts are converted to cDNA and sequenced, allow the resolution and quantification of alternative transcript isoforms, potentially elucidating the post-transcriptional network. However, methods to analyze splicing are underdeveloped, and challenges in transcript assembly and quantification remain unresolved. This work describes the development of the Splicing Analysis Kit (Spanki) as a fast, open-source suite of tools that uses simulations based on real RNA-Seq data to characterize errors in a given dataset, and user-tunable filters that minimize those errors. Spanki quantifies splicing differences in transcripts from the same loci within a sample, as well as between samples, by using only those reads that directly assay splicing events (junction-spanning reads). Despite the reliance on a fraction of the total data, the sequencing depth typically generated in an RNA-Seq experiment is sufficient to identify differentially regulated splicing, and error profiles are superior. I demonstrate that this computational approach outperforms several commonly used approaches in an analysis of sex-differential splicing in Drosophila heads. 
Next I examine the effects of disrupting post-transcriptional regulation in Drosophila heads. I apply the Spanki software to analyze RNA-Seq data for mutant lines of two post-transcriptional regulators: Darkener of apricot (Doa) and found in neurons (fne). Doa, a serine-threonine kinase, regulates splicing by phosphorylating SR proteins, vital components of the splicing machinery. Found in neurons (fne) binds to transcripts and is involved in RNA metabolism. I demonstrate sex-differences in response to disruption of post-transcriptional regulation, and hypothesize that they are informative of sex-differentiation pathways. Finally, I examine the conservation of splicing regulation within the Drosophila lineage. I show that junction based splicing analysis is effective in making interspecific comparisons without the need for complete transcript models. I use these results to demonstrate the conservation of sex-differential splicing across 40 million years of evolution in 15 species in the Drosophila genus
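
    The idea of quantifying splicing from junction-spanning reads alone can be sketched with the commonly used "percent spliced in" (PSI) measure for a cassette exon. This is a generic illustration, not Spanki's actual model or API; the function names and the read counts in the usage note are made up for the example.

```python
from typing import Optional


def psi_exon_skipping(inc_upstream: int, inc_downstream: int, exclusion: int) -> Optional[float]:
    """Percent spliced in (PSI) of a cassette exon from junction read counts.

    Exon inclusion is supported by two junctions (upstream and downstream of
    the exon), exon skipping by one, so inclusion counts are averaged before
    comparison. Returns None when no junction reads support either form.
    """
    inclusion = (inc_upstream + inc_downstream) / 2
    total = inclusion + exclusion
    if total == 0:
        return None
    return inclusion / total


def delta_psi(sample_a: tuple, sample_b: tuple) -> Optional[float]:
    """Difference in PSI between two samples, each given as
    (inc_upstream, inc_downstream, exclusion) junction read counts."""
    psi_a = psi_exon_skipping(*sample_a)
    psi_b = psi_exon_skipping(*sample_b)
    if psi_a is None or psi_b is None:
        return None
    return psi_a - psi_b
```

    For example, with hypothetical counts of (30, 30, 10) in one sample and (10, 10, 30) in another, the exon is included in 75% of transcripts in the first sample and 25% in the second, a PSI difference of 0.5.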

    Differential Architecture Search in Deep Learning for DNA Splice Site Classification

    The data explosion caused by unprecedented advancements in the field of genomics is constantly challenging the conventional methods used in the interpretation of the human genome. The demand for robust algorithms over the recent years has brought huge success in the field of Deep Learning (DL) in solving many difficult tasks in image, speech and natural language processing by automating the manual process of architecture design