4 research outputs found

    Methods to study splicing from high-throughput RNA Sequencing data

    The development of novel high-throughput sequencing (HTS) methods for RNA (RNA-Seq) has provided a very powerful means to study splicing under multiple conditions at unprecedented depth. However, the complexity of the information to be analyzed has turned this into a challenging task. In the last few years, a plethora of tools have been developed, allowing researchers to process RNA-Seq data to study the expression of isoforms and splicing events, and their relative changes under different conditions. We provide an overview of the methods available to study splicing from short RNA-Seq data. We group the methods according to the different questions they address: 1) Assignment of the sequencing reads to their likely gene of origin. This is addressed by methods that map reads to the genome and/or to the available gene annotations. 2) Recovering the sequence of splicing events and isoforms. This is addressed by transcript reconstruction and de novo assembly methods. 3) Quantification of events and isoforms. Either after reconstructing transcripts or using an annotation, many methods estimate the expression level or the relative usage of isoforms and/or events. 4) Providing an isoform or event view of differential splicing or expression. These include methods that compare relative event/isoform abundance or isoform expression across two or more conditions. 5) Visualizing splicing regulation. Various tools facilitate the visualization of the RNA-Seq data in the context of alternative splicing. In this review, we do not describe the specific mathematical models behind each method. Our aim is rather to provide an overview that could serve as an entry point for users who need to decide on a suitable tool for a specific analysis. We also attempt to propose a classification of the tools according to the operations they perform, to facilitate the comparison and choice of methods. Comment: 31 pages, 1 figure, 9 tables. Small corrections adde

    A random effects model for the identification of differential splicing (REIDS) using exon and HTA arrays

    Background: Alternative gene splicing is a common phenomenon in which a single gene gives rise to multiple transcript isoforms. The process is strictly guided and involves a multitude of proteins and regulatory complexes. Unfortunately, aberrant splicing events do occur, and these have been linked to genetic disorders, such as several types of cancer and neurodegenerative diseases (Fan et al., Theor Biol Med Model 3:19, 2006). Therefore, understanding the mechanism of alternative splicing and identifying the differences in splicing events between diseased and healthy tissue is crucial in biomedical research, with potential applications in personalized medicine as well as in drug development. Results: We propose a linear mixed model, Random Effects for the Identification of Differential Splicing (REIDS), for the identification of alternative splicing events. Based on two scores, an exon score and an array score, a decision regarding alternative splicing can be made. The model makes it possible to distinguish a differentially expressed gene from a differentially spliced exon. The proposed model was applied to three case studies concerning both exon and HTA arrays. Conclusion: The REIDS model provides a workflow for the identification of alternative splicing events relying on the established linear mixed model. The model can be applied to different types of arrays.

    Characterizing and reassembling the COPD and ILD transcriptome using RNA-Seq

    Chronic Obstructive Pulmonary Disease (COPD) is the third leading cause of death in the US, and idiopathic pulmonary fibrosis (IPF), a type of Interstitial Lung Disease (ILD), is a fast-acting, irreversible disease that leads to mortality within 3-5 years. RNA sequencing (RNA-seq) provides the opportunity to quantitatively examine the sequences of millions of mRNAs, and offers the potential to gain unprecedented insights into the structure of the transcriptome in chronic non-malignant lung disease. By identifying changes in splicing and novel loci expression associated with disease, we may be able to gain a better understanding of their pathogenesis, identify novel disease-specific biomarkers, and find better targets for therapy. Using RNA-seq data that our group generated on 281 human lung tissue samples (47 control, 131 COPD, 103 ILD), I initially defined the transcriptomic landscape of lung tissue by identifying which genes were expressed in each tissue sample. I used a mixture model to separate genes into reliably and not reliably expressed groups. Next, I employed reads that overlapped splice junctions in a linear model interaction term to identify disease-specific differential splicing. I identified alternatively spliced genes between control and disease tissues and validated three of these genes (PDGFA, NUMB, SCEL) with qPCR and NanoString (a hybridization-based barcoding technique used to quantify transcripts). Finally, I implemented and improved a pipeline to perform transcriptome assembly using Cufflinks that led to the identification of 1,855 novel loci that did not overlap with UCSC, Vega, and Ensembl annotations. The loci were classified into potential coding and non-coding loci (191 and 1,664, respectively). Expression analysis revealed that there were 120 IPF-associated and 10 emphysema-associated differentially expressed (q < 0.01) novel loci. RNA-seq provides a high-resolution transcript-level view of the pulmonary transcriptome and its modification in lung disease. 
It has enabled a new understanding of the lung transcriptome structure because it measures not only the transcripts we know but also the ones we do not know. The approaches and improvements I have employed have identified these novel targets and make possible further downstream functional analysis that could identify better targets for therapy and lead to an even better understanding of chronic lung disease pathogenesis.

    Finite Bayesian mixture models with applications in spatial cluster analysis and bioinformatics

    In many statistical applications, one encounters populations that form homogeneous subgroups regarding one or several characteristics. Across the subgroups, however, heterogeneity may often be found. Mixture distributions are a natural means to model data from such applications. This PhD thesis is based on two projects that focus on such applications. In the first project, spatial nanoscale clusters formed by Ras proteins in the cell membrane are investigated. Such clusters play a crucial role in intracellular communication and are thus of interest in cancer research. In this case, the subgroups are clustered and non-clustered proteins. In the second project, epigenomic data obtained from sequencing experiments are integrated with another genomic or epigenomic input, aiming, e.g., to detect genes that contribute to the development of cancer. Here, the subgroups are defined by a) genes presenting congruent (epi)genomic aberrations in both considered variables, b) genes presenting incongruent aberrations, and c) genes lacking aberrations in at least one of the variables. Employing a Bayesian framework, objects are classified in both projects by fitting finite univariate mixture distributions with a small fixed number of components to values from a score summarizing relevant information about the research question. Such mixture distributions have favorable characteristics in terms of interpretation and present little sensitivity to label switching in Markov chain Monte Carlo analyses. Mixtures of gamma distributions are considered for Ras proteins, while mixtures of normal and exponential or gamma distributions are a focus for the bioinformatic analysis. In the latter, classification is the primary goal, while in the Ras protein application, estimating key parameters of the spatial clustering is of more interest. The results of both projects are presented in this thesis. 
For both applications, the methods have been implemented in software and their performance is compared with competing approaches on experimental as well as on simulated data. To warrant an appropriate simulation of Ras protein patterns, a new cluster point process model called the double Matérn cluster process is developed and described in this thesis.
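
As a rough illustration of classifying objects by fitting a finite univariate mixture to a score, the sketch below fits a two-component normal mixture and returns, for each score, the estimated probability of belonging to the second component. It is not the thesis's method: it substitutes maximum-likelihood EM for the Bayesian MCMC fitting described above, uses only normal components rather than gamma or exponential ones, and runs on simulated scores. All function names and parameter values (`simulate_scores`, `fit_two_component_mixture`, the cluster means 0 and 6) are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def simulate_scores(seed=0, n_low=300, n_high=200):
    """Simulated scores from two subgroups (illustrative values only)."""
    rng = random.Random(seed)
    low = [rng.gauss(0.0, 1.0) for _ in range(n_low)]
    high = [rng.gauss(6.0, 1.0) for _ in range(n_high)]
    return low + high


def fit_two_component_mixture(scores, n_iter=200):
    """Fit a two-component normal mixture to 1-D scores with EM.

    Returns (weights, means, sds, resp); resp[i] is the estimated
    probability that scores[i] belongs to the second component,
    which is initialised at the upper quartile of the data.
    """
    ordered = sorted(scores)
    n = len(ordered)
    # Crude initialisation: lower and upper quartiles as starting means.
    means = [ordered[n // 4], ordered[(3 * n) // 4]]
    overall_mean = sum(scores) / n
    overall_sd = math.sqrt(sum((x - overall_mean) ** 2 for x in scores) / n)
    sds = [overall_sd, overall_sd]
    weights = [0.5, 0.5]
    resp = [0.5] * n

    for _ in range(n_iter):
        # E-step: posterior probability of the second component per score.
        resp = []
        for x in scores:
            p0 = weights[0] * normal_pdf(x, means[0], sds[0])
            p1 = weights[1] * normal_pdf(x, means[1], sds[1])
            total = p0 + p1
            resp.append(p1 / total if total > 0 else 0.5)
        # M-step: update weight, mean and sd of each component.
        for k in (0, 1):
            r = [ri if k == 1 else 1.0 - ri for ri in resp]
            nk = sum(r)
            if nk < 1e-12:
                continue  # component is empty; keep its old parameters
            weights[k] = nk / n
            means[k] = sum(ri * x for ri, x in zip(r, scores)) / nk
            var = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, scores)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)  # floor avoids division by zero
    return weights, means, sds, resp
```

Calling `fit_two_component_mixture(simulate_scores())` and thresholding `resp` (for example at 0.5) gives a hard classification into the two subgroups. A Bayesian treatment would instead place priors on the weights and component parameters and sample them.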