
    Optimization Techniques For Next-Generation Sequencing Data Analysis

    High-throughput RNA sequencing (RNA-Seq) is a popular cost-efficient technology with many medical and biological applications. This technology, however, presents a number of computational challenges in reconstructing full-length transcripts and accurately estimating their abundances across all cell types. Our contributions include (1) transcript and gene expression level estimation methods, (2) methods for genome-guided and annotation-guided transcriptome reconstruction, and (3) de novo assembly and annotation of real data sets. Transcript expression level estimation, also referred to as transcriptome quantification, tackles the problem of estimating the expression level of each transcript. Transcriptome quantification analysis is crucial for determining similar transcripts or unraveling gene functions and transcription regulation mechanisms. We propose a novel simulated-regression-based method for transcriptome frequency estimation from RNA-Seq reads. Transcriptome reconstruction refers to the problem of reconstructing the transcript sequences from the RNA-Seq data. We present genome-guided and annotation-guided transcriptome reconstruction methods. Empirical results on both synthetic and real RNA-Seq datasets show that the proposed methods improve transcriptome quantification and reconstruction accuracy compared to current state-of-the-art methods. We further present the assembly and annotation of the Bugula neritina transcriptome (a marine colonial animal) and the Tallapoosa darter genome (a freshwater fish from a species-rich radiation).

    Statistical modeling and inference for complex-structured count data with applications in genomics and social science

    2020 Spring. Includes bibliographical references. This dissertation describes models, estimation methods, and testing procedures for count data that build upon classic generalized linear models, including Gaussian, Poisson, and negative binomial regression. The methodological extensions proposed in this dissertation are motivated by complex structures for count data arising in three important classes of scientific problems, from both genomics and sociological contexts. Complexities include large scale, temporal dependence, zero-inflation and other mixture features, and group structure. The first class of problems involves count data that are collected from longitudinal RNA sequencing (RNA-seq) experiments, where the data consist of tens of thousands of short time series of counts, with replicate time series under treatment and under control. In order to determine if the time course differs between treatment and control, we consider two questions: 1) whether the treatment affects the geometric attributes of the temporal profiles and 2) whether any treatment effect varies over time. To answer the first question, we determine whether there has been a fundamental change in shape by modeling the transformed count data for genes at each time point using a Gaussian distribution, with the mean temporal profile generated by spline models, and introduce a measurement that quantifies the average minimum squared distance between the locations of peaks (or valleys) of each gene's temporal profile across experimental conditions. We then develop a testing framework based on a permutation procedure. Via simulation studies, we show that the proposed test achieves good power while controlling the false discovery rate. We also apply the test to data collected from a light physiology experiment on maize.
To answer the second question, we model the time series of counts for each gene by a Gaussian-Negative Binomial model and introduce a new testing procedure that enjoys the optimality property of maximum average power. The test allows not only identification of traditional differentially expressed genes but also testing of a variety of composite hypotheses of biological interest. We establish the identifiability of the proposed model, implement the proposed method via efficient algorithms, and demonstrate its good performance via simulation studies. The procedure reveals interesting biological insights when applied to data from an experiment that examines the effect of varying light environments on the fundamental physiology of a marine diatom. The second class of problems involves analyzing group-structured sRNA data that consist of independent replicates of counts for each sRNA across experimental conditions. Most existing methods, for both normalization and differential expression, are designed for non-group-structured data. These methods may fail to provide correct normalization factors or fail to control FDR. They may lack power and may not be able to make inference on group effects. To address these challenges simultaneously, we introduce an inferential procedure using a group-based negative binomial model and a bootstrap testing method. This procedure not only provides a group-based normalization factor, but also enables group-based differential expression analysis. Our method shows good performance in both simulation studies and analysis of experimental data on roundworm. The last class of problems is motivated by the study of sensitive behaviors. These problems involve mixture-distributed count data that are collected by a quantitative randomized response technique (QRRT) which guarantees respondent anonymity. We propose a Poisson regression method based on maximum likelihood estimation computed via the EM algorithm.
This method allows assessment of the importance of potential drivers of different quantities of non-compliant behavior. The method is illustrated with a case study examining potential drivers of non-compliance with hunting regulations in Sierra Leone.

    Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads

    We present a novel pipeline and methodology for simultaneously estimating isoform expression and allelic imbalance in diploid organisms using RNA-seq data. We achieve this by modeling the expression of haplotype-specific isoforms. If unknown, the two parental isoform sequences can be individually reconstructed. A new statistical method, MMSEQ, deconvolves the mapping of reads to multiple transcripts (isoforms or haplotype-specific isoforms). Our software can take into account non-uniform read generation and works with paired-end reads
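
To make the idea of deconvolving multi-mapping reads concrete, here is a minimal sketch of a generic expectation-maximization (EM) scheme that splits each ambiguous read across its compatible transcripts. It is not the MMSEQ model described in the abstract: the function name `em_abundances`, the input layout, and the omission of transcript lengths, haplotypes, and non-uniform read generation are all simplifying assumptions made for illustration.

```python
def em_abundances(compat, n_transcripts, n_iter=200):
    """Estimate relative transcript abundances from multi-mapping reads.

    compat: list of lists; compat[i] holds the indices of the transcripts
            that read i maps to (hypothetical input layout).
    Returns a list of proportions that sums to 1.
    Assumption: transcript length and read-generation bias are ignored.
    """
    if not compat:
        raise ValueError("no reads supplied")
    # Start from a uniform guess.
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        counts = [0.0] * n_transcripts
        # E-step: share each read among its transcripts in proportion
        # to the current abundance estimates.
        for hits in compat:
            total = sum(theta[t] for t in hits)
            if total == 0.0:
                continue
            for t in hits:
                counts[t] += theta[t] / total
        # M-step: renormalize the expected counts into proportions.
        n = sum(counts)
        theta = [c / n for c in counts]
    return theta


# Toy data: 6 reads unique to transcript 0, 2 unique to transcript 1,
# and 4 reads that map to both.
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4
theta = em_abundances(reads, 2)
```

On this toy input the uniquely mapping reads favor transcript 0 three to one, so the shared reads are apportioned in the same ratio and the estimates settle near 0.75 and 0.25.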

    Structured Bayesian methods for splicing analysis in RNA-seq data

    In most eukaryotes, alternative splicing is an important regulatory mechanism of gene expression that results in a single gene coding for multiple protein isoforms, thus greatly increasing the diversity of the proteome. RNA-seq is widely used for genome-wide splicing isoform quantification, and several effective and powerful methods have been developed for splicing analysis with RNA-seq data. However, quantification remains problematic for genes with low coverage or a large number of isoforms. These difficulties may in principle be ameliorated by exploiting correlations encoded in the structured data sources. This thesis contributes to the development of Bayesian methods for splicing analysis by leveraging additional information in multiple datasets with structured prior distributions. First, we developed DICEseq, the first isoform quantification method tailored to time-series RNA-seq experiments. DICEseq explicitly models the correlations between experiments at different time points to aid the quantification of isoforms across experiments. Numerical experiments on both simulated and real datasets show that DICEseq yields more accurate results than state-of-the-art methods, an advantage that can become considerable at low coverage levels. Furthermore, DICEseq makes it possible to quantify the trade-off between temporal sampling of RNA and depth of sequencing, frequently an important choice when planning experiments. Second, we developed BRIE (Bayesian Regression for Isoform Estimation), a Bayesian hierarchical model which resolves the difficulties in splicing analysis in single-cell RNA-seq (scRNA-seq) data by learning an informative prior distribution from sequence features. This method combines quantification and imputation for splicing analysis in a Bayesian framework, which is particularly useful for scRNA-seq data because of its extremely low coverage and high technical noise.
We validated BRIE on several scRNA-seq data sets, showing that BRIE yields reproducible estimates of exon inclusion ratios in single cells. Third, we provided an effective tool that uses Bayes factors to sensitively detect differential splicing between different single cells. When applying BRIE to a few real datasets, we found interesting heterogeneity patterns in splicing events across cell populations, for example alternative exons in DNMT3B. In summary, this thesis proposes structured Bayesian methods to integrate multiple datasets to improve splicing analysis and study its biological functions.

    Computational Methods for Sequencing and Analysis of Heterogeneous RNA Populations

    Next-generation sequencing (NGS) and mass spectrometry technologies bring unprecedented throughput, scalability and speed, facilitating the study of biological systems. These technologies make it possible to sequence and analyze heterogeneous RNA populations rather than single sequences. In particular, they provide the opportunity to implement massive viral surveillance and transcriptome quantification. However, in order to fully exploit the capabilities of NGS technology we need to develop computational methods able to analyze billions of reads for assembly and characterization of sampled RNA populations. In this work we present novel computational methods for cost- and time-effective analysis of sequencing data from viral and RNA samples. In particular, we describe: i) computational methods for transcriptome reconstruction and quantification; ii) a method for mass spectrometry data analysis; iii) a combinatorial pooling method; iv) computational methods for analysis of intra-host viral populations.

    From Pieces To Paths: Combining Disparate Information in Computational Analysis of RNA-Seq.

    As high-throughput sequencing technology has advanced in recent decades, high-resolution, large-scale genomic data have been generated for solving various problems in many fields. One of the state-of-the-art sequencing techniques is RNA sequencing, which has been widely used to study the transcriptomes of biological systems through millions of reads. The ultimate goal of RNA sequencing bioinformatics algorithms is to maximally utilize the information stored in a large amount of pieced-together reads to unveil the whole landscape of biological function at the transcriptome level. Many bioinformatics methods and pipelines have been developed for better achieving this goal. However, one central question of RNA sequencing is the prediction uncertainty due to the short read length and the low sampling rate of underexpressed transcripts. Both conditions raise ambiguities in read mapping, transcript assembly, transcript quantification, and even the downstream analysis. This dissertation focuses on approaches to reducing the above uncertainty by incorporating additional information, of disparate kinds, into bioinformatics models and modeling assessments. I addressed three critical issues in RNA sequencing data analysis: (1) we evaluated the performance of current de novo assembly methods and their evaluation methods using the transcript information from a third-generation sequencing platform, which provides a longer sequence length but with a higher error rate than next-generation sequencing; (2) we built a Bayesian graphical model for improving transcript quantification and differentially expressed isoform identification by utilizing the shared information from biological replicates; (3) we built a joint pathway and gene selection model by incorporating pathway structures from an expert database.
We conclude that the incorporation of appropriate information from extra resources enables a more reliable assessment and a higher prediction performance in RNA sequencing data analysis.

    HBA-DEALS: accurate and simultaneous identification of differential expression and splicing using hierarchical Bayesian analysis.

    We present Hierarchical Bayesian Analysis of Differential Expression and ALternative Splicing (HBA-DEALS), which simultaneously characterizes differential expression and splicing in cohorts. HBA-DEALS attains state-of-the-art or better performance for both expression and splicing and allows genes to be characterized as having differential gene expression (DGE), differential alternative splicing (DAST), both, or neither. HBA-DEALS analysis of GTEx data demonstrated sets of genes that show predominant DGE or DAST across multiple tissue types. These sets have pervasive differences with respect to gene structure, function, membership in protein complexes, and promoter architecture.

    Bayesian nonparametric discovery of isoforms and individual specific quantification

    Most human protein-coding genes can be transcribed into multiple distinct mRNA isoforms. These alternative splicing patterns encourage molecular diversity, and dysregulation of isoform expression plays an important role in disease etiology. However, isoforms are difficult to characterize from short-read RNA-seq data because they share identical subsequences and occur in different frequencies across tissues and samples. Here, we develop BIISQ, a Bayesian nonparametric model for isoform discovery and individual specific quantification from short-read RNA-seq data. BIISQ does not require isoform reference sequences but instead estimates an isoform catalog shared across samples. We use stochastic variational inference for efficient posterior estimates and demonstrate superior precision and recall for simulations compared to state-of-the-art isoform reconstruction methods. BIISQ shows the most gains for low abundance isoforms, with 36% more isoforms correctly inferred at low coverage versus a multi-sample method and 170% more versus single-sample methods. We estimate isoforms in the GEUVADIS RNA-seq data and validate inferred isoforms by associating genetic variants with isoform ratios

    Transcript assembly, quantification and differential alternative splicing detection from RNA-Seq

    This dissertation is focused on improving RNA-Seq processing in terms of transcript assembly, transcript quantification and detection of differential alternative splicing. There are two major challenges in solving these three problems. The first is accurately deriving transcript-level expression values from RNA-Seq reads that often align ambiguously to a set of overlapping isoforms. To make matters worse, gene annotation tends to misguide transcript quantification, as new transcripts are often discovered in new RNA-Seq experiments. The second challenge is accounting for intrinsic uncertainties or variabilities in RNA-Seq measurement when calling differential alternative splicing from multiple samples across two conditions. Those uncertainties include coverage bias and biological variations. Failing to account for these variabilities can lead to higher false positive rates. To address these challenges, I develop a series of novel algorithms which are implemented in a software package called Strawberry. To tackle the read assignment uncertainty challenge, Strawberry assembles aligned RNA-Seq reads into transcripts using a constrained flow network algorithm. After the assembly, Strawberry uses a latent class model to assign reads to transcripts. These two steps use different optimization frameworks but utilize the same graph structure, which allows a highly efficient, expandable and accurate algorithm for dealing with large data. To infer differential alternative splicing, Strawberry extends the single-sample quantification model by imposing a generalized linear model on the relative transcript proportions. To account for count overdispersion, Strawberry uses an empirical Bayesian hierarchical model. For coverage bias, Strawberry performs a bias correction step which borrows information across samples and genes before fitting the differential analysis model. A series of simulated and real data sets is used to evaluate and benchmark Strawberry's results.
Strawberry outperforms Cufflinks and StringTie in terms of both assembly and quantification accuracy. In terms of detecting differential alternative splicing, Strawberry also outperforms several state-of-the-art methods including DEXSeq, Cuffdiff 2 and DSGseq. Strawberry and its supporting code, e.g., simulation and validation, are freely available on my GitHub (https://github.com/ruolin).