14 research outputs found

    GPrank: an R package for detecting dynamic elements from genome-wide time series

    Background: Genome-wide high-throughput sequencing (HTS) time series experiments are a powerful tool for monitoring various genomic elements over time. They can be used to monitor, for example, gene or transcript expression with RNA sequencing (RNA-seq), DNA methylation levels with bisulfite sequencing (BS-seq), or abundances of genetic variants in populations with pooled sequencing (Pool-seq). However, because of high experimental costs, the time series data sets often consist of a very limited number of time points with very few or no biological replicates, posing challenges in the data analysis. Results: Here we present the GPrank R package for modelling genome-wide time series by incorporating variance information obtained during pre-processing of the HTS data using probabilistic quantification methods or from a beta-binomial model using sequencing depth. GPrank is well-suited for analysing both short and irregularly sampled time series. It is based on modelling each time series by two Gaussian process (GP) models, namely, time-dependent and time-independent GP models, and comparing the evidence provided by the data under the two models by computing their Bayes factor (BF). Genomic elements are then ranked by their BFs, and the temporally most dynamic elements can be identified. Conclusions: Incorporating the variance information helps GPrank avoid false positives without compromising computational efficiency. Fitted models can be easily further explored in a browser. Detection and visualisation of the temporally most dynamic elements in the genome can provide a good starting point for further downstream analyses for increasing our understanding of the studied processes.
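
The ranking idea described in this abstract can be illustrated with a minimal Python sketch. It is not the GPrank API (GPrank is an R package). The function names, the squared-exponential kernel, and the coarse grid search that stands in for proper hyperparameter optimisation are all assumptions made for illustration. The per-observation variances (`obs_var`) play the role of the variance information from pre-processing and are added to the covariance as a fixed noise floor.

```python
import numpy as np


def log_marginal_likelihood(y, cov):
    """Log density of y under a zero-mean Gaussian with covariance cov."""
    chol = np.linalg.cholesky(cov)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(chol)))
            - 0.5 * len(y) * np.log(2.0 * np.pi))


def rbf_kernel(t, signal_var, length_scale):
    """Squared-exponential covariance between all pairs of time points."""
    diff = t[:, None] - t[None, :]
    return signal_var * np.exp(-0.5 * (diff / length_scale) ** 2)


def log_bayes_factor(t, y, obs_var,
                     signal_vars=(0.1, 1.0, 10.0),
                     length_scales=(1.0, 2.0, 4.0),
                     extra_noise_vars=(0.01, 0.1, 1.0, 10.0)):
    """Approximate log BF of a time-dependent vs a time-independent model.

    Hypothetical helper: each model's evidence is approximated by the best
    marginal likelihood found on a small hyperparameter grid.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    y = y - y.mean()  # both models are zero-mean, so centre the series
    n = len(y)
    # Known observation variances act as a lower bound on the noise level.
    fixed_noise = np.diag(np.asarray(obs_var, dtype=float))

    # Time-independent model: noise only, no temporal correlation.
    ll_indep = max(
        log_marginal_likelihood(y, fixed_noise + s * np.eye(n))
        for s in extra_noise_vars
    )
    # Time-dependent model: smooth temporal signal plus the same noise terms.
    ll_dep = max(
        log_marginal_likelihood(
            y, rbf_kernel(t, sv, ls) + fixed_noise + s * np.eye(n))
        for sv in signal_vars
        for ls in length_scales
        for s in extra_noise_vars
    )
    return ll_dep - ll_indep


# Usage: rank two toy series by their approximate log Bayes factor.
t = np.arange(8.0)
obs_var = np.full(8, 0.01)
series = {
    "trend": np.arange(8.0),                   # steadily increasing
    "flat": 0.1 * (-1.0) ** np.arange(8),      # small alternating noise
}
ranking = sorted(series,
                 key=lambda k: log_bayes_factor(t, series[k], obs_var),
                 reverse=True)
```

Sorting by the log Bayes factor puts series with clear temporal structure ahead of series that are adequately explained by noise alone.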

    Gaussian process test for high-throughput sequencing time series: application to experimental evolution

    The work was supported under the European ERASysBio+ initiative project ‘SYNERGY’ through the Academy of Finland [135311]. A.H. was also supported by the Academy of Finland [259440] and H.T. was supported by the Alfred Kordelin Foundation. R.K. was supported by ERC (ArchAdapt). A.J. is a member of the Vienna Graduate School of Population Genetics, which is supported by a grant of the Austrian Science Fund (FWF) [W1225-B20].
    MOTIVATION: Recent advances in high-throughput sequencing (HTS) have made it possible to monitor genomes in great detail. New experiments not only use HTS to measure genomic features at one time point but also monitor them changing over time with the aim of identifying significant changes in their abundance. In population genetics, for example, allele frequencies are monitored over time to detect significant frequency changes that indicate selection pressures. Previous attempts at analyzing data from HTS experiments have been limited as they could not simultaneously include data at intermediate time points, replicate experiments and sources of uncertainty specific to HTS such as sequencing depth. RESULTS: We present the beta-binomial Gaussian process model for ranking features with significant non-random variation in abundance over time. The features are assumed to represent proportions, such as the proportion of an alternative allele in a population. We use the beta-binomial model to capture the uncertainty arising from finite sequencing depth and combine it with a Gaussian process model over the time series. In simulations that mimic the features of experimental evolution data, the proposed method clearly outperforms classical testing in average precision of finding selected alleles. We also present simulations exploring different experimental design choices and results on real data from a Drosophila experimental evolution experiment on temperature adaptation. AVAILABILITY AND IMPLEMENTATION: R software implementing the test is available at https://github.com/handetopa/BBGP.
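
To make the role of sequencing depth concrete, here is a small Python sketch of how finite depth can be turned into a variance for an observed allele proportion. It assumes a Beta prior on the proportion (uniform by default) updated with the read counts. The function name and the prior choice are illustrative assumptions, not taken from the BBGP code.

```python
def beta_posterior_moments(alt_count, depth, prior_a=1.0, prior_b=1.0):
    """Posterior mean and variance of an allele proportion.

    Assumes alt_count reads out of depth carry the alternative allele and
    that the proportion has a Beta(prior_a, prior_b) prior (hypothetical
    default: uniform).
    """
    a = prior_a + alt_count
    b = prior_b + depth - alt_count
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var


# Same observed proportion, different sequencing depth.
mean_low, var_low = beta_posterior_moments(alt_count=5, depth=10)
mean_high, var_high = beta_posterior_moments(alt_count=50, depth=100)
```

Both calls give the same mean, but the shallowly sequenced site gets a larger variance, so a downstream Gaussian process model would trust it less.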

    Genome-wide modeling of transcription kinetics reveals patterns of RNA production delays

    Genes with similar transcriptional activation kinetics can display very different temporal mRNA profiles because of differences in transcription time, degradation rate, and RNA-processing kinetics. Recent studies have shown that a splicing-associated RNA production delay can be significant. To investigate this issue more generally, it is useful to develop methods applicable to genome-wide datasets. We introduce a joint model of transcriptional activation and mRNA accumulation that can be used for inference of transcription rate, RNA production delay, and degradation rate given data from high-throughput sequencing time course experiments. We combine a mechanistic differential equation model with a nonparametric statistical modeling approach allowing us to capture a broad range of activation kinetics, and we use Bayesian parameter estimation to quantify the uncertainty in estimates of the kinetic parameters. We apply the model to data from estrogen receptor alpha activation in the MCF-7 breast cancer cell line. We use RNA polymerase II ChIP-Seq time course data to characterize transcriptional activation and mRNA-Seq time course data to quantify mature transcripts. We find that 11% of genes with a good signal in the data display a delay of more than 20 min between completing transcription and mature mRNA production. The genes displaying these long delays are significantly more likely to be short. We also find a statistical association between high delay and late intron retention in pre-mRNA data, indicating significant splicing-associated production delays in many genes.

    A novel variant in SMG9 causes intellectual disability, confirming a role for nonsense-mediated decay components in neurocognitive development

    Biallelic loss-of-function variants in the SMG9 gene, encoding a regulatory subunit of the mRNA nonsense-mediated decay (NMD) machinery, are reported to cause heart and brain malformation syndrome. Here we report five patients from three unrelated families with intellectual disability (ID) and a novel pathogenic SMG9 c.551T > C p.(Val184Ala) homozygous missense variant, identified using exome sequencing. Sanger sequencing confirmed recessive segregation in each family. SMG9 c.551T > C p.(Val184Ala) is most likely an autozygous variant identical by descent. Characteristic clinical findings in patients were mild to moderate ID, intention tremor, pyramidal signs, dyspraxia, and ocular manifestations. We used RNA sequencing of patients and age- and sex-matched healthy controls to assess the effect of the variant. RNA sequencing revealed that the SMG9 c.551T > C variant did not affect the splicing or expression level of SMG9 gene products, and allele-specific expression analysis did not provide evidence that the nonsense mRNA-induced NMD was affected. Differential gene expression analysis identified prevalent upregulation of genes in patients, including the genes SMOX, OSBP2, GPX3, and ZNF155. These findings suggest that normal SMG9 function may be involved in transcriptional regulation without affecting nonsense mRNA-induced NMD. In conclusion, we demonstrate that the SMG9 c.551T > C missense variant causes a neurodevelopmental disorder and impacts gene expression. NMD components have roles beyond aberrant mRNA degradation that are crucial for neurocognitive development.

    Analysis of differential splicing suggests different modes of short-term splicing regulation

    MOTIVATION: Alternative splicing is an important mechanism in which the regions of pre-mRNAs are differentially joined in order to form different transcript isoforms. Alternative splicing is involved in the regulation of normal physiological functions but is also linked to the development of diseases such as cancer. We analyse differential expression and splicing using RNA-sequencing time series in three different settings: overall gene expression levels, absolute transcript expression levels and relative transcript expression levels. RESULTS: Using estrogen receptor α signaling response as a model system, our Gaussian process-based test identifies genes with differential splicing and/or differentially expressed transcripts. We discover genes with consistent changes in alternative splicing independent of changes in absolute expression and genes where some transcripts change whereas others stay constant in absolute level. The results suggest classes of genes with different modes of alternative splicing regulation during the experiment. AVAILABILITY AND IMPLEMENTATION: R and Matlab codes implementing the method are available at https://github.com/PROBIC/diffsplicing. An interactive browser for viewing all model fits is available at http://users.ics.aalto.fi/hande/splicingGP/

    Gaussian Process Modelling of Genome-wide High-throughput Sequencing Time Series

    During the last decade, high-throughput sequencing (HTS) has become the mainstream technique for simultaneously studying an enormous number of genetic features present in the genome, transcriptome, or epigenome of an organism. Besides the static experiments which compare genetic features between two or more distinct biological conditions, time series experiments which monitor genetic features over time provide valuable information about the dynamics of complex mechanisms in various biological processes. However, analysis of the currently available HTS time series data sets involves challenges, as these data sets often consist of short and irregularly sampled time series which lack sufficient biological replication. In addition, quantification of the genetic features from HTS data is inherently subject to uncertainty due to the limitations of HTS platforms such as short read lengths and varying sequencing depths. This thesis presents a Gaussian process (GP)-based approach for modelling and ranking HTS time series by taking into account the characteristics of the data sets. GPs are one of the most suitable tools for modelling sparse and irregularly sampled time series and they can capture the temporal correlations between observations at different time points via suitable covariance functions. On the other hand, naive application of GP modelling may suffer from over-fitting, leading to an increased number of false positives if the characteristics of the data are not taken into account. In this thesis, this problem has been mitigated by regularizing the models by introducing bounds to the hyperparameter values of the GP prior. Firstly, the range of the values of length-scale parameters has been restricted to values compatible with the spacing of the sampled time points. Secondly, application-dependent variance models have been developed to infer the uncertainty levels on the observations, which have then been incorporated into the GP models as lower bounds for the noise variance. Regularizing the GP models by setting realistic bounds to their hyperparameters makes the GP models more robust against the uncertainty in the data without increasing the complexity of the models, and thus makes the method applicable to large genome-wide studies. The publications included in this thesis suggest a number of techniques for modelling the variance in RNA-seq and Pool-seq applications, which are the HTS techniques specifically designed to sequence RNA transcripts and pooled DNA sequences, respectively. Variance models utilize the information obtained through pre-processing stages of the data depending on, for example, the number of replicates or varying sequencing depth levels. Performance evaluation of the GP models under different experiment settings indicates that the variance incorporation into the GP models can yield a higher average precision than the naive application of GP modelling. Motivated by these results, an open-source software package, GPrank, has been implemented in R in order to enable researchers to easily apply the proposed GP-based method to their own HTS time series data sets for detecting the temporally most active genetic features.