26 research outputs found

    A simple approach to ranking differentially expressed gene expression time courses through Gaussian process regression.

    BACKGROUND: The analysis of gene expression from time series underpins many biological studies. Two basic forms of analysis recur for data of this type: removing inactive (quiet) genes from the study and determining which genes are differentially expressed. Often these analysis stages are applied disregarding the fact that the data is drawn from a time series. In this paper we propose a simple model for accounting for the underlying temporal nature of the data based on a Gaussian process. RESULTS: We review Gaussian process (GP) regression for estimating the continuous trajectories underlying gene expression time series. We present a simple approach which can be used to filter quiet genes or, for time series in the form of expression ratios, to quantify differential expression. We assess via ROC curves the rankings produced by our regression framework and compare them to a recently proposed hierarchical Bayesian model for the analysis of gene expression time series (BATS). We compare on both simulated and experimental data, showing that the proposed approach considerably outperforms the current state of the art. CONCLUSIONS: Gaussian processes offer an attractive trade-off between efficiency and usability for the analysis of microarray time series. The Gaussian process framework offers a natural way of handling biological replicates and missing values and provides confidence intervals along the estimated curves of gene expression. Therefore, we believe Gaussian processes should be a standard tool in the analysis of gene expression time series.
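The point about replicates, missing values and confidence intervals can be illustrated with a minimal GP regression sketch. This is not the authors' implementation: the squared-exponential kernel, the fixed hyperparameters (`variance`, `lengthscale`, `noise_var`) and the function names are assumptions chosen for illustration, and in practice the hyperparameters would be estimated from the data.

```python
import numpy as np

def rbf(a, b, variance=1.0, lengthscale=2.0):
    """Squared-exponential kernel between two vectors of time points."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_predict(t_obs, y_obs, t_new, noise_var=0.1):
    """Posterior mean and 95% band of the latent curve at t_new.

    Replicates are just repeated entries in t_obs; a missing
    measurement is handled by leaving that time point out.
    Hyperparameters are fixed here (an assumption of this sketch).
    """
    K = rbf(t_obs, t_obs) + noise_var * np.eye(len(t_obs))
    K_s = rbf(t_new, t_obs)
    mean = K_s @ np.linalg.solve(K, y_obs)
    cov = rbf(t_new, t_new) - K_s @ np.linalg.solve(K, K_s.T)
    sd = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, mean - 1.96 * sd, mean + 1.96 * sd

# Two replicates at t = 0, 2 and 8, one at t = 4, nothing at t = 6.
t_obs = np.array([0.0, 0.0, 2.0, 2.0, 4.0, 8.0, 8.0])
y_obs = np.array([0.1, -0.1, 0.9, 1.1, 1.5, 0.2, 0.4])
mean, lower, upper = gp_predict(t_obs, y_obs, np.array([1.0, 6.0]))
```

The band at t = 6, where no measurement exists, comes out wider than the band at t = 1, which sits between two replicated time points.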

    GPfit: An R package for Gaussian Process Model Fitting using a New Optimization Algorithm

    Gaussian process (GP) models are commonly used statistical metamodels for emulating expensive computer simulators. Fitting a GP model can be numerically unstable if any pair of design points in the input space are close together. Ranjan, Haynes, and Karsten (2011) proposed a computationally stable approach for fitting GP models to deterministic computer simulators. They used a genetic algorithm based approach that is robust but computationally intensive for maximizing the likelihood. This paper implements a slightly modified version of the model proposed by Ranjan et al. (2011), as the new R package GPfit. A novel parameterization of the spatial correlation function and a new multi-start gradient based optimization algorithm yield optimization that is robust and typically faster than the genetic algorithm based approach. We present two examples with R code to illustrate the usage of the main functions in GPfit. Several test functions are used for performance comparison with the popular R package mlegp. GPfit is free software distributed under the General Public License, as part of the R software project (R Development Core Team 2012).
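The instability caused by nearly coincident design points can be seen directly in the condition number of the correlation matrix. The sketch below is an illustration only: it does not call GPfit, the Gaussian correlation with `theta = 1` is an assumed form, and the nugget value `1e-6` is an arbitrary choice rather than the rule used by the package.

```python
import numpy as np

def gaussian_correlation(x, theta=1.0):
    """Gaussian correlation matrix R_ij = exp(-theta * (x_i - x_j)^2)."""
    d = x[:, None] - x[None, :]
    return np.exp(-theta * d ** 2)

def condition_number(x, nugget=0.0):
    """Condition number of R + nugget * I for a 1-D design x."""
    R = gaussian_correlation(x) + nugget * np.eye(len(x))
    return np.linalg.cond(R)

# Two of the four design points are almost identical.
design = np.array([0.0, 0.5, 0.5 + 1e-6, 1.0])
cond_plain = condition_number(design)                 # nearly singular
cond_nugget = condition_number(design, nugget=1e-6)   # far better conditioned
```

Adding a small nugget to the diagonal is one common remedy; the price is that the fitted surface no longer interpolates a deterministic simulator exactly.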

    A stochastic model dissects cell states in biological transition processes

    Many biological processes, including differentiation, reprogramming, and disease transformations, involve transitions of cells through distinct states. Direct, unbiased investigation of cell states and their transitions is challenging due to several factors, including limitations of single-cell assays. Here we present a stochastic model of cellular transitions that allows underlying single-cell information, including cell-state-specific parameters and rates governing transitions between states, to be estimated from genome-wide, population-averaged time-course data. The key novelty of our approach lies in specifying latent stochastic models at the single-cell level, and then aggregating these models to give a likelihood that links parameters at the single-cell level to observables at the population level. We apply our approach in the context of reprogramming to pluripotency. This yields new insights, including profiles of two intermediate cell states, that are supported by independent single-cell studies. Our model provides a general conceptual framework for the study of cell transitions, including epigenetic transformations

    Increasing Power by Sharing Information from Genetic Background and Treatment in Clustering of Gene Expression Time Series

    Clustering of gene expression time series gives insight into which genes may be co-regulated, allowing us to discern the activity of pathways in a given microarray experiment. Of particular interest is how a given group of genes varies with different conditions or genetic background. This paper develops a new clustering method that allows each cluster to be parameterised according to whether the behaviour of the genes across conditions is correlated or anti-correlated. By specifying the correlation between such genes, more information is gained within the cluster about how the genes interrelate. Amyotrophic lateral sclerosis (ALS) is an irreversible neurodegenerative disorder that kills the motor neurons and results in death within 2 to 3 years of symptom onset. The speed of progression is heterogeneous across patients, with significant variability. SOD1G93A transgenic mice from different backgrounds (129Sv and C57) showed consistent phenotypic differences in disease progression. A hierarchy of Gaussian processes is used to model condition-specific and gene-specific temporal covariances. This study identified significant gene expression profiles and clusters of associated or co-regulated genes from four groups of data (SOD1G93A and Ntg from 129Sv and C57 backgrounds). Our study shows the effectiveness of sharing information between replicates and different model conditions when modelling gene expression time series. Further gene enrichment score analysis and ontology pathway analysis of selected clusters for a particular group may help identify features underlying the differential speed of disease progression.
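A hierarchy of Gaussian processes of this kind is often written as a shared gene-level function plus a condition-specific deviation, which gives a covariance with a simple block structure. The sketch below assumes that additive form; the kernel, the variances `gene_var` and `cond_var`, and the function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def rbf(a, b, variance, lengthscale):
    """Squared-exponential kernel between two vectors of time points."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def hierarchical_cov(t, condition, gene_var=1.0, cond_var=0.5, lengthscale=2.0):
    """Covariance of f_gene(t) + f_condition(t) over all observations.

    The gene-level term couples every pair of observations; the
    condition-level term only couples observations that share a
    condition label (assumed additive structure).
    """
    shared = rbf(t, t, gene_var, lengthscale)
    same_condition = (condition[:, None] == condition[None, :]).astype(float)
    specific = same_condition * rbf(t, t, cond_var, lengthscale)
    return shared + specific

# Three time points measured under two conditions (labels 0 and 1).
t = np.array([0.0, 1.0, 2.0, 0.0, 1.0, 2.0])
condition = np.array([0, 0, 0, 1, 1, 1])
K = hierarchical_cov(t, condition)
```

Observations under the same condition are more strongly correlated than observations across conditions, which is how information is shared without forcing the conditions to be identical.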

    Emulating dynamic non-linear simulators using Gaussian processes

    The dynamic emulation of non-linear deterministic computer codes where the output is a time series, possibly multivariate, is examined. Such computer models simulate the evolution of some real-world phenomenon over time, for example models of the climate or the functioning of the human brain. The models we are interested in are highly non-linear and exhibit tipping points, bifurcations and chaotic behaviour. However, each simulation run could be too time-consuming to perform analyses that require many runs, including quantifying the variation in model output with respect to changes in the inputs. Therefore, Gaussian process emulators are used to approximate the output of the code. To do this, the flow map of the system under study is emulated over a short time period. Then, it is used in an iterative way to predict the whole time series. A number of ways are proposed to take into account the uncertainty of inputs to the emulators, after fixed initial conditions, and the correlation between them through the time series. The methodology is illustrated with two examples: the highly non-linear dynamical systems described by the Lorenz and Van der Pol equations. In both cases, the predictive performance is relatively high and the measure of uncertainty provided by the method reflects the extent of predictability in each system
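The idea of emulating the flow map over a short time step and then iterating it can be sketched on a toy problem. Everything here is an assumption for illustration: the one-dimensional logistic equation stands in for the simulator, the emulator uses only the GP posterior mean with fixed hyperparameters, and the propagation of input uncertainty that the paper addresses is left out.

```python
import numpy as np

def rbf(a, b, lengthscale=0.3):
    """Squared-exponential kernel with unit variance."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def simulator_step(x, dt=0.5):
    """Toy simulator: exact flow map of dx/dt = x (1 - x) over one step."""
    g = np.exp(dt)
    return x * g / (1.0 + x * (g - 1.0))

# Train the emulator on a small design of one-step simulator runs.
x_train = np.linspace(0.0, 1.2, 13)
y_train = simulator_step(x_train)
K = rbf(x_train, x_train) + 1e-6 * np.eye(len(x_train))  # jitter for stability
weights = np.linalg.solve(K, y_train)

def emulator_step(x):
    """GP posterior mean of the one-step flow map at state x."""
    return float((rbf(np.array([x]), x_train) @ weights)[0])

def rollout(step, x0, n_steps):
    """Apply a one-step map repeatedly to build the whole time series."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(step(xs[-1]))
    return np.array(xs)

truth = rollout(simulator_step, 0.1, 20)
emulated = rollout(emulator_step, 0.1, 20)
```

Because each prediction feeds the next one, one-step errors accumulate along the trajectory, which is why the treatment of input uncertainty matters for chaotic systems.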

    GPrank: an R package for detecting dynamic elements from genome-wide time series

    Background: Genome-wide high-throughput sequencing (HTS) time series experiments are a powerful tool for monitoring various genomic elements over time. They can be used to monitor, for example, gene or transcript expression with RNA sequencing (RNA-seq), DNA methylation levels with bisulfite sequencing (BS-seq), or abundances of genetic variants in populations with pooled sequencing (Pool-seq). However, because of high experimental costs, the time series data sets often consist of a very limited number of time points with very few or no biological replicates, posing challenges in the data analysis. Results: Here we present the GPrank R package for modelling genome-wide time series by incorporating variance information obtained during pre-processing of the HTS data using probabilistic quantification methods or from a beta-binomial model using sequencing depth. GPrank is well-suited for analysing both short and irregularly sampled time series. It is based on modelling each time series by two Gaussian process (GP) models, namely, time-dependent and time-independent GP models, and comparing the evidence provided by data under two models by computing their Bayes factor (BF). Genomic elements are then ranked by their BFs, and temporally most dynamic elements can be identified. Conclusions: Incorporating the variance information helps GPrank avoid false positives without compromising computational efficiency. Fitted models can be easily further explored in a browser. Detection and visualisation of temporally most active dynamic elements in the genome can provide a good starting point for further downstream analyses for increasing our understanding of the studied processes.
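The comparison of a time-dependent and a time-independent model by a Bayes factor can be sketched as a difference of two Gaussian log evidences. This does not use the GPrank package: the kernel, the fixed `signal_var` and `lengthscale`, and the choice of a purely diagonal covariance for the time-independent model are assumptions of this sketch. The per-observation variances `obs_var` play the role of the variance information obtained during pre-processing.

```python
import numpy as np

def rbf(t, variance, lengthscale):
    """Squared-exponential kernel over one vector of time points."""
    d = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def log_evidence(y, K):
    """Log density of y under a zero-mean Gaussian with covariance K."""
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
            - 0.5 * len(y) * np.log(2.0 * np.pi))

def log_bayes_factor(t, y, obs_var, signal_var=1.0, lengthscale=2.0):
    """log BF of a time-dependent model against a time-independent one.

    obs_var holds a known variance for every observation and is added
    to the diagonal of both models (hyperparameters fixed by assumption).
    """
    K_dependent = rbf(t, signal_var, lengthscale) + np.diag(obs_var)
    K_independent = np.diag(signal_var + obs_var)
    return log_evidence(y, K_dependent) - log_evidence(y, K_independent)

t = np.linspace(0.0, 10.0, 12)
obs_var = np.linspace(0.08, 0.12, 12)
smooth = np.exp(-0.5 * ((t - t[5]) / 2.0) ** 2)  # slowly varying profile
rough = np.array([1.0, -1.0] * 6)                # no temporal structure
scores = {"smooth": log_bayes_factor(t, smooth, obs_var),
          "rough": log_bayes_factor(t, rough, obs_var)}
```

Sorting elements by this score puts the smoothly varying profile ahead of the one that only fluctuates from point to point.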

    Gaussian process hyper-parameter estimation using parallel asymptotically independent Markov sampling

    Gaussian process emulators of computationally expensive computer codes provide fast statistical approximations to model physical processes. The training of these surrogates depends on the set of design points chosen to run the simulator. Due to computational cost, such a training set is bound to be limited, and quantifying the resulting uncertainty in the hyper-parameters of the emulator by uni-modal distributions is likely to induce bias. In order to quantify this uncertainty, this paper proposes a computationally efficient sampler based on an extension of Asymptotically Independent Markov Sampling, a recently developed algorithm for Bayesian inference. Structural uncertainty of the emulator is obtained as a by-product of the Bayesian treatment of the hyper-parameters. Additionally, the user can choose to perform stochastic optimisation to sample from a neighbourhood of the Maximum a Posteriori estimate, even in the presence of multimodality. Model uncertainty is also acknowledged through numerical stabilisation measures by including a nugget term in the formulation of the probability model. The efficiency of the proposed sampler is illustrated in examples where multi-modal distributions are encountered. For the purpose of reproducibility, further development, and use in other applications, the code used to generate the examples is freely available for download at https://github.com/agarbuno/paims_codes. Comment: Computational Statistics & Data Analysis, Volume 103, November 201

    GPfit: An R Package for Fitting a Gaussian Process Model to Deterministic Simulator Outputs

    Gaussian process (GP) models are commonly used statistical metamodels for emulating expensive computer simulators. Fitting a GP model can be numerically unstable if any pair of design points in the input space are close together. Ranjan, Haynes, and Karsten (2011) proposed a computationally stable approach for fitting GP models to deterministic computer simulators. They used a genetic algorithm based approach that is robust but computationally intensive for maximizing the likelihood. This paper implements a slightly modified version of the model proposed by Ranjan et al. (2011) in the R package GPfit. A novel parameterization of the spatial correlation function and a clustering based multi-start gradient based optimization algorithm yield robust optimization that is typically faster than the genetic algorithm based approach. We present two examples with R codes to illustrate the usage of the main functions in GPfit. Several test functions are used for performance comparison with the popular R package mlegp. We also use GPfit for a real application, i.e., for emulating the tidal kinetic energy model for the Bay of Fundy, Nova Scotia, Canada. GPfit is free software and distributed under the General Public License and available from the Comprehensive R Archive Network.

    Transitional annealed adaptive slice sampling for Gaussian process hyper-parameter estimation

    Surrogate models have become ubiquitous in science and engineering for their capability of emulating expensive computer codes, necessary to model and investigate complex phenomena. Bayesian emulators based on Gaussian processes adequately quantify the uncertainty that results from the cost of the original simulator, and thus the inability to evaluate it on the whole input space. However, it is common in the literature that only a partial Bayesian analysis is carried out, whereby the underlying hyper-parameters are estimated via gradient-free optimization or genetic algorithms, to name a few methods. On the other hand, maximum a posteriori (MAP) estimation could discard important regions of the hyper-parameter space. In this paper, we carry out a more complete Bayesian inference, that combines Slice Sampling with some recently developed sequential Monte Carlo samplers. The resulting algorithm improves the mixing in the sampling through the delayed-rejection nature of Slice Sampling, the inclusion of an annealing scheme akin to Asymptotically Independent Markov Sampling and parallelization via transitional Markov chain Monte Carlo. Examples related to the estimation of Gaussian process hyper-parameters are presented. For the purpose of reproducibility, further development, and use in other applications, the code to generate the examples in this paper is freely available for download at http://github.com/agarbuno/ta2s2_codes
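For readers unfamiliar with the building block named above, here is a minimal univariate slice sampler with stepping-out and shrinkage. It is plain slice sampling only, not the transitional annealed adaptive algorithm of the paper; the function name, the `width` parameter and the standard normal target are assumptions for illustration.

```python
import math
import random

def slice_sample(log_density, x0, n_samples, width=1.0, rng=None):
    """Draw samples from an unnormalised 1-D log density by slice sampling."""
    rng = rng or random.Random()
    x = x0
    samples = []
    for _ in range(n_samples):
        # Pick a height uniformly under the density at the current point.
        log_y = log_density(x) + math.log(1.0 - rng.random())
        # Step out: grow an interval around x until both ends leave the slice.
        left = x - width * rng.random()
        right = left + width
        while log_density(left) > log_y:
            left -= width
        while log_density(right) > log_y:
            right += width
        # Shrink: a rejected proposal narrows the interval instead of
        # being discarded, so every iteration ends with an accepted move.
        while True:
            proposal = rng.uniform(left, right)
            if log_density(proposal) > log_y:
                x = proposal
                break
            if proposal < x:
                left = proposal
            else:
                right = proposal
        samples.append(x)
    return samples

# Standard normal target, up to an additive constant in the log density.
draws = slice_sample(lambda x: -0.5 * x * x, 0.0, 5000,
                     width=2.0, rng=random.Random(0))
```

For GP hyper-parameters, `log_density` would be the log posterior of one hyper-parameter with the others held fixed, which is where multimodality makes a single chain of this kind unreliable.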