26 research outputs found
A simple approach to ranking differentially expressed gene expression time courses through Gaussian process regression.
BACKGROUND: The analysis of gene expression from time series underpins many biological studies. Two basic forms of analysis recur for data of this type: removing inactive (quiet) genes from the study and determining which genes are differentially expressed. Often these analysis stages are applied disregarding the fact that the data is drawn from a time series. In this paper we propose a simple model, based on a Gaussian process, that accounts for the underlying temporal nature of the data. RESULTS: We review Gaussian process (GP) regression for estimating the continuous trajectories underlying gene expression time series. We present a simple approach which can be used to filter quiet genes or, for time series in the form of expression ratios, to quantify differential expression. We assess via ROC curves the rankings produced by our regression framework and compare them to a recently proposed hierarchical Bayesian model for the analysis of gene expression time series (BATS). We compare on both simulated and experimental data, showing that the proposed approach considerably outperforms the current state of the art. CONCLUSIONS: Gaussian processes offer an attractive trade-off between efficiency and usability for the analysis of microarray time series. The Gaussian process framework offers a natural way of handling biological replicates and missing values and provides confidence intervals along the estimated curves of gene expression. Therefore, we believe Gaussian processes should be a standard tool in the analysis of gene expression time series.
SpatialDE: identification of spatially variable genes.
Technological advances have made it possible to measure spatially resolved gene expression at high throughput. However, methods to analyze these data are not established. Here we describe SpatialDE, a statistical test to identify genes with spatial patterns of expression variation from multiplexed imaging or spatial RNA-sequencing data. SpatialDE also implements 'automatic expression histology', a spatial gene-clustering approach that enables expression-based tissue histology.
GPfit: An R package for Gaussian Process Model Fitting using a New Optimization Algorithm
Gaussian process (GP) models are commonly used statistical metamodels for
emulating expensive computer simulators. Fitting a GP model can be numerically
unstable if any pair of design points in the input space are close together.
Ranjan, Haynes, and Karsten (2011) proposed a computationally stable approach
for fitting GP models to deterministic computer simulators. They used a genetic
algorithm based approach that is robust but computationally intensive for
maximizing the likelihood. This paper implements a slightly modified version of
the model proposed by Ranjan et al. (2011), as the new R package GPfit. A novel
parameterization of the spatial correlation function and a new multi-start
gradient based optimization algorithm yield optimization that is robust and
typically faster than the genetic algorithm based approach. We present two
examples with R codes to illustrate the usage of the main functions in GPfit.
Several test functions are used for performance comparison with a popular R
package mlegp. GPfit is free software distributed under the general
public license, as part of the R software project (R Development Core Team
2012).
Comment: 20 pages, 17 image
A stochastic model dissects cell states in biological transition processes
Many biological processes, including differentiation, reprogramming, and disease transformations, involve transitions of cells through distinct states. Direct, unbiased investigation of cell states and their transitions is challenging due to several factors, including limitations of single-cell assays. Here we present a stochastic model of cellular transitions that allows underlying single-cell information, including cell-state-specific parameters and rates governing transitions between states, to be estimated from genome-wide, population-averaged time-course data. The key novelty of our approach lies in specifying latent stochastic models at the single-cell level, and then aggregating these models to give a likelihood that links parameters at the single-cell level to observables at the population level. We apply our approach in the context of reprogramming to pluripotency. This yields new insights, including profiles of two intermediate cell states, that are supported by independent single-cell studies. Our model provides a general conceptual framework for the study of cell transitions, including epigenetic transformations.
Increasing Power by Sharing Information from Genetic Background and Treatment in Clustering of Gene Expression Time Series
Clustering of gene expression time series gives insight into which genes may be co-regulated, allowing us to discern the activity of pathways in a given microarray experiment. Of particular interest is how a given group of genes varies with different conditions or genetic background. This paper develops a new clustering method that allows each cluster to be parameterised according to whether the behaviour of the genes across conditions is correlated or anti-correlated. By specifying the correlation between such genes, more information is gained within the cluster about how the genes interrelate. Amyotrophic lateral sclerosis (ALS) is an irreversible neurodegenerative disorder that kills the motor neurons and results in death within 2 to 3 years of symptom onset. The speed of progression is heterogeneous across patients, with significant variability. SOD1G93A transgenic mice from different backgrounds (129Sv and C57) showed consistent phenotypic differences in disease progression. A hierarchy of Gaussian processes is used to model condition-specific and gene-specific temporal covariances. This study identified significant gene expression profiles and clusters of associated or co-regulated genes from four groups of data (SOD1G93A and Ntg from 129Sv and C57 backgrounds). Our study shows the effectiveness of sharing information between replicates and different model conditions when modelling gene expression time series. Further gene enrichment score analysis and ontology pathway analysis of selected clusters for a particular group may lead toward identifying features underlying the differential speed of disease progression.
Emulating dynamic non-linear simulators using Gaussian processes
The dynamic emulation of non-linear deterministic computer codes where the
output is a time series, possibly multivariate, is examined. Such computer
models simulate the evolution of some real-world phenomenon over time, for
example models of the climate or the functioning of the human brain. The models
we are interested in are highly non-linear and exhibit tipping points,
bifurcations and chaotic behaviour. However, each simulation run could be too
time-consuming to perform analyses that require many runs, including
quantifying the variation in model output with respect to changes in the
inputs. Therefore, Gaussian process emulators are used to approximate the
output of the code. To do this, the flow map of the system under study is
emulated over a short time period. Then, it is used in an iterative way to
predict the whole time series. A number of ways are proposed to take into
account the uncertainty of inputs to the emulators, after fixed initial
conditions, and the correlation between them through the time series. The
methodology is illustrated with two examples: the highly non-linear dynamical
systems described by the Lorenz and Van der Pol equations. In both cases, the
predictive performance is relatively high and the measure of uncertainty
provided by the method reflects the extent of predictability in each system.
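The iterative use of a one-step flow map described in this abstract can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: it replaces the Lorenz and Van der Pol systems with a one-dimensional logistic ODE (dx/dt = x(1 - x)), uses a zero-mean GP with a fixed squared-exponential kernel, and propagates only the posterior mean, so the treatment of input uncertainty mentioned in the abstract is left out. All function names and parameter values are hypothetical.

```python
import numpy as np

def rbf(a, b, lengthscale=0.5):
    """Squared-exponential kernel between two 1-D arrays (assumed fixed lengthscale)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def exact_flow(x, dt=0.2):
    """Exact one-step flow map of the toy ODE dx/dt = x (1 - x)."""
    g = np.exp(dt)
    return x * g / (1.0 + x * (g - 1.0))

def fit_emulator(x_train, y_train, jitter=1e-8):
    """Return the GP posterior-mean function trained on (state, next state) pairs."""
    K = rbf(x_train, x_train) + jitter * np.eye(len(x_train))
    weights = np.linalg.solve(K, y_train)
    return lambda x: rbf(x, x_train) @ weights

def iterate(step, x0, n_steps):
    """Apply a one-step map repeatedly to build a whole trajectory."""
    x = np.array([x0], dtype=float)
    trajectory = [x[0]]
    for _ in range(n_steps):
        x = step(x)
        trajectory.append(x[0])
    return np.array(trajectory)

# Train on a small design over the state space, then predict the full series.
x_train = np.linspace(0.0, 1.2, 13)
emulator = fit_emulator(x_train, exact_flow(x_train))
emulated = iterate(emulator, 0.1, 30)
reference = iterate(exact_flow, 0.1, 30)
print("largest trajectory error:", np.max(np.abs(emulated - reference)))
```

The emulator only ever learns the short-time map; the long trajectory comes from feeding each prediction back in as the next input, which is also why errors and uncertainty can accumulate along the series.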
GPrank: an R package for detecting dynamic elements from genome-wide time series
Abstract
Background
Genome-wide high-throughput sequencing (HTS) time series experiments are a powerful tool for monitoring various genomic elements over time. They can be used to monitor, for example, gene or transcript expression with RNA sequencing (RNA-seq), DNA methylation levels with bisulfite sequencing (BS-seq), or abundances of genetic variants in populations with pooled sequencing (Pool-seq). However, because of high experimental costs, the time series data sets often consist of a very limited number of time points with very few or no biological replicates, posing challenges in the data analysis.
Results
Here we present the GPrank R package for modelling genome-wide time series by incorporating variance information obtained during pre-processing of the HTS data using probabilistic quantification methods or from a beta-binomial model using sequencing depth. GPrank is well-suited for analysing both short and irregularly sampled time series. It is based on modelling each time series by two Gaussian process (GP) models, namely, time-dependent and time-independent GP models, and comparing the evidence provided by data under two models by computing their Bayes factor (BF). Genomic elements are then ranked by their BFs, and temporally most dynamic elements can be identified.
Conclusions
Incorporating the variance information helps GPrank avoid false positives without compromising computational efficiency. Fitted models can be easily further explored in a browser. Detection and visualisation of temporally most active dynamic elements in the genome can provide a good starting point for further downstream analyses for increasing our understanding of the studied processes.
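The ranking idea in the Results paragraph above, comparing a time-dependent and a time-independent GP model through their Bayes factor, can be sketched in Python as below. This is not the GPrank API (GPrank is an R package); it is a minimal sketch that assumes a squared-exponential kernel, fixed rather than optimised hyperparameters, centred data, and known observation variances standing in for the variance information from pre-processing. All names and values are hypothetical.

```python
import numpy as np

def rbf_kernel(t, signal_var, lengthscale):
    """Squared-exponential covariance over the time points t."""
    d = t[:, None] - t[None, :]
    return signal_var * np.exp(-0.5 * (d / lengthscale) ** 2)

def log_marginal_likelihood(y, K):
    """Log evidence of y under a zero-mean Gaussian with covariance K."""
    n = len(y)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * (y @ alpha)
            - np.sum(np.log(np.diag(L)))
            - 0.5 * n * np.log(2.0 * np.pi))

def log_bayes_factor(t, y, noise_var, signal_var=1.0, lengthscale=0.3):
    """log BF of the time-dependent model against the time-independent one.

    noise_var may be a scalar or one variance per observation.
    """
    y = y - y.mean()
    noise = np.eye(len(y)) * np.asarray(noise_var)
    dependent = rbf_kernel(t, signal_var, lengthscale) + noise
    independent = noise  # observations vary around the mean by noise alone
    return (log_marginal_likelihood(y, dependent)
            - log_marginal_likelihood(y, independent))

# Two toy series: one with a clear temporal pattern, one essentially flat.
t = np.linspace(0.0, 1.0, 8)
series = {
    "dynamic": np.sin(2.0 * np.pi * t),
    "flat": 0.05 * np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float),
}
scores = {name: log_bayes_factor(t, y, noise_var=0.01) for name, y in series.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

A positive log Bayes factor favours the time-dependent model, so sorting by this score places the temporally most dynamic elements first.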
Gaussian process hyper-parameter estimation using parallel asymptotically independent Markov sampling
Gaussian process emulators of computationally expensive computer codes
provide fast statistical approximations to model physical processes. The
training of these surrogates depends on the set of design points chosen to run
the simulator. Due to computational cost, such a training set is bound to be
limited and quantifying the resulting uncertainty in the hyper-parameters of
the emulator by uni-modal distributions is likely to induce bias. In order to
quantify this uncertainty, this paper proposes a computationally efficient
sampler based on an extension of Asymptotically Independent Markov Sampling, a
recently developed algorithm for Bayesian inference. Structural uncertainty of
the emulator is obtained as a by-product of the Bayesian treatment of the
hyper-parameters. Additionally, the user can choose to perform stochastic
optimisation to sample from a neighbourhood of the Maximum a Posteriori
estimate, even in the presence of multimodality. Model uncertainty is also
acknowledged through numerical stabilisation measures by including a nugget
term in the formulation of the probability model. The efficiency of the
proposed sampler is illustrated in examples where multi-modal distributions are
encountered. For the purpose of reproducibility, further development, and use
in other applications, the code used to generate the examples is freely
available for download at https://github.com/agarbuno/paims_codes
Comment: Computational Statistics & Data Analysis, Volume 103, November 201
GPfit: An R Package for Fitting a Gaussian Process Model to Deterministic Simulator Outputs
Gaussian process (GP) models are commonly used statistical metamodels for emulating expensive computer simulators. Fitting a GP model can be numerically unstable if any pair of design points in the input space are close together. Ranjan, Haynes, and Karsten (2011) proposed a computationally stable approach for fitting GP models to deterministic computer simulators. They used a genetic algorithm based approach that is robust but computationally intensive for maximizing the likelihood. This paper implements a slightly modified version of the model proposed by Ranjan et al. (2011) in the R package GPfit. A novel parameterization of the spatial correlation function and a clustering based multi-start gradient based optimization algorithm yield robust optimization that is typically faster than the genetic algorithm based approach. We present two examples with R codes to illustrate the usage of the main functions in GPfit. Several test functions are used for performance comparison with the popular R package mlegp. We also use GPfit for a real application, i.e., for emulating the tidal kinetic energy model for the Bay of Fundy, Nova Scotia, Canada. GPfit is free software distributed under the General Public License and available from the Comprehensive R Archive Network.
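The numerical instability mentioned in this abstract can be illustrated with a short Python sketch. It is not GPfit code (GPfit is an R package) and does not reproduce the nugget rule of Ranjan et al. (2011); it assumes a Gaussian correlation function with an arbitrary parameter theta and an arbitrary small nugget delta, and only shows how two nearly coincident design points inflate the condition number of the correlation matrix and how a nugget on the diagonal reduces it.

```python
import numpy as np

def gauss_corr(x, theta):
    """Gaussian correlation matrix for 1-D design points x (theta is assumed)."""
    d = x[:, None] - x[None, :]
    return np.exp(-theta * d ** 2)

def condition_numbers(x, theta, delta):
    """Condition number of R with and without a nugget delta on the diagonal."""
    R = gauss_corr(x, theta)
    plain = np.linalg.cond(R)
    with_nugget = np.linalg.cond(R + delta * np.eye(len(x)))
    return plain, with_nugget

# A design in which two points are only 1e-6 apart.
design = np.array([0.0, 0.25, 0.5, 0.5 + 1e-6, 1.0])
cond_plain, cond_nugget = condition_numbers(design, theta=5.0, delta=1e-8)
print("without nugget:", cond_plain)
print("with nugget:   ", cond_nugget)
```

The two near-duplicate points make two rows of the correlation matrix almost identical, so its smallest eigenvalue is close to zero; the nugget bounds that eigenvalue away from zero at the cost of no longer interpolating the simulator output exactly.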
Transitional annealed adaptive slice sampling for Gaussian process hyper-parameter estimation
Surrogate models have become ubiquitous in science and engineering for their capability of emulating expensive computer codes, necessary to model and investigate complex phenomena. Bayesian emulators based on Gaussian processes adequately quantify the uncertainty that results from the cost of the original simulator, and thus the inability to evaluate it on the whole input space. However, it is common in the literature that only a partial Bayesian analysis is carried out, whereby the underlying hyper-parameters are estimated via gradient-free optimization or genetic algorithms, to name a few methods. On the other hand, maximum a posteriori (MAP) estimation could discard important regions of the hyper-parameter space. In this paper, we carry out a more complete Bayesian inference, that combines Slice Sampling with some recently developed sequential Monte Carlo samplers. The resulting algorithm improves the mixing in the sampling through the delayed-rejection nature of Slice Sampling, the inclusion of an annealing scheme akin to Asymptotically Independent Markov Sampling and parallelization via transitional Markov chain Monte Carlo. Examples related to the estimation of Gaussian process hyper-parameters are presented. For the purpose of reproducibility, further development, and use in other applications, the code to generate the examples in this paper is freely available for download at http://github.com/agarbuno/ta2s2_codes