5 research outputs found
Discovering Graphical Granger Causality Using the Truncating Lasso Penalty
Components of biological systems interact with each other in order to carry
out vital cell functions. Such information can be used to improve estimation
and inference, and to obtain better insights into the underlying cellular
mechanisms. Discovering regulatory interactions among genes is therefore an
important problem in systems biology. Whole-genome expression data over time
provides an opportunity to determine how the expression levels of genes are
affected by changes in transcription levels of other genes, and can therefore
be used to discover regulatory interactions among genes.
In this paper, we propose a novel penalization method, called truncating
lasso, for estimation of causal relationships from time-course gene expression
data. The proposed penalty can correctly determine the order of the underlying
time series, and improves the performance of the lasso-type estimators.
Moreover, the resulting estimate provides information on the time lag between
activation of transcription factors and their effects on regulated genes. We
provide an efficient algorithm for estimation of model parameters, and show
that the proposed method can consistently discover causal relationships in the
large p, small n setting. The proposed model performs favorably
on simulated as well as real data examples. The proposed
truncating lasso method is implemented in the R-package grangerTlasso and is
available at http://www.stat.lsa.umich.edu/~shojaie.
Comment: 12 pages, 4 figures, 1 table
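To make the idea concrete, here is a minimal sketch of graphical Granger causality with an ordinary lasso penalty. It is not the truncating lasso itself and does not use the grangerTlasso package. Each variable is regressed on lagged values of all variables, and nonzero coefficients are read as candidate Granger-causal edges. The function names, the coordinate-descent solver, the toy three-variable system, and the penalty value are all assumptions chosen for illustration.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, sweeps=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, with beta starting at zero
    col_ms = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_ms[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (column j added back).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_ms[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def lagged_design(series, target, max_lag):
    """Build rows [x(t-1), ..., x(t-max_lag)] and responses x_target(t)."""
    X, y = [], []
    for t in range(max_lag, len(series)):
        row = []
        for lag in range(1, max_lag + 1):
            row.extend(series[t - lag])
        X.append(row)
        y.append(series[t][target])
    return X, y


def granger_parents(series, target, max_lag, lam):
    """Return sorted (variable, lag) pairs with nonzero lasso coefficients."""
    X, y = lagged_design(series, target, max_lag)
    beta = lasso_cd(X, y, lam)
    p = len(series[0])
    return sorted({(j % p, j // p + 1) for j, b in enumerate(beta) if b != 0.0})


def simulate_toy(T, seed=0):
    """Hypothetical system: variable 1 follows variable 0 with a lag of one step."""
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(3)]]
    for t in range(1, T):
        x0 = rng.gauss(0, 1)
        x1 = 0.9 * series[t - 1][0] + 0.1 * rng.gauss(0, 1)
        x2 = rng.gauss(0, 1)
        series.append([x0, x1, x2])
    return series


if __name__ == "__main__":
    toy = simulate_toy(300, seed=0)
    print(granger_parents(toy, target=1, max_lag=2, lam=0.1))
```

The returned pairs carry the lag as well as the parent variable, which mirrors the abstract's point that the estimate is informative about time lags. A plain lasso treats all lags alike; the truncating penalty described above additionally determines the order of the time series, which this sketch does not attempt.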
Inferring cluster-based networks from differently stimulated multiple time-course gene expression data
Motivation: Clustering and gene network inference often help to predict the biological functions of gene subsets. Recently, researchers have accumulated a large amount of time-course transcriptome data collected under different treatment conditions to understand the physiological states of cells in response to extracellular stimuli and to identify drug-responsive genes. Although a variety of statistical methods for clustering and inferring gene networks from expression profiles have been proposed, most of these are not tailored to simultaneously treat expression data collected under multiple stimulation conditions.
Inferring Regulatory Networks by Combining Perturbation Screens and Steady State Gene Expression Profiles
Reconstructing transcriptional regulatory networks is an important task in
functional genomics. Data obtained from experiments that perturb genes by
knockouts or RNA interference contain useful information for addressing this
reconstruction problem. However, such data can be limited in size and/or are
expensive to acquire. On the other hand, observational data of the organism in
steady state (e.g. wild-type) are more readily available, but their
informational content is inadequate for the task at hand. We develop a
computational approach to appropriately utilize both data sources for
estimating a regulatory network. The proposed approach is based on a three-step
algorithm to estimate the underlying directed but cyclic network, that uses as
input both perturbation screens and steady state gene expression data. In the
first step, the algorithm determines causal orderings of the genes that are
consistent with the perturbation data, by combining an exhaustive search method
with a fast heuristic that in turn couples a Monte Carlo technique with a fast
search algorithm. In the second step, for each obtained causal ordering, a
regulatory network is estimated using a penalized likelihood based method,
while in the third step a consensus network is constructed from the highest
scored ones. Extensive computational experiments show that the algorithm
performs well in reconstructing the underlying network and clearly outperforms
competing approaches that rely only on a single data source. Further, it is
established that the algorithm produces a consistent estimate of the regulatory
network.
Comment: 24 pages, 4 figures, 6 tables
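The third step can be illustrated with a small sketch. The abstract does not say how the consensus network is formed, so the rule used here (keep an edge if it appears in at least a given fraction of the top-scoring networks) is an assumption, as are the function name, the parameters top_k and min_fraction, and the convention that a higher score is better.

```python
from collections import Counter


def consensus_network(scored_networks, top_k, min_fraction=0.5):
    """Combine the top_k highest-scoring networks by edge voting.

    scored_networks: list of (score, edges) pairs, where edges is a set of
    (parent, child) tuples. Higher scores are treated as better (assumption).
    """
    ranked = sorted(scored_networks, key=lambda item: item[0], reverse=True)[:top_k]
    if not ranked:
        return set()
    # Count how many of the retained networks contain each directed edge.
    counts = Counter(edge for _, edges in ranked for edge in edges)
    needed = min_fraction * len(ranked)
    return {edge for edge, count in counts.items() if count >= needed}


# Hypothetical candidate networks, one per causal ordering.
candidates = [
    (10.0, {("A", "B"), ("B", "C")}),
    (9.5, {("A", "B"), ("A", "C")}),
    (9.0, {("A", "B"), ("B", "C")}),
    (2.0, {("C", "A")}),
]
consensus = consensus_network(candidates, top_k=3, min_fraction=0.5)
```

With these made-up inputs the lowest-scoring network is discarded, and only edges present in at least two of the remaining three networks are kept.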
Methods for Reconstructing Networks with Incomplete Information.
Network representations of complex systems are widespread and reconstructing unknown networks from data has been intensively researched in statistical and scientific communities more broadly. Two challenges in network reconstruction problems include having insufficient data to illuminate the full structure of the network and needing to combine information from different data sources. Addressing these challenges, this thesis contributes methodology for network reconstruction in three respects.
First, we consider sequentially choosing interventions to discover structure in directed networks, focusing on learning a partial order over the nodes. This focus leads to a new model for intervention data under which nodal variables depend on the lengths of paths separating them from intervention targets rather than on parent sets. Taking a Bayesian approach, we present partial-order-based priors and develop a novel Markov chain Monte Carlo (MCMC) method for computing posterior expectations over directed acyclic graphs. The utility of the MCMC approach comes from designing new proposals for the Metropolis algorithm that move locally among partial orders while independently sampling graphs from each partial order. The resulting Markov chains mix rapidly and are ergodic. We also adapt an existing strategy for active structure learning, develop an efficient Monte Carlo procedure for estimating the resulting decision function, and evaluate the proposed methods numerically using simulations and benchmark datasets.
We next study penalized likelihood methods using incomplete order information such as that arising from intervention data. To make the notion of incomplete information precise, we introduce and formally define incomplete partial orders, which subsume the important special case of a known total ordering of the nodes. This special case lies along an information lattice, and we study the reconstruction performance of penalized likelihood methods at different points along this lattice.
Finally, we present a method for ranking a network's potential edges using time-course data. The novelty is our development of a nonparametric gradient-matching procedure and a related summary statistic for measuring the strength of relationships among components in dynamic systems. Simulation studies demonstrate that, given sufficient signal, using this procedure to move from linear to additive approximations leads to improved rankings of potential edges.
PhD, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/113316/1/jbhender_1.pd
Modeling and Estimation of High-dimensional Vector Autoregressions.
Vector Autoregression (VAR) represents a popular class of time series models in applied macroeconomics and finance, widely used for structural analysis and simultaneous forecasting of a number of temporally observed variables. Over the years it has gained popularity in the fields of control theory, statistics, economics, finance, genetics and neuroscience. In addition to the "curse of dimensionality" introduced by a quadratically growing dimension of the parameter space, VAR estimation poses considerable challenges due to the temporal and cross-sectional dependence in the data.
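For concreteness, a VAR model of order d on p variables can be written as below. The symbols p, d, A_k, and \varepsilon_t are notation assumed here for illustration and are not taken from the thesis.

```latex
X_t = A_1 X_{t-1} + A_2 X_{t-2} + \cdots + A_d X_{t-d} + \varepsilon_t,
\qquad A_k \in \mathbb{R}^{p \times p}, \quad k = 1, \dots, d
```

The transition matrices hold d p^2 coefficients in total, so the parameter count grows quadratically in p. With p = 100 series and d = 2 lags, for example, there are 20,000 coefficients to estimate.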
In the first part of this thesis, we discuss modeling and estimation of high-dimensional VAR from short panels of time series, with applications to reconstruction of gene regulatory networks from time-course gene expression data. We investigate adaptively thresholded lasso regularized estimation of VAR models and propose a thresholded group lasso regularization framework to incorporate a priori available pathway information in the model. The properties of the proposed methods are assessed both theoretically and via numerical experiments. The study is illustrated on two motivating examples coming from functional genomics and financial econometrics.
The second part of this thesis focuses on modeling and estimation of high-dimensional VAR in the traditional time series setting, where one observes a single replicate of a long, stationary time series. We investigate the theoretical properties of l1-regularized and thresholded estimators in high-dimensional VAR, stochastic regression and covariance estimation problems in a non-asymptotic framework. We establish consistency of the estimators under high-dimensional scaling and propose a measure of stability that provides insight into the effect of temporal and cross-sectional dependence on the accuracy of the regularized estimates. We also propose a low-rank plus sparse modeling strategy of high-dimensional VAR in the presence of latent variables. We study the theoretical properties of the proposed estimator in a non-asymptotic framework, establish its estimation consistency under high-dimensional scaling and compare its performance with existing methods via extensive simulation studies.
PhD, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/109029/1/sumbose_1.pd