Network estimation in State Space Model with L1-regularization constraint
Biological networks have arisen as an attractive paradigm of genomic science ever since the introduction of large-scale genomic technologies, which carried the promise of elucidating relationships in functional genomics. Microarray technologies coupled with appropriate mathematical or statistical models have made it possible to identify dynamic regulatory networks or to measure the time course of the expression levels of many genes simultaneously. However, one key limitation lies in the high-dimensional nature of such data, coupled with the fact that these gene expression data are known to involve hidden processes. In that regard, we are concerned with deriving a method for inferring a sparse dynamic network in a high-dimensional data setting. We assume that the observations are noisy measurements of gene expression in the form of mRNAs, whose dynamics can be described by some unknown or hidden process. We build an input-dependent linear state space model from these hidden states and demonstrate how a regularization constraint incorporated in an Expectation-Maximization (EM) algorithm can be used to reverse engineer transcriptional networks from gene expression profiling data. This corresponds to estimating the model interaction parameters. The proposed method is illustrated on time-course microarray data from a well-established T-cell dataset. At the optimal tuning parameters we found the genes TRAF5, JUND, CDK4, CASP4, CD69, and C3X1 to have the highest numbers of inward-directed connections, and FYB, CCNA2, AKT1, and CASP8 to be the genes with the highest numbers of outward-directed connections. We recommend these genes as candidates for further investigation. Caspase 4 is also found to activate the expression of JunD, which in turn represses the cell cycle regulator CDC2.
Comment: arXiv admin note: substantial text overlap with arXiv:1308.359
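The core idea, an L1 penalty yielding a sparse interaction matrix, can be sketched without the paper's hidden-state EM machinery. The sketch below substitutes a plain lagged Lasso regression on synthetic data for the state space model; the gene count, dynamics matrix, and `alpha` value are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Toy stand-in for mRNA time-course data: T time points x G genes following
# first-order linear dynamics plus noise (not the paper's hidden-state model).
T, G = 200, 6
A_true = 0.5 * np.eye(G)        # self-decay on every gene
A_true[0, 1] = 0.8              # hypothetical edge: gene 1 activates gene 0
A_true[2, 3] = -0.6             # hypothetical edge: gene 3 represses gene 2
X = np.zeros((T, G))
X[0] = rng.normal(size=G)
for t in range(1, T):
    X[t] = A_true @ X[t - 1] + rng.normal(size=G)

# Sparse network estimate: one Lasso fit per target gene, regressing x_t on
# x_{t-1}; non-zero coefficients are the inferred directed edges.
A_hat = np.zeros((G, G))
for g in range(G):
    fit = Lasso(alpha=0.05, fit_intercept=False).fit(X[:-1], X[1:, g])
    A_hat[g] = fit.coef_

n_edges = int(np.count_nonzero(np.abs(A_hat) > 1e-3))
print(f"inferred {n_edges} directed edges out of {G * G} possible")
```

The L1 penalty zeroes out most of the G x G candidate edges, which is what makes network recovery feasible when the number of genes is large relative to the number of time points.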
Building thermal load prediction through shallow machine learning and deep learning
Building thermal load prediction informs the optimization of cooling plants and thermal energy storage. Physics-based prediction models of building thermal load are constrained by model and input complexity. In this study, we developed 12 data-driven models (7 shallow learning, 2 deep learning, and 3 heuristic methods) to predict building thermal load and compared shallow machine learning and deep learning. The 12 prediction models were compared against the measured cooling demand. It was found that XGBoost (Extreme Gradient Boosting) and LSTM (Long Short-Term Memory) provided the most accurate load prediction in the shallow and deep learning categories, respectively, and both outperformed the best baseline model, which uses the previous day's data for prediction. We then discuss how the prediction horizon and input uncertainty influence load prediction accuracy. The major conclusions are twofold: first, LSTM performs well in short-term prediction (1 h ahead) but not in long-term prediction (24 h ahead), because the sequential information becomes less relevant, and accordingly less useful, when the prediction horizon is long. Second, the presence of weather forecast uncertainty deteriorates XGBoost's accuracy and favors LSTM, because the sequential information makes the model more robust to input uncertainty. Training the model with uncertain rather than accurate weather data can enhance its robustness. Our findings have two implications for practice. First, LSTM is recommended for short-term load prediction, given that weather forecast uncertainty is unavoidable. Second, XGBoost is recommended for long-term prediction, and the model should be trained in the presence of input uncertainty.
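The evaluation pattern described above, a learned model judged against the previous-day persistence baseline, can be sketched on synthetic data. This is a minimal illustration, not the paper's experiment: scikit-learn's `GradientBoostingRegressor` stands in for XGBoost, and all load/temperature coefficients are made up.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)

# Synthetic hourly cooling load over 60 days: a daily cycle plus a
# temperature-driven term plus noise (all coefficients are made up).
hours = np.arange(24 * 60)
temp = 25 + 8 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, hours.size)
load = 100 + 4 * temp + 20 * np.sin(2 * np.pi * hours / 24) \
    + rng.normal(0, 5, hours.size)

# Features per hour t: hour of day, outdoor temperature, and load 24 h ago.
t_idx = np.arange(24, hours.size)
X = np.column_stack([t_idx % 24, temp[t_idx], load[t_idx - 24]])
y = load[t_idx]

split = len(y) - 24 * 7                     # hold out the final week
model = GradientBoostingRegressor(random_state=0).fit(X[:split], y[:split])
pred = model.predict(X[split:])
baseline = load[t_idx - 24][split:]         # previous-day persistence

mae_model = float(np.mean(np.abs(pred - y[split:])))
mae_base = float(np.mean(np.abs(baseline - y[split:])))
print(f"model MAE {mae_model:.1f} vs persistence baseline MAE {mae_base:.1f}")
```

The boosted model can exploit the current temperature, which the persistence baseline cannot, so it beats the baseline here; the abstract's point is that this advantage erodes once the weather inputs themselves are uncertain forecasts.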
Bayesian Recurrent Neural Network Models for Forecasting and Quantifying Uncertainty in Spatial-Temporal Data
Recurrent neural networks (RNNs) are nonlinear dynamical models commonly used
in the machine learning and dynamical systems literature to represent complex
dynamical or sequential relationships between variables. More recently, as deep
learning models have become more common, RNNs have been used to forecast
increasingly complicated systems. Dynamical spatio-temporal processes represent
a class of complex systems that can potentially benefit from these types of
models. Although the RNN literature is expansive and highly developed,
uncertainty quantification is often ignored. Even when considered, the
uncertainty is generally quantified without the use of a rigorous framework,
such as a fully Bayesian setting. Here we attempt to quantify uncertainty in a
more formal framework while maintaining the forecast accuracy that makes these
models appealing, by presenting a Bayesian RNN model for nonlinear
spatio-temporal forecasting. Additionally, we make simple modifications to the
basic RNN to help accommodate the unique nature of nonlinear spatio-temporal
data. The proposed model is applied to a Lorenz simulation and two real-world
nonlinear spatio-temporal forecasting applications.
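The contrast the abstract draws, a forecast distribution rather than a point forecast, can be illustrated far more simply than with a Bayesian RNN. The sketch below uses a conjugate Bayesian linear AR(1) model with known noise variance as a drastically simplified stand-in; the series, prior, and parameter values are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy series from an AR(1) process; the goal is a forecast *distribution*,
# not just a point forecast.
n, true_phi, sigma = 300, 0.7, 1.0
x = np.zeros(n)
for t in range(1, n):
    x[t] = true_phi * x[t - 1] + sigma * rng.normal()

# Conjugate Bayesian update for the AR coefficient phi: prior phi ~ N(0, 1),
# noise variance sigma^2 treated as known.
xp, y = x[:-1], x[1:]
prec_post = 1.0 + xp @ xp / sigma**2           # posterior precision of phi
mean_post = (xp @ y / sigma**2) / prec_post    # posterior mean of phi

# One-step-ahead posterior predictive distribution:
#   x_{n+1} | data ~ N(mean_post * x_n, sigma^2 + x_n^2 / prec_post)
pred_mean = mean_post * x[-1]
pred_sd = float(np.sqrt(sigma**2 + x[-1] ** 2 / prec_post))
lo, hi = pred_mean - 1.96 * pred_sd, pred_mean + 1.96 * pred_sd
print(f"posterior mean of phi: {mean_post:.2f}")
print(f"95% predictive interval: [{lo:.2f}, {hi:.2f}]")
```

Note how the predictive variance contains two terms: irreducible noise (`sigma^2`) plus parameter uncertainty (`x_n^2 / prec_post`). A fully Bayesian RNN delivers the same decomposition for a nonlinear model, which is exactly what point-forecast RNNs omit.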
Exploring Interpretable LSTM Neural Networks over Multi-Variable Data
For recurrent neural networks trained on time series with target and
exogenous variables, in addition to accurate prediction, it is also desired to
provide interpretable insights into the data. In this paper, we explore the
structure of LSTM recurrent neural networks to learn variable-wise hidden
states, with the aim to capture different dynamics in multi-variable time
series and distinguish the contribution of variables to the prediction. With
these variable-wise hidden states, a mixture attention mechanism is proposed to
model the generative process of the target. Then we develop associated training
methods to jointly learn network parameters, variable and temporal importance
w.r.t the prediction of the target variable. Extensive experiments on real
datasets demonstrate enhanced prediction performance by capturing the dynamics
of different variables. Meanwhile, we evaluate the interpretation results both
qualitatively and quantitatively. It exhibits the prospect as an end-to-end
framework for both forecasting and knowledge extraction over multi-variable
data.Comment: Accepted to International Conference on Machine Learning (ICML), 201
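The key mechanism, per-variable "experts" combined by learned softmax weights that double as importance scores, can be sketched without the LSTM machinery. The sketch below is a stripped-down mixture of linear experts trained by hand-written gradient descent; the data-generating coefficients and learning rate are made up, and the real paper learns these weights inside a recurrent network.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy multi-variable data: the target is driven mostly by exogenous
# variable 0, weakly by variable 1, and not at all by variable 2.
T = 500
ex = rng.normal(size=(T, 3))
y = 1.5 * ex[:, 0] + 0.1 * ex[:, 1] + rng.normal(0.0, 0.1, T)

# One linear "expert" per variable, combined by softmax attention weights;
# experts and attention logits are learned jointly on squared error.
w = np.zeros(3)   # per-variable expert coefficients
a = np.zeros(3)   # attention logits
lr = 0.05
for _ in range(2000):
    alpha = np.exp(a) / np.exp(a).sum()        # softmax attention weights
    err = ex @ (alpha * w) - y
    grad_w = alpha * (err @ ex) / T            # dL/dw
    g_alpha = w * (err @ ex) / T               # dL/d(alpha)
    grad_a = alpha * (g_alpha - (alpha * g_alpha).sum())  # softmax chain rule
    w -= lr * grad_w
    a -= lr * grad_a

alpha = np.exp(a) / np.exp(a).sum()
mae = float(np.mean(np.abs(ex @ (alpha * w) - y)))
print("attention weights:", np.round(alpha, 3), "MAE:", round(mae, 3))
```

After training, the largest attention weight lands on the variable that actually drives the target, which is the sense in which the mixture weights serve as an interpretation of variable importance.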
Econometrics meets sentiment : an overview of methodology and applications
The advent of massive amounts of textual, audio, and visual data has spurred the development of econometric methodology to transform qualitative sentiment data into quantitative sentiment variables, and to use those variables in an econometric analysis of the relationships between sentiment and other variables. We survey this emerging research field and refer to it as sentometrics, which is a portmanteau of sentiment and econometrics. We provide a synthesis of the relevant methodological approaches, illustrate with empirical results, and discuss useful software.
Boosting Techniques for Nonlinear Time Series Models
Many of the popular nonlinear time series models require the a priori choice of parametric functions assumed to be appropriate in specific applications. This approach is used mainly in financial applications, where sufficient knowledge is available about the nonlinear structure between the covariates and the response. One principal strategy for investigating a broader class of nonlinear time series is the Nonlinear Additive AutoRegressive (NAAR) model. The NAAR model estimates the lags of a time series as flexible functions in order to detect non-monotone relationships between current observations and past values.
We consider linear and additive models for identifying nonlinear relationships. A componentwise boosting algorithm is applied for simultaneous model fitting, variable selection, and model choice. Thus, by applying boosting to fit potentially nonlinear models, we address two major issues in time series modelling: lag selection and nonlinearity. By means of simulation we compare the outcomes of boosting with those obtained through alternative nonparametric methods. Boosting shows an overall strong performance in terms of precise estimation of highly nonlinear lag functions. The forecasting potential of boosting is examined on real data where the target variable is German industrial production (IP). In order to improve the model's forecasting quality we include additional exogenous variables, which addresses the second major aspect of this paper: high-dimensionality in models. Allowing additional inputs extends the NAAR model to an even broader class of models, namely the NAARX model. We show that boosting can cope with large models that have many covariates relative to the number of observations.
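Componentwise boosting's implicit lag selection can be sketched in a few lines. The sketch below uses simple linear base learners (the paper's flexible lag functions would use smooth base learners instead), and the generating process, number of iterations, and shrinkage value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Nonlinear autoregressive series in which only lags 1 and 3 matter.
n = 600
x = np.zeros(n)
for t in range(3, n):
    x[t] = 0.6 * x[t - 1] - 0.4 * np.tanh(x[t - 3]) + 0.3 * rng.normal()

max_lag = 6
X = np.column_stack([x[max_lag - l: n - l] for l in range(1, max_lag + 1)])
y = x[max_lag:]

# Componentwise L2-boosting: each step fits every lag separately with a
# simple least-squares base learner, keeps only the best-fitting lag, and
# adds a shrunken update.  Lags that are never picked keep a zero
# coefficient, so lag selection happens implicitly.
nu = 0.1                                    # shrinkage (learning rate)
offset = y.mean()
coefs = np.zeros(max_lag)
resid = y - offset
for _ in range(200):
    slopes = X.T @ resid / (X ** 2).sum(axis=0)      # per-lag LS slope
    sse = ((resid[:, None] - X * slopes) ** 2).sum(axis=0)
    j = int(np.argmin(sse))                          # best-fitting lag
    coefs[j] += nu * slopes[j]
    resid = y - offset - X @ coefs

selected = (np.nonzero(coefs)[0] + 1).tolist()       # lags ever chosen
print("selected lags:", selected)
print("coefficients:", np.round(coefs, 2))
```

Because only one component is updated per iteration and each update is shrunken by `nu`, irrelevant lags rarely get picked, which is how boosting handles the many-covariates NAARX setting the abstract describes.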
A fast algorithm for detecting gene-gene interactions in genome-wide association studies
With the recent advent of high-throughput genotyping techniques, genetic data
for genome-wide association studies (GWAS) have become increasingly available,
which entails the development of efficient and effective statistical
approaches. Although many such approaches have been developed and used to
identify single-nucleotide polymorphisms (SNPs) that are associated with
complex traits or diseases, few are able to detect gene-gene interactions among
different SNPs. Genetic interactions, also known as epistasis, have been
recognized to play a pivotal role in contributing to the genetic variation of
phenotypic traits. However, because of an extremely large number of SNP-SNP
combinations in GWAS, the model dimensionality can quickly become so
overwhelming that no prevailing variable selection methods are capable of
handling this problem. In this paper, we present a statistical framework for
characterizing main genetic effects and epistatic interactions in GWAS.
Specifically, we first propose a two-stage sure independence screening (TS-SIS)
procedure and generate a pool of candidate SNPs and interactions, which serve
as predictors to explain and predict the phenotypes of a complex trait. We also
propose a rates adjusted thresholding estimation (RATE) approach to determine
the size of the reduced model selected by an independence screening.
Regularization regression methods, such as LASSO or SCAD, are then applied to
further identify important genetic effects. Simulation studies show that the TS-SIS procedure is computationally efficient and has outstanding finite-sample performance in selecting potential SNPs as well as gene-gene interactions. We apply the proposed framework to analyze an ultrahigh-dimensional GWAS data set from the Framingham Heart Study, and select 23 active SNPs and 24 active epistatic interactions for body mass index variation. This demonstrates the capability of our procedure to resolve the complexity of genetic control.
Comment: Published at http://dx.doi.org/10.1214/14-AOAS771 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
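The two-stage screening logic can be sketched with plain marginal-correlation screening. This is only an illustration of why screening first makes interaction search tractable, not the authors' TS-SIS/RATE procedure; the sample sizes, effect sizes, and SNP indices 7 and 42 are made up.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy GWAS-like data: n subjects, p SNP genotypes coded 0/1/2, and a
# phenotype driven by two main effects plus their interaction.
n, p = 400, 2000
snps = rng.integers(0, 3, size=(n, p)).astype(float)
y = 1.0 * snps[:, 7] + 0.8 * snps[:, 42] \
    + 0.5 * snps[:, 7] * snps[:, 42] + rng.normal(0.0, 1.0, n)

# Stage 1: sure independence screening -- rank SNPs by absolute marginal
# correlation with the phenotype and keep only the top d.
Z = (snps - snps.mean(axis=0)) / snps.std(axis=0)
yc = (y - y.mean()) / y.std()
corr = np.abs(Z.T @ yc / n)
d = 20
kept = np.argsort(corr)[::-1][:d]

# Stage 2: screen pairwise interactions among the retained SNPs only,
# which is feasible because d*(d-1)/2 pairs is tiny next to p*(p-1)/2.
inter_corr = {
    (int(i), int(j)): float(abs(np.corrcoef(snps[:, i] * snps[:, j], y)[0, 1]))
    for k, i in enumerate(kept) for j in kept[k + 1:]
}
best_pair = max(inter_corr, key=inter_corr.get)
print("top interaction pair:", sorted(best_pair))
```

With p = 2000 there are nearly two million SNP pairs, but only 190 pairs among the 20 screened SNPs; a regularized regression (LASSO or SCAD, as in the abstract) would then be run on this reduced candidate set.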