61 research outputs found
Multiple testing using the posterior probability of half-space: application to gene expression data.
We consider the problem of testing the equality of two sample means when the number of tests performed is large. In the context of gene expression data, the goal is to detect a set of genes differentially expressed under two treatments or two biological conditions. A null hypothesis of no difference in gene expression under the two conditions is constructed. Since such a hypothesis is tested for each gene, thousands of tests are performed simultaneously, and multiple testing issues arise. The aim of our research is to connect Bayesian analysis and frequentist theory in the context of multiple comparisons by deriving some properties shared by both p-values and posterior probabilities. The ultimate goal of this work is to use the posterior probability of the one-sided alternative hypothesis (or equivalently, the posterior probability of the half-space) in the same spirit as a p-value. We show, for instance, that such a Bayesian probability can be used as an input to some standard multiple testing procedures that control the false discovery rate.
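As an illustration of the last point, the following is a minimal sketch of feeding posterior probabilities into the Benjamini-Hochberg step-up procedure, one standard method for controlling the false discovery rate. The abstract does not say which procedure or which conversion the authors use; here it is assumed that the p-value-like quantity is one minus the posterior probability of the one-sided alternative. The function name and the example probabilities are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_like, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; returns a boolean rejection mask."""
    p = np.asarray(p_like, dtype=float)
    m = p.size
    order = np.argsort(p)                       # indices of p sorted ascending
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds              # which sorted values meet their threshold
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])        # largest rank meeting its threshold
        reject[order[:k + 1]] = True            # reject everything up to that rank
    return reject

# Hypothetical posterior probabilities of the one-sided alternative, one per gene.
post_alt = np.array([0.999, 0.50, 0.98, 0.9995, 0.60])

# Assumption of this sketch: use the posterior probability of the
# complementary half-space in place of a p-value.
p_like = 1.0 - post_alt

rejected = benjamini_hochberg(p_like, alpha=0.05)
print(rejected)
```

Genes flagged in `rejected` are those declared differentially expressed at the chosen level `alpha`.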
Scalable Spatiotemporally Varying Coefficient Modelling with Bayesian Kernelized Tensor Regression
As a regression technique in spatial statistics, the spatiotemporally varying
coefficient model (STVC) is an important tool for discovering nonstationary and
interpretable response-covariate associations over both space and time.
However, it is difficult to apply STVC for large-scale spatiotemporal analyses
due to its high computational cost. To address this challenge, we summarize the
spatiotemporally varying coefficients using a third-order tensor structure and
propose to reformulate the spatiotemporally varying coefficient model as a
special low-rank tensor regression problem. The low-rank decomposition can
effectively model the global patterns of large data sets with a substantially
reduced number of parameters. To further incorporate the local spatiotemporal
dependencies, we use Gaussian process (GP) priors on the spatial and temporal
factor matrices. We refer to the overall framework as Bayesian Kernelized
Tensor Regression (BKTR), and kernelized tensor factorization can be considered
a new and scalable approach to modeling multivariate spatiotemporal processes
with a low-rank covariance structure. For model inference, we develop an
efficient Markov chain Monte Carlo (MCMC) algorithm, which uses Gibbs sampling
to update factor matrices and slice sampling to update kernel hyperparameters.
We conduct extensive experiments on both synthetic and real-world data sets,
and our results confirm the superior performance and efficiency of BKTR for
model estimation and parameter inference.
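The parameter saving from the low-rank structure can be sketched as follows. This example assumes a CP (rank-R) decomposition of the third-order coefficient tensor over locations, time points, and covariates; the abstract only says "low-rank", so the decomposition type, the dimensions, and all variable names are assumptions. The GP priors and the MCMC sampler are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: M locations, N time points, P covariates, CP rank R.
M, N, P, R = 6, 5, 3, 2

# Factor matrices (in the abstract, the spatial and temporal factors get GP priors).
U = rng.normal(size=(M, R))   # spatial factors
V = rng.normal(size=(N, R))   # temporal factors
W = rng.normal(size=(P, R))   # covariate factors

# Coefficient tensor: B[m, n, p] = sum_r U[m, r] * V[n, r] * W[p, r]
B = np.einsum("mr,nr,pr->mnp", U, V, W)

# Varying-coefficient regression mean: y[m, n] = sum_p X[m, n, p] * B[m, n, p]
X = rng.normal(size=(M, N, P))
y_mean = np.einsum("mnp,mnp->mn", X, B)

n_full = M * N * P            # free coefficients without low-rank structure
n_lowrank = R * (M + N + P)   # parameters in the CP factors
print(n_full, n_lowrank)
```

The gap between `n_full` and `n_lowrank` widens quickly as the number of locations and time points grows, which is the source of the scalability claimed above.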
Prediction intervals for travel time on transportation networks
Estimating travel-time is essential for making travel decisions in
transportation networks. Empirically, single road-segment travel-time is well
studied, but how to aggregate such information over many edges to arrive at the
distribution of travel time over a route is still theoretically challenging.
Understanding travel-time distribution can help resolve many fundamental
problems in transportation, such as quantifying travel uncertainty. We
develop a novel statistical perspective on specific types of dynamical
processes that mimic the behavior of travel time on real-world networks. We
show that, under general conditions, travel time normalized by distance
follows a Gaussian distribution with route-invariant (universal) location and
scale parameters. We develop efficient inference methods for such parameters,
with which we propose asymptotic universal confidence and prediction intervals
of travel time. We further develop our theory to include road-segment level
information to construct route-specific location and scale parameter sequences
that produce tighter route-specific Gaussian-based prediction intervals. We
illustrate our methods with a real-world case study using precollected mobile
GPS data, where we show that the route-specific and route-invariant intervals
both achieve the 95% theoretical coverage level, with the former yielding
tighter bounds that also outperform competing models.
Comment: 24 main pages, 4 figures and 4 tables. This version includes many
changes to the previous on
Inductive Graph Neural Networks for Spatiotemporal Kriging
Time series forecasting and spatiotemporal kriging are the two most important
tasks in spatiotemporal data analysis. Recent research on graph neural networks
has made substantial progress in time series forecasting, while little
attention has been paid to the kriging problem -- recovering signals for
unsampled locations/sensors. Most existing scalable kriging methods (e.g.,
matrix/tensor completion) are transductive, and thus full retraining is
required when we have a new sensor to interpolate. In this paper, we develop an
Inductive Graph Neural Network Kriging (IGNNK) model to recover data for
unsampled sensors on a network/graph structure. To generalize the effect of
distance and reachability, we generate random subgraphs as samples and
reconstruct the corresponding adjacency matrix for each sample. By
reconstructing all signals on each sample subgraph, IGNNK can effectively learn
the spatial message passing mechanism. Empirical results on several real-world
spatiotemporal datasets demonstrate the effectiveness of our model. In
addition, we also find that the learned model can be successfully transferred
to the same type of kriging tasks on an unseen dataset. Our results show that:
1) GNN is an efficient and effective tool for spatial kriging; 2) inductive
GNNs can be trained using dynamic adjacency matrices; 3) a trained model can be
transferred to new graph structures; and 4) IGNNK can be used to generate
virtual sensors.
Comment: AAAI 202
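The subgraph-sampling step described in the abstract above might look roughly like the sketch below. It only prepares one training sample (a random subgraph, its adjacency matrix, and a mask hiding some sensors); the graph neural network itself is omitted. The Gaussian-kernel adjacency, the function name `sample_training_subgraph`, and all parameters are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def sample_training_subgraph(signals, dist, n_sub, n_masked, sigma, rng):
    """Draw a random subgraph and hide some of its nodes as pseudo-unsampled sensors.

    signals: (T, n_nodes) array of observations; dist: (n_nodes, n_nodes) distances.
    """
    n_nodes = dist.shape[0]
    nodes = rng.choice(n_nodes, size=n_sub, replace=False)

    # Rebuild the adjacency matrix for this subgraph only
    # (assumed Gaussian kernel on pairwise distance).
    adj = np.exp(-(dist[np.ix_(nodes, nodes)] / sigma) ** 2)

    target = signals[:, nodes]                  # full signals to reconstruct
    mask = np.ones(n_sub)
    hidden = rng.choice(n_sub, size=n_masked, replace=False)
    mask[hidden] = 0.0                          # 0 marks a masked sensor
    inputs = target * mask                      # masked sensors see zeros
    return inputs, target, adj, mask

rng = np.random.default_rng(0)
coords = rng.uniform(size=(10, 2))              # hypothetical sensor locations
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
signals = rng.normal(size=(24, 10))             # 24 time steps, 10 sensors

inputs, target, adj, mask = sample_training_subgraph(
    signals, dist, n_sub=6, n_masked=2, sigma=0.5, rng=rng
)
```

A model trained to reconstruct `target` from `inputs` and `adj` never sees a fixed graph, which is what allows it to be applied to sensors that were absent during training.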
Impact of gene expression data pre-processing on expression quantitative trait locus mapping
Hierarchical Inverse Gaussian Models and Multiple Testing: Application to Gene Expression Data
Detecting differentially expressed genes in microarray experiments is a topic that has been well studied in the literature. Many hypothesis testing methods have been proposed that rely on strong distributional assumptions for the gene intensities. However, the shape of microarray data may vary substantially from one experiment to another, and model assumptions may be seriously violated in many cases. The literature on microarray data is mainly based on two distributions, the log-normal and the gamma, which often appear to be effective when used in a Bayesian hierarchical framework. However, even if a model that fits the data well globally seems attractive, two points deserve attention: the ability of the model to fit the tail of the observed distribution, and its robustness to model misspecification, in terms of error rates for the hypothesis tests. To focus on these aspects, we propose to use Bayesian models involving the inverse Gaussian distribution to describe gene expression data. We show that these models can be good competitors to the traditional Bayesian or random effect gamma or log-normal models in some situations. A multiple testing procedure is then proposed, based on an asymptotic property of the posterior probability of the one-sided alternative hypothesis. We show that the asymptotic property is well approximated for inverse Gaussian models, even when the number of observations available for each test is very small.
Covariance regression with random forests
Capturing the conditional covariances or correlations among the elements of a
multivariate response vector based on covariates is important to various fields
including neuroscience, epidemiology and biomedicine. We propose a new method
called Covariance Regression with Random Forests (CovRegRF) to estimate the
covariance matrix of a multivariate response given a set of covariates, using a
random forest framework. Random forest trees are built with a splitting rule
specially designed to maximize the difference between the sample covariance
matrix estimates of the child nodes. We also propose a significance test for
the partial effect of a subset of covariates. We evaluate the performance of
the proposed method and significance test through a simulation study which
shows that the proposed method provides accurate covariance matrix estimates
and that the Type-1 error is well controlled. An application of the proposed
method to thyroid disease data is also presented. CovRegRF is implemented in a
freely available R package on CRAN.
Comment: 44 page
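A covariance-based splitting rule of the kind described above can be sketched for a single covariate as follows. The abstract only states that the split maximizes the difference between the child nodes' sample covariance matrices; the particular distance used here (sum of absolute differences over the upper triangle, weighted by the square root of the product of child sizes) and the names `cov_distance` and `best_split` are assumptions of this sketch. It is not the CovRegRF package's code, and it requires at least two response variables.

```python
import numpy as np

def cov_distance(Y_left, Y_right):
    """Sum of absolute differences between the two sample covariance matrices."""
    S_l = np.cov(Y_left, rowvar=False)
    S_r = np.cov(Y_right, rowvar=False)
    iu = np.triu_indices(S_l.shape[0])          # upper triangle incl. diagonal
    return np.abs(S_l[iu] - S_r[iu]).sum()

def best_split(x, Y, min_node=5):
    """Scan thresholds on covariate x; return (threshold, score) of the best split."""
    order = np.argsort(x)
    x_sorted, Y_sorted = x[order], Y[order]
    n = len(x)
    best_thr, best_score = None, -np.inf
    for i in range(min_node, n - min_node + 1):
        if x_sorted[i] == x_sorted[i - 1]:
            continue                            # no threshold between tied values
        # Assumed criterion: size-weighted covariance difference of the children.
        score = np.sqrt(i * (n - i)) * cov_distance(Y_sorted[:i], Y_sorted[i:])
        if score > best_score:
            best_score = score
            best_thr = 0.5 * (x_sorted[i - 1] + x_sorted[i])
    return best_thr, best_score

# Simulated data: the correlation between two responses flips sign at x = 0.5.
rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(size=n)
rho = np.where(x < 0.5, 0.9, -0.9)
z1, z2 = rng.normal(size=n), rng.normal(size=n)
Y = np.column_stack([z1, rho * z1 + np.sqrt(1 - rho**2) * z2])

thr, score = best_split(x, Y)
print(thr)
```

On this simulated data the selected threshold should land close to the point where the correlation changes sign, which a mean-based splitting rule would not detect because both child nodes have the same mean.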
