61 research outputs found

    Multiple testing using the posterior probability of half-space: application to gene expression data.

    We consider the problem of testing the equality of two sample means when the number of tests performed is large. Applying this problem to gene expression data, our goal is to detect a set of genes that are differentially expressed under two treatments or two biological conditions. A null hypothesis of no difference in gene expression under the two conditions is constructed. Since such a hypothesis is tested for each gene, thousands of tests are performed simultaneously, and multiple testing issues arise. The aim of our research is to connect Bayesian analysis and frequentist theory in the context of multiple comparisons by deriving some properties shared by both p-values and posterior probabilities. The ultimate goal of this work is to use the posterior probability of the one-sided alternative hypothesis (or, equivalently, the posterior probability of the half-space) in the same spirit as a p-value. We show, for instance, that such a Bayesian probability can be used as an input in some standard multiple testing procedures controlling the false discovery rate.
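
    As a rough illustration of the last point, the sketch below runs the Benjamini-Hochberg step-up procedure, one standard way of controlling the false discovery rate, on a vector of per-gene scores that play the role of p-values. The abstract does not say which procedure is used or how the posterior probability is turned into such a score, so the function name `benjamini_hochberg`, the score values, and the level `alpha` are all assumptions made for illustration.

```python
import numpy as np

def benjamini_hochberg(scores, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    `scores` are per-test values used in the same spirit as p-values
    (small value = evidence against the null). Returns a boolean mask
    of rejected hypotheses, in the original order of `scores`.
    """
    scores = np.asarray(scores, dtype=float)
    m = scores.size
    order = np.argsort(scores)                    # indices sorting scores ascending
    thresholds = alpha * np.arange(1, m + 1) / m  # alpha * i / m for rank i
    below = scores[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])          # largest rank meeting its threshold
        reject[order[:k + 1]] = True              # reject every hypothesis up to that rank
    return reject

# Hypothetical scores for ten genes (made-up numbers, not from the paper).
scores = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(scores, alpha=0.05)
print(rejected)
```

    With these made-up scores only the two smallest values fall under their rank-dependent thresholds, so two hypotheses are rejected.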

    Scalable Spatiotemporally Varying Coefficient Modelling with Bayesian Kernelized Tensor Regression

    As a regression technique in spatial statistics, the spatiotemporally varying coefficient model (STVC) is an important tool for discovering nonstationary and interpretable response-covariate associations over both space and time. However, it is difficult to apply STVC for large-scale spatiotemporal analyses due to its high computational cost. To address this challenge, we summarize the spatiotemporally varying coefficients using a third-order tensor structure and propose to reformulate the spatiotemporally varying coefficient model as a special low-rank tensor regression problem. The low-rank decomposition can effectively model the global patterns of large data sets with a substantially reduced number of parameters. To further incorporate the local spatiotemporal dependencies, we use Gaussian process (GP) priors on the spatial and temporal factor matrices. We refer to the overall framework as Bayesian Kernelized Tensor Regression (BKTR), and kernelized tensor factorization can be considered a new and scalable approach to modeling multivariate spatiotemporal processes with a low-rank covariance structure. For model inference, we develop an efficient Markov chain Monte Carlo (MCMC) algorithm, which uses Gibbs sampling to update factor matrices and slice sampling to update kernel hyperparameters. We conduct extensive experiments on both synthetic and real-world data sets, and our results confirm the superior performance and efficiency of BKTR for model estimation and parameter inference
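
    To make the parameter saving concrete, here is a minimal sketch of a third-order coefficient tensor built from three factor matrices. It assumes a CP-style decomposition (a sum of rank-one terms), which the abstract does not name explicitly, and it uses random factors in place of the Gaussian process priors and the MCMC inference. The sizes `M`, `N`, `P`, and `R` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: M locations, N time points, P covariates, rank R.
M, N, P, R = 50, 30, 4, 3

# Factor matrices (random here; the abstract places GP priors on the
# spatial and temporal factors, which this sketch does not model).
U = rng.normal(size=(M, R))   # spatial factors
V = rng.normal(size=(N, R))   # temporal factors
W = rng.normal(size=(P, R))   # covariate factors

# Coefficient tensor: B[m, n, p] = sum_r U[m, r] * V[n, r] * W[p, r]
B = np.einsum("mr,nr,pr->mnp", U, V, W)

# Varying-coefficient regression mean: y[m, n] = sum_p X[m, n, p] * B[m, n, p]
X = rng.normal(size=(M, N, P))
y = np.einsum("mnp,mnp->mn", X, B)

n_full = M * N * P            # free coefficients without low-rank structure
n_lowrank = R * (M + N + P)   # parameters in the three factor matrices
print(n_full, n_lowrank)
```

    For these sizes the full tensor has 6000 entries, while the three factor matrices together hold 252 parameters.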

    Prediction intervals for travel time on transportation networks

    Estimating travel time is essential for making travel decisions in transportation networks. Empirically, single road-segment travel time is well studied, but how to aggregate such information over many edges to arrive at the distribution of travel time over a route is still theoretically challenging. Understanding the travel-time distribution can help resolve many fundamental problems in transportation, such as quantifying travel uncertainty. We develop a novel statistical perspective on specific types of dynamical processes that mimic the behavior of travel time on real-world networks. We show that, under general conditions, travel time normalized by distance follows a Gaussian distribution with route-invariant (universal) location and scale parameters. We develop efficient inference methods for these parameters, with which we propose asymptotic universal confidence and prediction intervals for travel time. We further extend our theory to include road-segment-level information, constructing route-specific location and scale parameter sequences that produce tighter route-specific Gaussian-based prediction intervals. We illustrate our methods with a real-world case study using precollected mobile GPS data, where we show that the route-specific and route-invariant intervals both achieve the 95% theoretical coverage level, with the former yielding tighter bounds that also outperform competing models.
    Comment: 24 main pages, 4 figures and 4 tables. This version includes many changes to the previous on
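
    The route-invariant interval can be sketched under a literal reading of the abstract: if travel time divided by distance is Gaussian with universal location `mu` and scale `sigma`, a prediction interval for a route follows by rescaling with its distance. The paper's exact normalization and asymptotics are not given in the abstract, so this is only an assumption-laden sketch; the function name and the parameter values (in seconds per kilometre) are hypothetical.

```python
from statistics import NormalDist

def prediction_interval(distance_km, mu, sigma, level=0.95):
    """Gaussian prediction interval for route travel time, in seconds.

    Assumes (travel time / distance) ~ Normal(mu, sigma**2) with the same
    mu and sigma for every route. This is a simplification of the
    abstract's statement, not the paper's actual construction.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)   # two-sided normal quantile
    lower = distance_km * (mu - z * sigma)
    upper = distance_km * (mu + z * sigma)
    return lower, upper

# Hypothetical parameters: 90 s/km on average, 15 s/km standard deviation.
lower, upper = prediction_interval(distance_km=10.0, mu=90.0, sigma=15.0)
print(f"95% interval: {lower:.0f} s to {upper:.0f} s")
```

    A route-specific variant would replace the shared `mu` and `sigma` with values built from road-segment information, as the abstract describes, which is what allows tighter intervals.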

    Inductive Graph Neural Networks for Spatiotemporal Kriging

    Time series forecasting and spatiotemporal kriging are the two most important tasks in spatiotemporal data analysis. Recent research on graph neural networks has made substantial progress in time series forecasting, while little attention has been paid to the kriging problem, that is, recovering signals for unsampled locations/sensors. Most existing scalable kriging methods (e.g., matrix/tensor completion) are transductive, and thus full retraining is required when a new sensor must be interpolated. In this paper, we develop an Inductive Graph Neural Network Kriging (IGNNK) model to recover data for unsampled sensors on a network/graph structure. To generalize the effect of distance and reachability, we generate random subgraphs as samples and reconstruct the corresponding adjacency matrix for each sample. By reconstructing all signals on each sample subgraph, IGNNK can effectively learn the spatial message passing mechanism. Empirical results on several real-world spatiotemporal datasets demonstrate the effectiveness of our model. In addition, we find that the learned model can be successfully transferred to the same type of kriging task on an unseen dataset. Our results show that: 1) GNN is an efficient and effective tool for spatial kriging; 2) inductive GNNs can be trained using dynamic adjacency matrices; 3) a trained model can be transferred to new graph structures; and 4) IGNNK can be used to generate virtual sensors.
    Comment: AAAI 202

    Hierarchical Inverse Gaussian Models and Multiple Testing: Application to Gene Expression Data

    Detecting differentially expressed genes in microarray experiments is a topic that has been well studied in the literature. Many hypothesis testing methods have been proposed that rely on strong distributional assumptions for the gene intensities. However, the shape of microarray data may vary substantially from one experiment to another, and model assumptions may be seriously violated in many cases. The literature on microarray data is mainly based on two distributions, the log-normal and the gamma, which often appear to be effective when used in a Bayesian hierarchical framework. However, even if a model that fits the data well globally seems attractive, two points deserve attention: the ability of the model to fit the tail of the observed distribution, and its robustness to model misspecification, in terms of error rates for the hypothesis tests. In order to focus on these aspects, we propose to use Bayesian models involving the inverse Gaussian distribution to describe gene expression data. We show that these models can be good competitors to the traditional Bayesian or random effect gamma or log-normal models in some situations. A multiple testing procedure is then proposed, based on an asymptotic property of the posterior probability of the one-sided alternative hypothesis. We show that the asymptotic property is well approximated for inverse Gaussian models, even when the number of observations available for each test is very small.

    Covariance regression with random forests

    Capturing the conditional covariances or correlations among the elements of a multivariate response vector based on covariates is important to various fields including neuroscience, epidemiology and biomedicine. We propose a new method called Covariance Regression with Random Forests (CovRegRF) to estimate the covariance matrix of a multivariate response given a set of covariates, using a random forest framework. Random forest trees are built with a splitting rule specially designed to maximize the difference between the sample covariance matrix estimates of the child nodes. We also propose a significance test for the partial effect of a subset of covariates. We evaluate the performance of the proposed method and significance test through a simulation study which shows that the proposed method provides accurate covariance matrix estimates and that the Type-1 error is well controlled. An application of the proposed method to thyroid disease data is also presented. CovRegRF is implemented in a freely available R package on CRAN.
    Comment: 44 page
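
    The splitting idea can be sketched for a single covariate as follows. The abstract only says that the rule maximizes the difference between the child nodes' sample covariance matrices; the Frobenius norm and the square-root weighting by child sizes used here are assumptions, and this is not the CovRegRF package's code (which is in R). The names `split_score` and `best_threshold` and the simulated data are hypothetical.

```python
import numpy as np

def split_score(Y, left_mask):
    """Size-weighted distance between the child nodes' sample covariances."""
    Y_left, Y_right = Y[left_mask], Y[~left_mask]
    n_left, n_right = len(Y_left), len(Y_right)
    if min(n_left, n_right) < 2:        # a covariance needs at least two rows
        return 0.0
    cov_left = np.cov(Y_left, rowvar=False)
    cov_right = np.cov(Y_right, rowvar=False)
    distance = np.linalg.norm(cov_left - cov_right, ord="fro")
    return float(np.sqrt(n_left * n_right) * distance)

def best_threshold(x, Y):
    """Scan thresholds on one covariate and return the best (threshold, score)."""
    candidates = np.unique(x)[:-1]      # drop the max so the right child is non-empty
    scores = [split_score(Y, x <= c) for c in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Simulated data: the correlation of the two responses flips sign at x = 0.5.
rng = np.random.default_rng(0)
x = rng.uniform(size=400)
rho = np.where(x > 0.5, 0.9, -0.9)
z1, z2 = rng.normal(size=(2, 400))
Y = np.column_stack([z1, rho * z1 + np.sqrt(1 - rho**2) * z2])

threshold, score = best_threshold(x, Y)
print(threshold, score)
```

    On this simulated example the selected threshold should land near 0.5, where the response covariance changes.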