
    Large-Scale Stochastic Sampling from the Probability Simplex

    Stochastic gradient Markov chain Monte Carlo (SGMCMC) has become a popular method for scalable Bayesian inference. These methods are based on sampling a discrete-time approximation to a continuous-time process, such as the Langevin diffusion. When applied to distributions defined on a constrained space, such as the simplex, the time-discretisation error can dominate when we are near the boundary of the space. We demonstrate that while current SGMCMC methods for the simplex perform well in certain cases, they struggle with sparse simplex spaces, i.e. when many of the components are close to zero. However, most popular large-scale applications of Bayesian inference on simplex spaces, such as network or topic models, are sparse. We argue that this poor performance is due to the biases of SGMCMC caused by the discretisation error. To get around this, we propose the stochastic CIR process, which removes all discretisation error, and we prove that samples from the stochastic CIR process are asymptotically unbiased. Use of the stochastic CIR process within an SGMCMC algorithm is shown to give substantially better performance for a topic model and a Dirichlet process mixture model than existing SGMCMC approaches.
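    The claim that the stochastic CIR process "removes all discretisation error" rests on a standard property of the Cox-Ingersoll-Ross (CIR) diffusion: its transition density is a scaled noncentral chi-squared distribution, so the process can be simulated exactly over any step size rather than through an Euler-type approximation. Below is a minimal sketch of that exact transition in Python; the parameter names a, theta and sigma are generic, and how the paper couples this transition to stochastic gradients on the simplex is not shown here.

```python
import numpy as np

def cir_exact_transition(x, a, theta, sigma, h, rng=None):
    """Draw X_{t+h} | X_t = x exactly for the CIR diffusion
        dX_t = a * (theta - X_t) dt + sigma * sqrt(X_t) dW_t
    using its scaled noncentral chi-squared transition density,
    so there is no time-discretisation error."""
    rng = rng or np.random.default_rng()
    c = sigma**2 * (1.0 - np.exp(-a * h)) / (4.0 * a)  # scale of the chi-squared
    df = 4.0 * a * theta / sigma**2                    # degrees of freedom
    nonc = x * np.exp(-a * h) / c                      # noncentrality parameter
    return c * rng.noncentral_chisquare(df, nonc)

# Illustrative use: evolve a small component (near the simplex boundary)
# over a step of size h = 0.1 without any Euler approximation.
x_next = cir_exact_transition(x=0.05, a=1.0, theta=0.5, sigma=1.0, h=0.1)
```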

    Large-scale Bayesian computation using Stochastic Gradient Markov Chain Monte Carlo

    Markov chain Monte Carlo (MCMC), one of the most popular methods for inference on Bayesian models, scales poorly with dataset size. This is because it requires one or more calculations over the full dataset at each iteration. Stochastic gradient Markov chain Monte Carlo (SGMCMC) has become a popular MCMC method that aims to be more scalable to large datasets. It only requires a subset of the full data at each iteration. This thesis builds upon the SGMCMC literature by providing contributions that improve the efficiency of SGMCMC; providing software that improves its ease of use; and removing large biases in the method for an important class of models. While SGMCMC has improved per-iteration computational cost over traditional MCMC, there have been empirical results suggesting that its overall computational cost (i.e. the cost for the algorithm to reach an arbitrary level of accuracy) is still O(N), where N is the dataset size. In light of this, we show how control variates can be used to develop an SGMCMC algorithm with O(1) cost, subject to two one-off preprocessing steps which each require a single pass through the dataset. While SGMCMC has gained significant popularity in the machine learning community, uptake among the statistics community has been slower. We suggest this may be due to a lack of software, so as part of the contributions in this thesis we provide an R software package that automates much of the procedure required to build SGMCMC algorithms. Finally, we show that current algorithms for sampling from the simplex space using SGMCMC have inherent biases, especially when some of the parameter components are close to zero. To get around this, we develop an algorithm that is provably asymptotically unbiased. We empirically demonstrate its performance on a latent Dirichlet allocation model and a Dirichlet process model.
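    The control-variate idea behind the O(1) per-iteration cost can be sketched as follows: a fixed reference point (for example an approximate posterior mode) and the full-data gradient at that point are computed once in preprocessing, and each iteration then uses a minibatch only to correct for the difference between gradients at the current state and at the reference point. The Python sketch below illustrates one such stochastic gradient Langevin dynamics (SGLD) step with control variates; the helper functions grad_log_prior and grad_log_lik_i are hypothetical user-supplied functions, and the thesis's actual algorithms and preprocessing details may differ.

```python
import numpy as np

def sgld_cv_step(theta, theta_hat, full_grad_at_hat, grad_log_prior,
                 grad_log_lik_i, data, step_size, batch_size, rng):
    """One SGLD step with a control-variate gradient estimate.

    theta_hat:        fixed reference point from a one-off preprocessing step
                      (e.g. an approximate posterior mode).
    full_grad_at_hat: sum over all observations of the log-likelihood gradient
                      at theta_hat, also computed once in preprocessing.
    The minibatch only estimates the difference between gradients at theta and
    at theta_hat, which keeps the variance low while theta stays near theta_hat."""
    N = len(data)
    idx = rng.choice(N, size=batch_size, replace=False)
    correction = sum(grad_log_lik_i(theta, data[i]) - grad_log_lik_i(theta_hat, data[i])
                     for i in idx)
    grad_est = grad_log_prior(theta) + full_grad_at_hat + (N / batch_size) * correction
    noise = rng.normal(scale=np.sqrt(step_size), size=np.shape(theta))
    return theta + 0.5 * step_size * grad_est + noise
```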