13 research outputs found

    Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring

    In this paper we address the following question: can we approximately sample from a Bayesian posterior distribution if we are only allowed to touch a small mini-batch of data items for every sample we generate? An algorithm based on the Langevin equation with stochastic gradients (SGLD) was previously proposed to solve this, but its mixing rate was slow. By leveraging the Bayesian Central Limit Theorem, we extend the SGLD algorithm so that at high mixing rates it samples from a normal approximation of the posterior, while at slow mixing rates it mimics the behavior of SGLD with a pre-conditioner matrix. As a bonus, the proposed algorithm is reminiscent of Fisher scoring (with stochastic gradients) and, as such, is an efficient optimizer during burn-in.
    Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
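The SGLD update the abstract builds on adds Gaussian noise to a minibatch gradient step on the log posterior. A minimal 1-D sketch, assuming a toy Gaussian model (the step size, batch size, and prior variance here are illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: x_i ~ N(theta_true, 1); prior theta ~ N(0, 10).
N, theta_true = 10_000, 2.0
x = rng.normal(theta_true, 1.0, size=N)

def sgld_sample(x, n_iters=5000, batch=100, eps=1e-4):
    """Stochastic Gradient Langevin Dynamics for a 1-D Gaussian mean."""
    N = len(x)
    theta, samples = 0.0, []
    for _ in range(n_iters):
        mb = x[rng.integers(0, N, size=batch)]
        # Unbiased stochastic gradient of the log posterior:
        # grad log p(theta) + (N / n) * sum_i grad log p(x_i | theta)
        grad = -theta / 10.0 + (N / batch) * np.sum(mb - theta)
        # Langevin step: half the gradient step plus N(0, eps) noise.
        theta += 0.5 * eps * grad + rng.normal(0.0, np.sqrt(eps))
        samples.append(theta)
    return np.array(samples)

samples = sgld_sample(x)
print(samples[2500:].mean())  # close to the posterior mean, ~2.0
```

Discarding the first half of the chain as burn-in, the remaining samples concentrate around the posterior mean; the slow mixing the abstract mentions shows up here as the small step size needed to keep the minibatch gradient noise under control.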

    Bayesian Dark Knowledge

    We consider the problem of Bayesian parameter estimation for deep neural networks, which is important in problem settings where we may have little data, and/or where we need accurate posterior predictive densities, e.g., for applications involving bandits or active learning. One simple approach to this is to use online Monte Carlo methods, such as SGLD (stochastic gradient Langevin dynamics). Unfortunately, such a method needs to store many copies of the parameters (which wastes memory), and needs to make predictions using many versions of the model (which wastes time). We describe a method for "distilling" a Monte Carlo approximation to the posterior predictive density into a more compact form, namely a single deep neural network. We compare to two very recent approaches to Bayesian neural networks, namely an approach based on expectation propagation [Hernandez-Lobato and Adams, 2015] and an approach based on variational Bayes [Blundell et al., 2015]. Our method performs better than both of these, is much simpler to implement, and uses less computation at test time.
    Comment: final version submitted to NIPS 2015.
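The distillation idea can be illustrated without deep networks. In this sketch a logistic-regression model stands in for the neural networks, and the "posterior samples" are drawn synthetically rather than by SGLD; everything here is a hypothetical stand-in for the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic binary-classification data.
X = rng.normal(size=(500, 2))
w_true = np.array([1.5, -2.0])
sigmoid = lambda z: 1 / (1 + np.exp(-z))
y = (rng.random(500) < sigmoid(X @ w_true)).astype(float)

# "Teacher": a set of posterior samples of the weights (drawn around the
# true weights here, standing in for an SGLD chain). The Monte Carlo
# posterior predictive is the average of their predictions.
w_samples = w_true + 0.3 * rng.normal(size=(50, 2))
teacher_probs = sigmoid(X @ w_samples.T).mean(axis=1)

# "Student": a single compact model distilled to match the teacher's
# predictive density by minimizing cross-entropy against soft targets.
w_student = np.zeros(2)
for _ in range(2000):
    p = sigmoid(X @ w_student)
    grad = X.T @ (p - teacher_probs) / len(X)  # d(cross-entropy)/dw
    w_student -= 0.5 * grad

p_student = sigmoid(X @ w_student)
print(np.abs(p_student - teacher_probs).mean())  # small: student tracks teacher
```

After distillation, a single forward pass through the student approximates what previously required averaging predictions over every stored posterior sample, which is exactly the memory/time saving the abstract describes.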

    Speed/accuracy trade-offs for modern convolutional object detectors

    The goal of this paper is to serve as a guide for selecting a detection architecture that achieves the right speed/memory/accuracy balance for a given application and platform. To this end, we investigate various ways to trade accuracy for speed and memory usage in modern convolutional object detection systems. A number of successful systems have been proposed in recent years, but apples-to-apples comparisons are difficult due to different base feature extractors (e.g., VGG, Residual Networks), different default image resolutions, as well as different hardware and software platforms. We present a unified implementation of the Faster R-CNN [Ren et al., 2015], R-FCN [Dai et al., 2016] and SSD [Liu et al., 2015] systems, which we view as "meta-architectures", and trace out the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures. On one extreme end of this spectrum where speed and memory are critical, we present a detector that achieves real-time speeds and can be deployed on a mobile device. On the opposite end in which accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.
    Comment: Accepted to CVPR 2017.

    Approximate Markov Chain Monte Carlo Algorithms for Large Scale Bayesian Inference

    Traditional algorithms for Bayesian posterior inference require processing the entire dataset in each iteration and are quickly becoming obsolete given the proliferation of massive datasets in various application domains. Most successful applications of learning with big data have been with simple minibatch-based algorithms such as Stochastic Gradient Descent, because they are the only ones that can computationally handle today's large datasets. However, by restricting ourselves to these algorithms, we miss out on all the advantages of Bayesian modeling, such as controlling over-fitting, estimating uncertainty, and incorporating prior knowledge. In this thesis, we attempt to scale up Bayesian posterior inference to large datasets by developing a new generation of approximate Markov Chain Monte Carlo algorithms that process only a mini-batch of data to generate each posterior sample. The approximation introduces a bias in the stationary distribution of the Markov chain, but we show that this bias is more than compensated by accelerated burn-in and lower variance due to the ability to generate a larger number of samples per unit of computational time.
    Our main contributions are the following. First, we develop a fast Metropolis-Hastings (MH) algorithm by approximating each accept/reject decision using a sequential hypothesis test that processes only an adaptive mini-batch of data instead of the complete dataset. Then, we show that the same idea can be used to speed up the slice sampling algorithm. Next, we present a theoretical analysis of Stochastic Gradient Langevin Dynamics (SGLD), a posterior sampling algorithm derived by adding Gaussian noise to Stochastic Gradient Ascent updates. We also show that the bias in SGLD can be reduced by combining it with our approximate MH test. We then propose a new algorithm called Stochastic Gradient Fisher Scoring (SGFS) which improves the mixing rate of SGLD using a preconditioning matrix that captures the curvature of the posterior distribution. Finally, we develop an efficient algorithm for Bayesian Probabilistic Matrix Factorization using a combination of SGLD and approximate Metropolis-Hastings updates.
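The SGFS idea of preconditioning SGLD with the curvature of the posterior can be sketched in one dimension, where the "preconditioning matrix" reduces to a scalar Fisher-information estimate. This is a rough illustration in the spirit of SGFS, not the thesis's exact update rule; the decay rate `kappa`, step size `eps`, and the running-average Fisher estimate are all assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

# Same kind of toy model: x_i ~ N(theta, 1), effectively flat prior.
N = 10_000
x = rng.normal(2.0, 1.0, size=N)

def sgfs_step(theta, mb, N, I_hat, eps=1.0, kappa=0.99):
    """One SGFS-style update: precondition the stochastic gradient (and
    the injected noise) by an online estimate of the Fisher information."""
    n = len(mb)
    g_i = mb - theta                    # per-item score d/dtheta log p(x_i|theta)
    g = (N / n) * g_i.sum()             # scaled stochastic gradient
    # Running estimate of the (total) Fisher information.
    I_hat = kappa * I_hat + (1 - kappa) * N * g_i.var()
    # Preconditioned Langevin step: curvature scales both drift and noise.
    theta += 0.5 * eps * g / I_hat + rng.normal(0.0, np.sqrt(eps / I_hat))
    return theta, I_hat

theta, I_hat = 0.0, float(N)            # start Fisher estimate at a sane scale
for _ in range(5000):
    mb = x[rng.integers(0, N, size=100)]
    theta, I_hat = sgfs_step(theta, mb, N, I_hat)
print(theta)  # near the posterior mean, ~2.0
```

Because the drift is rescaled by the inverse Fisher estimate, the chain can take near-Newton-sized steps during burn-in (the Fisher-scoring behavior mentioned above) instead of the tiny steps plain SGLD needs.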

    Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget

    Can we make Bayesian posterior MCMC sampling more efficient when faced with very large datasets? We argue that computing the likelihood for N datapoints twice in order to reach a single binary decision is computationally inefficient. We introduce an approximate Metropolis-Hastings rule based on a sequential hypothesis test which allows us to accept or reject samples with high confidence using only a fraction of the data required for the exact MH rule. While this introduces an asymptotic bias, we show that this bias can be controlled and is more than offset by a decrease in variance due to our ability to draw more samples per unit of time. We show that the same idea can also be applied to Gibbs sampling in densely connected graphs.
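The sequential-test MH rule can be sketched as follows: exact MH accepts when the mean per-item log-likelihood difference exceeds a threshold derived from the uniform draw, so a growing minibatch plus a z-test can decide early when the evidence is one-sided. This is a hypothetical sketch (symmetric proposal assumed; the confidence level, batch growth schedule, and finite-population correction are this sketch's choices, not necessarily the paper's):

```python
import math
import numpy as np

rng = np.random.default_rng(3)

def approx_mh_accept(theta_new, theta_old, x, loglik, log_prior,
                     eps=0.05, batch=100):
    """Approximate MH accept/reject via a sequential test: examine growing
    minibatches of per-item log-likelihood differences until the decision
    is clear at error level eps (or the data are exhausted)."""
    N = len(x)
    u = rng.random()
    # Exact MH (symmetric proposal) accepts iff mean_i l_i > mu0.
    mu0 = (math.log(u) + log_prior(theta_old) - log_prior(theta_new)) / N
    perm = rng.permutation(N)
    seen = np.empty(0)
    while True:
        idx = perm[len(seen):len(seen) + batch]
        l = loglik(x[idx], theta_new) - loglik(x[idx], theta_old)
        seen = np.concatenate([seen, l])
        n = len(seen)
        mean, sd = seen.mean(), seen.std(ddof=1)
        if n == N or sd == 0:
            return mean > mu0, n
        # Standard error with a finite-population correction.
        se = sd / math.sqrt(n) * math.sqrt(1 - (n - 1) / (N - 1))
        p_err = 0.5 * math.erfc(abs(mean - mu0) / se / math.sqrt(2))
        if p_err < eps:                 # confident enough to decide
            return mean > mu0, n

# Toy model: x_i ~ N(theta, 1), flat prior, proposal 0.0 -> 1.5.
N = 100_000
x = rng.normal(1.0, 1.0, size=N)
loglik = lambda xs, th: -0.5 * (xs - th) ** 2
log_prior = lambda th: 0.0
accept, n_used = approx_mh_accept(1.5, 0.0, x, loglik, log_prior)
print(accept, n_used)  # decides using only a small fraction of the data
```

When the proposal is clearly better or clearly worse, the test terminates after a few hundred items instead of all N, which is the budget saving the title refers to; borderline proposals force larger minibatches, which is where the controllable bias enters.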