13 research outputs found
Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring
In this paper we address the following question: Can we approximately sample
from a Bayesian posterior distribution if we are only allowed to touch a small
mini-batch of data-items for every sample we generate?. An algorithm based on
the Langevin equation with stochastic gradients (SGLD) was previously proposed
to solve this, but its mixing rate was slow. By leveraging the Bayesian Central
Limit Theorem, we extend the SGLD algorithm so that at high mixing rates it
will sample from a normal approximation of the posterior, while for slow mixing
rates it will mimic the behavior of SGLD with a pre-conditioner matrix. As a
bonus, the proposed algorithm is reminiscent of Fisher scoring (with stochastic
gradients) and as such an efficient optimizer during burn-in.Comment: Appears in Proceedings of the 29th International Conference on
Machine Learning (ICML 2012
Bayesian Dark Knowledge
We consider the problem of Bayesian parameter estimation for deep neural
networks, which is important in problem settings where we may have little data,
and/ or where we need accurate posterior predictive densities, e.g., for
applications involving bandits or active learning. One simple approach to this
is to use online Monte Carlo methods, such as SGLD (stochastic gradient
Langevin dynamics). Unfortunately, such a method needs to store many copies of
the parameters (which wastes memory), and needs to make predictions using many
versions of the model (which wastes time).
We describe a method for "distilling" a Monte Carlo approximation to the
posterior predictive density into a more compact form, namely a single deep
neural network. We compare to two very recent approaches to Bayesian neural
networks, namely an approach based on expectation propagation [Hernandez-Lobato
and Adams, 2015] and an approach based on variational Bayes [Blundell et al.,
2015]. Our method performs better than both of these, is much simpler to
implement, and uses less computation at test time.Comment: final version submitted to NIPS 201
Speed/accuracy trade-offs for modern convolutional object detectors
The goal of this paper is to serve as a guide for selecting a detection
architecture that achieves the right speed/memory/accuracy balance for a given
application and platform. To this end, we investigate various ways to trade
accuracy for speed and memory usage in modern convolutional object detection
systems. A number of successful systems have been proposed in recent years, but
apples-to-apples comparisons are difficult due to different base feature
extractors (e.g., VGG, Residual Networks), different default image resolutions,
as well as different hardware and software platforms. We present a unified
implementation of the Faster R-CNN [Ren et al., 2015], R-FCN [Dai et al., 2016]
and SSD [Liu et al., 2015] systems, which we view as "meta-architectures" and
trace out the speed/accuracy trade-off curve created by using alternative
feature extractors and varying other critical parameters such as image size
within each of these meta-architectures. On one extreme end of this spectrum
where speed and memory are critical, we present a detector that achieves real
time speeds and can be deployed on a mobile device. On the opposite end in
which accuracy is critical, we present a detector that achieves
state-of-the-art performance measured on the COCO detection task.Comment: Accepted to CVPR 201
Recommended from our members
Approximate Markov Chain Monte Carlo Algorithms for Large Scale Bayesian Inference
Traditional algorithms for Bayesian posterior inference require processing the entire dataset in each iteration and are quickly getting obsoleted by the proliferation of massive datasets in various application domains. Most successful applications of learning with big data have been with simple minibatch-based algorithms such as Stochastic Gradient Descent, because they are the only ones that can computationally handle today's large datasets. However, by restricting ourselves to these algorithms, we miss out on all the advantages of Bayesian modeling, such as controlling over-fitting, estimating uncertainty and the ability to incorporate prior knowledge. In this thesis, we attempt to scale up Bayesian posterior inference to large datasets by developing a new generation of approximate Markov Chain Monte Carlo algorithms that process only a mini-batch of data to generate each posterior sample. The approximation introduces a bias in the stationary distribution of the Markov chain, but we show that this bias is more than compensated by accelerated burn-in and lower variance due to the ability to generate a larger number of samples per unit of computational time.Our main contributions are the following. First, we develop a fast Metropolis-Hastings (MH) algorithm by approximating each accept/reject decision using a sequential hypothesis test that processes only an adaptive mini-batch of data instead of the complete dataset. Then, we show that the same idea can be used to speed up the slice sampling algorithm. Next, we present a theoretical analysis of Stochastic Gradient Langevin Dynamics (SGLD), a posterior sampling algorithm derived by adding Gaussian noise to Stochastic Gradient Ascent updates. We also show that the bias in SGLD can be reduced by combining it with our approximate MH test. We then propose a new algorithm called Stochastic Gradient Fisher Scoring (SGFS) which improves the mixing rate of SGLD using a preconditioning matrix that captures the curvature of the posterior distribution. Finally, we develop an efficient algorithm for Bayesian Probabilistic Matrix Factorization using a combination of SGLD and approximate Metropolis-Hastings updates
Approximate Markov Chain Monte Carlo Algorithms for Large Scale Bayesian Inference
Traditional algorithms for Bayesian posterior inference require processing the entire dataset in each iteration and are quickly getting obsoleted by the proliferation of massive datasets in various application domains. Most successful applications of learning with big data have been with simple minibatch-based algorithms such as Stochastic Gradient Descent, because they are the only ones that can computationally handle today's large datasets. However, by restricting ourselves to these algorithms, we miss out on all the advantages of Bayesian modeling, such as controlling over-fitting, estimating uncertainty and the ability to incorporate prior knowledge. In this thesis, we attempt to scale up Bayesian posterior inference to large datasets by developing a new generation of approximate Markov Chain Monte Carlo algorithms that process only a mini-batch of data to generate each posterior sample. The approximation introduces a bias in the stationary distribution of the Markov chain, but we show that this bias is more than compensated by accelerated burn-in and lower variance due to the ability to generate a larger number of samples per unit of computational time.Our main contributions are the following. First, we develop a fast Metropolis-Hastings (MH) algorithm by approximating each accept/reject decision using a sequential hypothesis test that processes only an adaptive mini-batch of data instead of the complete dataset. Then, we show that the same idea can be used to speed up the slice sampling algorithm. Next, we present a theoretical analysis of Stochastic Gradient Langevin Dynamics (SGLD), a posterior sampling algorithm derived by adding Gaussian noise to Stochastic Gradient Ascent updates. We also show that the bias in SGLD can be reduced by combining it with our approximate MH test. We then propose a new algorithm called Stochastic Gradient Fisher Scoring (SGFS) which improves the mixing rate of SGLD using a preconditioning matrix that captures the curvature of the posterior distribution. Finally, we develop an efficient algorithm for Bayesian Probabilistic Matrix Factorization using a combination of SGLD and approximate Metropolis-Hastings updates
Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget
Can we make Bayesian posterior MCMC sampling more efficient when faced with very large datasets? We argue that computing the likelihood for N datapoints twice in order to reach a single binary decision is computationally inefficient. We introduce an approximate Metropolis-Hastings rule based on a sequential hypothesis test which allows us to accept or reject samples with high confidence using only a fraction of the data required for the exact MH rule. While this introduces an asymptotic bias, we show that this bias can be controlled and is more than offset by a decrease in variance due to our ability to draw more samples per unit of time. We show that the same idea can also be applied to Gibbs sampling in densely connected graphs.