Stochastic Gradient Descent as Approximate Bayesian Inference
Stochastic Gradient Descent with a constant learning rate (constant SGD)
simulates a Markov chain with a stationary distribution. With this perspective,
we derive several new results. (1) We show that constant SGD can be used as an
approximate Bayesian posterior inference algorithm. Specifically, we show how
to adjust the tuning parameters of constant SGD to best match the stationary
distribution to a posterior, minimizing the Kullback-Leibler divergence between
these two distributions. (2) We demonstrate that constant SGD gives rise to a
new variational EM algorithm that optimizes hyperparameters in complex
probabilistic models. (3) We also propose SGD with momentum for sampling and
show how to adjust the damping coefficient accordingly. (4) We analyze MCMC
algorithms. For Langevin Dynamics and Stochastic Gradient Fisher Scoring, we
quantify the approximation errors due to finite learning rates. Finally (5), we
use the stochastic process perspective to give a short proof of why Polyak
averaging is optimal. Based on this idea, we propose a scalable approximate
MCMC algorithm, the Averaged Stochastic Gradient Sampler.
Comment: 35 pages, published version (JMLR 2017)
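As a rough, self-contained illustration of the idea (not the paper's exact KL-based tuning rule), the sketch below runs constant-rate SGD on a Bayesian linear regression whose Gaussian posterior is known in closed form, keeps the post-burn-in iterates as approximate posterior samples, and compares their average (Polyak averaging) with the exact posterior mean; the learning rate, batch size, and data are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Bayesian linear regression, where the exact Gaussian posterior is available for comparison
N, d, noise, prior_prec = 1000, 2, 0.5, 1.0
X = rng.standard_normal((N, d))
y = X @ np.array([1.0, -2.0]) + noise * rng.standard_normal(N)
post_cov = np.linalg.inv(X.T @ X / noise**2 + prior_prec * np.eye(d))
post_mean = post_cov @ X.T @ y / noise**2

# constant-rate SGD on the negative log-posterior; iterates after burn-in are kept as samples
w, lr, batch = np.zeros(d), 1e-4, 32
iterates = []
for t in range(20000):
    idx = rng.integers(0, N, batch)
    grad = (N / batch) * X[idx].T @ (X[idx] @ w - y[idx]) / noise**2 + prior_prec * w
    w = w - lr * grad
    if t >= 2000:
        iterates.append(w)

iterates = np.array(iterates)
# the averaged iterate (Polyak averaging) estimates the posterior mean well;
# how closely the spread of the iterates matches post_cov depends on tuning lr,
# which is what the paper's KL-matching criterion addresses
print("posterior mean :", post_mean)
print("iterate average:", iterates.mean(axis=0))
```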
A Variational Analysis of Stochastic Gradient Algorithms
Stochastic Gradient Descent (SGD) is an important algorithm in machine
learning. With constant learning rates, it is a stochastic process that, after
an initial phase of convergence, generates samples from a stationary
distribution. We show that SGD with constant rates can be effectively used as
an approximate posterior inference algorithm for probabilistic modeling.
Specifically, we show how to adjust the tuning parameters of SGD so as to
match the resulting stationary distribution to the posterior. This analysis
rests on interpreting SGD as a continuous-time stochastic process and then
minimizing the Kullback-Leibler divergence between its stationary distribution
and the target posterior. (This is in the spirit of variational inference.) In
more detail, we model SGD as a multivariate Ornstein-Uhlenbeck process and then
use properties of this process to derive the optimal parameters. This
theoretical framework also connects SGD to modern scalable inference
algorithms; we analyze the recently proposed stochastic gradient Fisher scoring
under this perspective. We demonstrate that SGD with properly chosen constant
rates gives a new way to optimize hyperparameters in probabilistic models.
Comment: 8 pages, 3 figures
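The Ornstein-Uhlenbeck fact this analysis rests on is easy to check numerically. In the sketch below, A and B are arbitrary illustrative matrices standing in for the loss curvature and the rescaled gradient-noise factor; the stationary covariance S of d(theta) = -A theta dt + B dW solves the Lyapunov equation A S + S A^T = B B^T, which is what lets one choose the learning rate so that S matches a target posterior covariance.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(0)

# Ornstein-Uhlenbeck model d(theta) = -A theta dt + B dW; A stands in for the loss
# curvature and B for the rescaled gradient-noise factor (both made up here)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
B = np.array([[0.8, 0.0], [0.3, 0.5]])

# the stationary covariance S solves the Lyapunov equation  A S + S A^T = B B^T
S = solve_continuous_lyapunov(A, B @ B.T)

# sanity check: covariance of many independent Euler-Maruyama chains after burn-in
n_chains, n_steps, dt = 5000, 5000, 1e-3
theta = np.zeros((n_chains, 2))
for _ in range(n_steps):
    noise = rng.standard_normal((n_chains, 2)) @ B.T * np.sqrt(dt)
    theta = theta - theta @ A.T * dt + noise

print("Lyapunov solution:\n", S)
print("empirical covariance:\n", theta.T @ theta / n_chains)
```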
Privacy-Preserving Deep Learning via Weight Transmission
This paper considers the scenario that multiple data owners wish to apply a
machine learning method over the combined dataset of all owners to obtain the
best possible learning output but do not want to share the local datasets owing
to privacy concerns. We design systems for the scenario that the stochastic
gradient descent (SGD) algorithm is used as the machine learning method because
SGD (or its variants) is at the heart of recent deep learning techniques over
neural networks. Our systems differ from existing systems in the following
features: (1) any activation function can be used, meaning that no
privacy-preserving-friendly approximation is required; (2) gradients
computed by SGD are not shared but the weight parameters are shared instead;
and (3) robustness against colluding parties even in the extreme case
that only one honest party exists. We prove that our systems, while
privacy-preserving, achieve the same learning accuracy as SGD and hence retain
the merit of deep learning with respect to accuracy. Finally, we conduct
several experiments using benchmark datasets, and show that our systems
outperform previous systems in terms of learning accuracy.
Comment: Full version of a conference paper at NSS 201
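A schematic, unsecured sketch of the weight-transmission idea follows: three simulated data owners take turns running plain SGD on their own logistic-regression shard and pass along only the current weight vector, never gradients or raw data. The datasets, learning rate, and round structure are made up for illustration, and the encryption and secure channels of the actual systems are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# three data owners, each holding a private shard of a logistic-regression dataset
d, n_per = 5, 300
w_true = rng.standard_normal(d)
shards = []
for _ in range(3):
    X = rng.standard_normal((n_per, d))
    y = (sigmoid(X @ w_true) > rng.random(n_per)).astype(float)
    shards.append((X, y))

def local_sgd(w, X, y, lr=0.1, epochs=1, batch=32):
    """One owner's local update: plain SGD on its own shard only."""
    w = w.copy()
    for _ in range(epochs):
        perm = rng.permutation(len(y))
        for i in range(0, len(y), batch):
            idx = perm[i:i + batch]
            g = X[idx].T @ (sigmoid(X[idx] @ w) - y[idx]) / len(idx)
            w -= lr * g
    return w

# weight transmission: only the current weight vector moves between parties,
# never gradients or raw data (transport security / encryption omitted here)
w = np.zeros(d)
for round_ in range(30):
    for X, y in shards:
        w = local_sgd(w, X, y)

print("cosine similarity to generating weights:",
      float(w @ w_true / (np.linalg.norm(w) * np.linalg.norm(w_true))))
```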
Solving differential equations with unknown constitutive relations as recurrent neural networks
We solve a system of ordinary differential equations with an unknown
functional form of a sink (reaction rate) term. We assume that the measurements
(time series) of state variables are partially available, and we use a
recurrent neural network to "learn" the reaction rate from this data. This is
achieved by including the discretized ordinary differential equations as part of a recurrent
neural network training problem. We extend TensorFlow's recurrent neural
network architecture to create a simple but scalable and effective solver for
the unknown functions, and apply it to a fed-batch bioreactor simulation
problem. Use of techniques from recent deep learning literature enables
training of functions with behavior manifesting over thousands of time steps.
Our networks are structurally similar to recurrent neural networks, but
differences in design and function require modifications to the conventional
wisdom about training such networks.
Comment: 19 pages, 8 figures
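A minimal sketch of the construction, with made-up dynamics rather than the paper's fed-batch bioreactor model: a one-state ODE dc/dt = inflow(t) - r(c) is discretized with Euler steps, the unknown rate r is a small network, and the unrolled discretization is trained by backpropagation through time exactly as one would train an RNN (written in PyTorch here rather than the paper's TensorFlow setup).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# toy one-state system dc/dt = inflow(t) - r(c), where the reaction-rate term r is unknown
dt, n_steps = 0.05, 120
t_grid = torch.arange(n_steps) * dt
inflow = 0.5 * (1.0 + torch.sin(t_grid))        # known forcing term

def true_rate(c):                               # used only to generate synthetic "measurements"
    return 2.0 * c / (1.0 + c)

with torch.no_grad():
    c, obs = torch.zeros(()), []
    for k in range(n_steps):
        c = c + dt * (inflow[k] - true_rate(c))
        obs.append(c.clone())
    obs = torch.stack(obs)

# the unknown rate is a small network; the Euler-discretized ODE plays the role of a
# recurrent cell, so fitting it is backpropagation through time, as in an RNN
rate_net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1), nn.Softplus())
opt = torch.optim.Adam(rate_net.parameters(), lr=1e-2)

for epoch in range(200):
    c, preds = torch.zeros(()), []
    for k in range(n_steps):
        c = c + dt * (inflow[k] - rate_net(c.view(1, 1)).view(()))
        preds.append(c)
    loss = ((torch.stack(preds) - obs) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("trajectory MSE after training:", loss.item())
```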
Stochastic Backpropagation and Approximate Inference in Deep Generative Models
We marry ideas from deep neural networks and approximate Bayesian inference
to derive a generalised class of deep, directed generative models, endowed with
a new algorithm for scalable inference and learning. Our algorithm introduces a
recognition model to represent approximate posterior distributions, which
acts as a stochastic encoder of the data. We develop stochastic
back-propagation -- rules for back-propagation through stochastic variables --
and use this to develop an algorithm that allows for joint optimisation of the
parameters of both the generative and recognition model. We demonstrate on
several real-world data sets that the model generates realistic samples,
provides accurate imputations of missing data and is a useful tool for
high-dimensional data visualisation.
Comment: Appears in Proceedings of the 31st International Conference on Machine Learning (ICML), JMLR: W&CP volume 32, 2014
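The core stochastic back-propagation rule can be shown in isolation. In the hedged sketch below, a Gaussian variational distribution is reparameterized as z = mu + sigma * eps so that gradients of a Monte Carlo ELBO flow into mu and sigma; the "likelihood" term and all parameter values are stand-ins, and the paper's recognition network and deep generative model are not included.

```python
import torch

torch.manual_seed(0)

def log_lik(z):
    # stand-in for log p(x | z) of a generative model; here a fixed Gaussian around z = 2
    return -0.5 * ((z - 2.0) ** 2).sum(dim=1)

mu = torch.zeros(1, 3, requires_grad=True)
log_sigma = torch.zeros(1, 3, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(500):
    eps = torch.randn(256, 3)                  # base noise, independent of the parameters
    z = mu + torch.exp(log_sigma) * eps        # reparameterized sample z ~ N(mu, sigma^2)
    # stochastic back-propagation: gradients flow through the sample into mu and sigma
    kl = 0.5 * (mu ** 2 + torch.exp(2 * log_sigma) - 1.0 - 2 * log_sigma).sum()
    elbo = log_lik(z).mean() - kl              # ELBO with a standard-normal prior on z
    loss = -elbo
    opt.zero_grad()
    loss.backward()
    opt.step()

print("q mean:", mu.detach().numpy(), "q std:", torch.exp(log_sigma).detach().numpy())
```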
Neural Stochastic Differential Equations: Deep Latent Gaussian Models in the Diffusion Limit
In deep latent Gaussian models, the latent variable is generated by a
time-inhomogeneous Markov chain, where at each time step we pass the current
state through a parametric nonlinear map, such as a feedforward neural net, and
add a small independent Gaussian perturbation. This work considers the
diffusion limit of such models, where the number of layers tends to infinity,
while the step size and the noise variance tend to zero. The limiting latent
object is an Itô diffusion process that solves a stochastic differential
equation (SDE) whose drift and diffusion coefficients are implemented by neural
nets. We develop a variational inference framework for these neural
SDEs via stochastic automatic differentiation in Wiener space, where the
variational approximations to the posterior are obtained by Girsanov
(mean-shift) transformation of the standard Wiener process and the computation
of gradients is based on the theory of stochastic flows. This permits the use
of black-box SDE solvers and automatic differentiation for end-to-end
inference. Experimental results with synthetic data are provided.
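For the generative direction only, a minimal sketch: drift and (diagonal) diffusion are small networks and paths are simulated with Euler-Maruyama. The network shapes and step sizes are arbitrary, and the paper's inference machinery (Girsanov mean-shift variational posteriors, stochastic flows, black-box SDE solvers) is not shown.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d = 2
drift_net = nn.Sequential(nn.Linear(d, 32), nn.Tanh(), nn.Linear(32, d))
diff_net = nn.Sequential(nn.Linear(d, 32), nn.Tanh(), nn.Linear(32, d), nn.Softplus())

def sample_paths(z0, n_steps=50, dt=0.02):
    """Euler-Maruyama simulation of dZ = drift(Z) dt + diag(diffusion(Z)) dW."""
    z = z0
    for _ in range(n_steps):
        dw = torch.randn_like(z) * dt ** 0.5
        z = z + drift_net(z) * dt + diff_net(z) * dw
    return z

z_T = sample_paths(torch.randn(128, d))   # a batch of latent samples at the final time
print(z_T.shape)
```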
A Piecewise Deterministic Markov Process via swaps in hyperspherical coordinates
Recently, a class of stochastic processes known as piecewise deterministic
Markov processes has been used to define continuous-time Markov chain Monte
Carlo algorithms with a number of attractive properties, including
compatibility with stochastic gradients like those typically found in
optimization and variational inference, and high efficiency on certain big data
problems. Not many processes in this class that are capable of targeting
arbitrary invariant distributions are currently known, and within one subclass
all previously known processes utilize linear transition functions. In this
work, we derive a process whose transition function is nonlinear through
solving its Fokker-Planck equation in hyperspherical coordinates. We explore
its behavior on Gaussian targets, as well as a Bayesian logistic regression
model with synthetic data. We discuss implications both for the theory of
piecewise deterministic Markov processes and for Bayesian statisticians and
physicists seeking to use them for simulation-based computation.
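For context, the linear-transition subclass mentioned above is easy to simulate exactly. The sketch below runs a one-dimensional zig-zag-type process with constant velocity and switching rate max(0, v*x), which leaves the standard Gaussian invariant; event times are sampled by inverting the integrated rate, and time averages along the deterministic segments estimate the target's moments. This illustrates the existing linear-dynamics processes, not the nonlinear hyperspherical process derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# zig-zag-type process for a standard Gaussian target exp(-x^2 / 2): constant-velocity
# (linear) dynamics, with the velocity flipping at events of rate max(0, v * x)
x, v = 0.0, 1.0
t_total, sum_x, sum_x2 = 0.0, 0.0, 0.0

for _ in range(200_000):
    a = v * x                                        # rate along the segment is max(0, a + s)
    e = rng.exponential()
    tau = np.sqrt(max(a, 0.0) ** 2 + 2.0 * e) - a    # exact event time from the rate integral
    # accumulate time integrals of x and x^2 along the deterministic segment
    t_total += tau
    sum_x += x * tau + 0.5 * v * tau ** 2
    sum_x2 += x ** 2 * tau + x * v * tau ** 2 + tau ** 3 / 3.0
    x += v * tau                                     # move to the event, then flip the velocity
    v = -v

print("time-average mean:", sum_x / t_total, " variance:", sum_x2 / t_total)  # ~0 and ~1
```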
Forward-Backward Stochastic Neural Networks: Deep Learning of High-dimensional Partial Differential Equations
Classical numerical methods for solving partial differential equations suffer
from the curse of dimensionality, mainly due to their reliance on meticulously
generated spatio-temporal grids. Inspired by modern deep learning based
techniques for solving forward and inverse problems associated with partial
differential equations, we circumvent the tyranny of numerical discretization
by devising an algorithm that is scalable to high-dimensions. In particular, we
approximate the unknown solution by a deep neural network which essentially
enables us to benefit from the merits of automatic differentiation. To train
the aforementioned neural network we leverage the well-known connection between
high-dimensional partial differential equations and forward-backward stochastic
differential equations. In fact, independent realizations of a standard
Brownian motion will act as training data. We test the effectiveness of our
approach for a couple of benchmark problems spanning a number of scientific
domains including Black-Scholes-Barenblatt and Hamilton-Jacobi-Bellman
equations, both in 100 dimensions.
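A hedged, much-reduced instance of the training idea: for the heat equation u_t + 0.5*Laplacian(u) = 0 with terminal condition u(T, x) = ||x||^2 (exact solution ||x||^2 + d*(T - t)), the forward process is plain Brownian motion and the associated BSDE is dY = Z . dW with Y = u(t, X) and Z = grad_x u(t, X). The network u_net, its size, and all step counts are illustrative; the paper's examples use nonlinear generators (Black-Scholes-Barenblatt, Hamilton-Jacobi-Bellman) and 100 dimensions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# toy instance: u_t + 0.5 * Laplacian(u) = 0 with u(T, x) = ||x||^2, whose exact solution
# is u(t, x) = ||x||^2 + d * (T - t); along Brownian paths the associated BSDE is dY = Z . dW
d, T, n_steps, n_paths = 5, 1.0, 20, 256
dt = T / n_steps

u_net = nn.Sequential(nn.Linear(d + 1, 64), nn.Tanh(),
                      nn.Linear(64, 64), nn.Tanh(),
                      nn.Linear(64, 1))
opt = torch.optim.Adam(u_net.parameters(), lr=1e-3)

def g(x):                                            # terminal condition
    return (x ** 2).sum(dim=1, keepdim=True)

def u_and_grad(t, x):
    x = x.clone().requires_grad_(True)
    tx = torch.cat([torch.full((x.shape[0], 1), t), x], dim=1)
    y = u_net(tx)
    z = torch.autograd.grad(y.sum(), x, create_graph=True)[0]   # Z = grad_x u
    return y, z

for step in range(1000):
    x, t = torch.zeros(n_paths, d), 0.0              # all paths start at the origin
    y, z = u_and_grad(t, x)
    loss = torch.zeros(())
    for n in range(n_steps):
        dw = dt ** 0.5 * torch.randn(n_paths, d)     # Brownian increments act as training data
        y_pred = y + (z * dw).sum(dim=1, keepdim=True)           # discretized BSDE step
        x, t = x + dw, t + dt                        # forward process: plain Brownian motion
        y, z = u_and_grad(t, x)
        loss = loss + ((y - y_pred) ** 2).mean()
    loss = loss + ((y - g(x)) ** 2).mean()           # enforce the terminal condition
    opt.zero_grad()
    loss.backward()
    opt.step()

u0 = u_net(torch.cat([torch.zeros(1, 1), torch.zeros(1, d)], dim=1)).item()
print("u(0, 0) learned:", u0, " exact:", d * T)
```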
GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration
Despite advances in scalable models, the inference tools used for Gaussian
processes (GPs) have yet to fully capitalize on developments in computing
hardware. We present an efficient and general approach to GP inference based on
Blackbox Matrix-Matrix multiplication (BBMM). BBMM inference uses a modified
batched version of the conjugate gradients algorithm to derive all terms for
training and inference in a single call. BBMM reduces the asymptotic complexity
of exact GP inference from O(n^3) to O(n^2). Adapting this algorithm to
scalable approximations and complex GP models simply requires a routine for
efficient matrix-matrix multiplication with the kernel and its derivative. In
addition, BBMM uses a specialized preconditioner to substantially speed up
convergence. In experiments we show that BBMM effectively uses GPU hardware to
dramatically accelerate both exact GP inference and scalable approximations.
Additionally, we provide GPyTorch, a software platform for scalable GP
inference via BBMM, built on PyTorch.
Comment: NeurIPS 2018
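The computational core is solving kernel systems with several right-hand sides using only matrix-matrix products. Below is a plain batched conjugate-gradients sketch in that spirit; it is not GPyTorch's modified mBCG routine (which additionally returns Lanczos/tridiagonal quantities for log-determinant terms), and the kernel matrix here is a random SPD stand-in.

```python
import numpy as np

def batched_cg(matmul, Y, tol=1e-8, max_iter=200):
    """Solve K X = Y for several right-hand sides at once, using only
    matrix-matrix products with K (matmul(V) should return K @ V)."""
    X = np.zeros_like(Y)
    R = Y - matmul(X)                      # residuals, one column per right-hand side
    P = R.copy()
    rs = np.sum(R * R, axis=0)
    for _ in range(max_iter):
        KP = matmul(P)
        alpha = rs / np.sum(P * KP, axis=0)
        X += alpha * P
        R -= alpha * KP
        rs_new = np.sum(R * R, axis=0)
        if np.all(np.sqrt(rs_new) < tol):
            break
        P = R + (rs_new / rs) * P
        rs = rs_new
    return X

# toy check on a random SPD "kernel" matrix
rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)
Y = rng.standard_normal((n, 3))            # several right-hand sides, e.g. targets and probe vectors
X = batched_cg(lambda V: K @ V, Y)
print("max residual:", np.max(np.abs(K @ X - Y)))
```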
Advances in Variational Inference
Many modern unsupervised or semi-supervised machine learning algorithms rely
on Bayesian probabilistic models. These models are usually intractable and thus
require approximate inference. Variational inference (VI) lets us approximate a
high-dimensional Bayesian posterior with a simpler variational distribution by
solving an optimization problem. This approach has been successfully used in
various models and large-scale applications. In this review, we give an
overview of recent trends in variational inference. We first introduce standard
mean field variational inference, then review recent advances focusing on the
following aspects: (a) scalable VI, which includes stochastic approximations,
(b) generic VI, which extends the applicability of VI to a large class of
otherwise intractable models, such as non-conjugate models, (c) accurate VI,
which includes variational models beyond the mean field approximation or with
atypical divergences, and (d) amortized VI, which implements the inference over
local latent variables with inference networks. Finally, we provide a summary
of promising future research directions.
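As one concrete instance of the "generic VI" methods the review covers, the sketch below fits a Gaussian q to a made-up one-dimensional unnormalized target using the score-function (REINFORCE) gradient of the ELBO, E_q[(log p_tilde(z) - log q(z)) * grad log q(z)]; the target, step size, and sample size are all illustrative and untuned.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p_tilde(z):                     # unnormalized target: a non-Gaussian posterior stand-in
    return -0.25 * z ** 4 - 0.5 * z ** 2

m, rho = 0.0, 0.0                       # variational parameters, q = Normal(m, exp(rho)^2)
lr, S = 0.05, 200                       # step size and Monte Carlo sample size

for it in range(2000):
    s = np.exp(rho)
    z = m + s * rng.standard_normal(S)
    log_q = -0.5 * np.log(2 * np.pi) - rho - 0.5 * ((z - m) / s) ** 2
    f = log_p_tilde(z) - log_q          # integrand of the ELBO
    score_m = (z - m) / s ** 2          # d log q / d m
    score_rho = ((z - m) / s) ** 2 - 1.0   # d log q / d rho
    # score-function (REINFORCE) estimate of the ELBO gradient
    m += lr * np.mean(f * score_m)
    rho += lr * np.mean(f * score_rho)

print("variational mean and std:", m, np.exp(rho))
```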