Enabling scalable stochastic gradient-based inference for Gaussian processes by employing the Unbiased LInear System SolvEr (ULISSE)
In applications of Gaussian processes where quantification of uncertainty is
of primary interest, it is necessary to accurately characterize the posterior
distribution over covariance parameters. This paper proposes an adaptation of
the Stochastic Gradient Langevin Dynamics algorithm to draw samples from the
posterior distribution over covariance parameters with negligible bias and
without the need to compute the marginal likelihood. In Gaussian process
regression, this has the enormous advantage that stochastic gradients can be
computed by solving linear systems only. A novel unbiased linear systems solver
based on parallelizable covariance matrix-vector products is developed to
accelerate the unbiased estimation of gradients. The results demonstrate that
scalable and exact (in a Monte Carlo sense) quantification of uncertainty in
Gaussian processes is possible without imposing any special structure on the
covariance or reducing the number of input vectors.
Comment: 10 pages - paper accepted at ICML 201
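The key identity is that the gradient of the GP log-marginal likelihood reduces to linear solves plus a trace term, and the trace can be estimated unbiasedly with probe vectors. A minimal sketch in Python, assuming an RBF kernel; the dense `np.linalg.solve` calls stand in for the paper's unbiased matvec-based solver, and all function names here are illustrative:

```python
import numpy as np

def rbf_kernel(X, lengthscale, variance=1.0):
    """Squared-exponential kernel matrix and its derivative w.r.t. the lengthscale."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = variance * np.exp(-0.5 * d2 / lengthscale ** 2)
    dK = K * d2 / lengthscale ** 3          # dK / d(lengthscale)
    return K, dK

def stochastic_grad_loglik(X, y, lengthscale, noise=0.1, n_probes=4, rng=None):
    """Unbiased estimate of d log p(y) / d(lengthscale) using linear solves only.

    Quadratic part: 0.5 * a^T dK a, with a = (K + noise*I)^{-1} y.
    Trace part: -0.5 * tr((K + noise*I)^{-1} dK), estimated with Rademacher
    probe vectors (Hutchinson) so that each term reduces to a linear solve.
    """
    rng = rng or np.random.default_rng(0)
    K, dK = rbf_kernel(X, lengthscale)
    A = K + noise * np.eye(len(y))
    a = np.linalg.solve(A, y)               # the paper substitutes its unbiased solver here
    quad = 0.5 * a @ dK @ a
    z = rng.choice([-1.0, 1.0], size=(n_probes, len(y)))
    tr_est = np.mean([zi @ np.linalg.solve(A, dK @ zi) for zi in z])
    return quad - 0.5 * tr_est

def sgld_step(theta, grad_logpost, stepsize, rng):
    """One Stochastic Gradient Langevin Dynamics update of a scalar parameter."""
    return theta + 0.5 * stepsize * grad_logpost + rng.normal(0.0, np.sqrt(stepsize))
```

With enough probes the stochastic gradient concentrates around the exact gradient, so the SGLD chain targets the covariance-parameter posterior with negligible bias.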
Bayesian Inference of Log Determinants
The log-determinant of a kernel matrix appears in a variety of machine
learning problems, ranging from determinantal point processes and generalized
Markov random fields, through to the training of Gaussian processes. Exact
calculation of this term is often intractable when the size of the kernel
matrix exceeds a few thousand. In the spirit of probabilistic numerics, we
reinterpret the problem of computing the log-determinant as a Bayesian
inference problem. In particular, we combine prior knowledge in the form of
bounds from matrix theory and evidence derived from stochastic trace estimation
to obtain probabilistic estimates for the log-determinant and its associated
uncertainty within a given computational budget. Beyond its novelty and
theoretic appeal, the performance of our proposal is competitive with
state-of-the-art approaches to approximating the log-determinant, while also
quantifying the uncertainty due to budget-constrained evidence.
Comment: 12 pages, 3 figures
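The stochastic-trace-estimation evidence at the heart of this approach can be sketched as follows: since log det(K) = tr(log K), one can expand the matrix logarithm in a series and estimate the trace with Rademacher probes, touching K only through matrix-vector products. This sketch assumes the matrix has been rescaled so its eigenvalues lie in (0, 2); the paper's Bayesian combination of this evidence with matrix-theoretic bounds is not reproduced here:

```python
import numpy as np

def logdet_hutchinson(K, n_probes=30, order=50, rng=None):
    """Stochastic estimate of log det(K) = tr(log K) via matvecs only.

    Assumes K is SPD with eigenvalues in (0, 2), so the series
    log K = -sum_{m>=1} (I - K)^m / m converges; in practice K is first
    rescaled using an eigenvalue bound.
    """
    rng = rng or np.random.default_rng(0)
    n = K.shape[0]
    B = np.eye(n) - K
    total = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe vector
        v = z.copy()
        acc = 0.0
        for m in range(1, order + 1):
            v = B @ v                          # (I - K)^m z via repeated matvecs
            acc -= z @ v / m                   # -z^T (I-K)^m z / m
        total += acc
    return total / n_probes
```

Each probe contributes an unbiased estimate of the truncated series, so the budget (number of probes and series order) directly controls the uncertainty the paper quantifies.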
MCMC for variationally sparse Gaussian processes
Gaussian process (GP) models form a core part of probabilistic machine
learning. Considerable research effort has been made into attacking three
issues with GP models: how to compute efficiently when the number of data
points is large; how to approximate the posterior when the likelihood is not
Gaussian; and how to estimate covariance function parameter posteriors. This
paper addresses all three simultaneously, using a variational approximation to the
posterior which is sparse in support of the function but otherwise free-form.
The result is a Hybrid Monte-Carlo sampling scheme which allows for a
non-Gaussian approximation over the function values and covariance parameters
simultaneously, with efficient computations based on inducing-point sparse GPs.
Code to replicate each experiment in this paper will be available shortly.
JH was funded by an MRC fellowship, AM and ZG by EPSRC grant EP/I036575/1 and a Google Focussed Research award.
This is the final version of the article. It first appeared from the Neural Information Processing Systems Foundation via https://papers.nips.cc/paper/5875-mcmc-for-variationally-sparse-gaussian-processe
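The Hybrid (Hamiltonian) Monte Carlo transition that such a scheme is built on can be sketched generically: simulate Hamiltonian dynamics with a leapfrog integrator and apply a Metropolis correction. This is a minimal single-step sketch for any differentiable log-density (the paper applies it jointly to function values and covariance parameters; the toy Gaussian target in the usage below is purely illustrative):

```python
import numpy as np

def hmc_step(x, logp, grad_logp, eps=0.2, n_leap=10, rng=None):
    """One Hybrid (Hamiltonian) Monte Carlo transition targeting exp(logp)."""
    rng = rng or np.random.default_rng()
    p = rng.normal(size=x.shape)                  # resample momentum
    x_new = x.copy()
    p_new = p + 0.5 * eps * grad_logp(x_new)      # leapfrog: half momentum step
    for i in range(n_leap):
        x_new = x_new + eps * p_new               # full position step
        if i < n_leap - 1:
            p_new = p_new + eps * grad_logp(x_new)
    p_new = p_new + 0.5 * eps * grad_logp(x_new)  # final half momentum step
    # Metropolis accept/reject on the Hamiltonian
    h_old = -logp(x) + 0.5 * (p ** 2).sum()
    h_new = -logp(x_new) + 0.5 * (p_new ** 2).sum()
    return x_new if np.log(rng.random()) < h_old - h_new else x
```

Because the accept/reject step corrects the integrator error, the chain is free-form: no Gaussian assumption is placed on the posterior it samples from.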
Adaptive Multiple Importance Sampling for Gaussian Processes
In applications of Gaussian processes where quantification of uncertainty is
a strict requirement, it is necessary to accurately characterize the posterior
distribution over Gaussian process covariance parameters. Normally, this is
done by means of standard Markov chain Monte Carlo (MCMC) algorithms. Motivated
by the issues related to the complexity of calculating the marginal likelihood
that can make MCMC algorithms inefficient, this paper develops an alternative
inference framework based on Adaptive Multiple Importance Sampling (AMIS). This
paper studies the application of AMIS in the case of a Gaussian likelihood, and
proposes the Pseudo-Marginal AMIS for non-Gaussian likelihoods, where the
marginal likelihood is unbiasedly estimated. The results suggest that the
proposed framework outperforms MCMC-based inference of covariance parameters in
a wide range of scenarios and remains competitive for moderately large
dimensional parameter spaces.
Comment: 27 pages
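The AMIS recipe alternates three steps: draw from the current proposal, reweight all samples drawn so far under the deterministic mixture of every proposal used, and adapt the proposal to the weighted moments. A minimal 1-D sketch with Gaussian proposals, assuming only an unnormalised log-target (the parameter and model details of the paper are omitted):

```python
import numpy as np

def amis(log_target, mu0=0.0, sigma0=5.0, n_iters=15, n_per_iter=200, rng=None):
    """Minimal 1-D Adaptive Multiple Importance Sampling sketch.

    Each iteration draws from the current Gaussian proposal, reweights
    *all* samples so far under the deterministic mixture of every proposal
    used (the AMIS recycling step), then adapts the proposal to the
    weighted mean and standard deviation.
    """
    rng = rng or np.random.default_rng(0)
    mus, sigmas, samples = [mu0], [sigma0], []
    for _ in range(n_iters):
        samples.append(rng.normal(mus[-1], sigmas[-1], size=n_per_iter))
        xs = np.concatenate(samples)
        # log-density of the mixture of all proposals (shared constants cancel
        # in the self-normalised weights)
        comp = np.stack([-0.5 * ((xs - m) / s) ** 2 - np.log(s)
                         for m, s in zip(mus, sigmas)])
        cmax = comp.max(axis=0)
        logq = cmax + np.log(np.exp(comp - cmax).sum(axis=0)) - np.log(len(mus))
        logw = log_target(xs) - logq
        w = np.exp(logw - logw.max())
        w /= w.sum()
        mu = (w * xs).sum()
        sigma = np.sqrt((w * (xs - mu) ** 2).sum())
        mus.append(mu)
        sigmas.append(max(sigma, 1e-3))
    return xs, w
```

For a non-Gaussian likelihood, the Pseudo-Marginal variant replaces `log_target` with an unbiased estimate of the marginal likelihood, as the abstract describes.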
Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent
Gaussian processes are a powerful framework for quantifying uncertainty and
for sequential decision-making but are limited by the requirement of solving
linear systems. In general, this has a cubic cost in dataset size and is
sensitive to conditioning. We explore stochastic gradient algorithms as a
computationally efficient method of approximately solving these linear systems:
we develop low-variance optimization objectives for sampling from the posterior
and extend these to inducing points. Counterintuitively, stochastic gradient
descent often produces accurate predictions, even in cases where it does not
converge quickly to the optimum. We explain this through a spectral
characterization of the implicit bias from non-convergence. We show that
stochastic gradient descent produces predictive distributions close to the true
posterior both in regions with sufficient data coverage, and in regions
sufficiently far away from the data. Experimentally, stochastic gradient
descent achieves state-of-the-art performance on sufficiently large-scale or
ill-conditioned regression tasks. Its uncertainty estimates match the
performance of significantly more expensive baselines on a large-scale Bayesian
optimization task.
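The paper's low-variance sampling objectives are more involved, but the core idea can be illustrated on the posterior mean: the representer weights solve a quadratic optimisation problem, so SGD with an unbiased sparse gradient built from random kernel rows approximately solves the linear system without ever factorising it. A minimal sketch under an RBF kernel; all names are illustrative:

```python
import numpy as np

def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def gp_mean_sgd(X, y, noise=1.0, lr=5e-3, n_steps=4000, batch=5, rng=None):
    """SGD for alpha minimising 0.5 a^T (K + noise*I) a - y^T a, whose unique
    optimum is (K + noise*I)^{-1} y, i.e. the GP posterior-mean representer
    weights. Each step evaluates only a random subset of kernel rows, giving
    an unbiased sparse estimate of the full gradient (K + noise*I) a - y.
    """
    rng = rng or np.random.default_rng(0)
    n = len(y)
    K = rbf(X, X)                      # rows could equally be formed on the fly
    a = np.zeros(n)
    for _ in range(n_steps):
        idx = rng.choice(n, size=batch, replace=False)
        g = K[idx] @ a + noise * a[idx] - y[idx]   # sampled gradient coordinates
        a[idx] -= lr * (n / batch) * g             # rescale for unbiasedness
    return a
```

Even when the iterates have not fully converged, the predictions k(x)^T a they induce can already be accurate, which is the implicit-bias phenomenon the abstract analyses spectrally.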
Learning Binary Data Representation for Optical Processing Units
Optical Processing Units (OPUs) are computing devices that perform random projections of input data by exploiting the physical phenomenon of scattering a light source through a diffusive medium. Random projections calculated by OPUs have been used successfully for approximating kernel ridge regression for large datasets with low power consumption and at high speed. However, OPUs require the input data to be binary. In this paper, we propose to use shallow and deep neural networks (NNs) as binary encoders to perform input data binarization. The difficulty in developing a binarization strategy that is learned end-to-end along with the kernel ridge regression parameters is due to the non-differentiability of the operation performed by the OPU. We overcome this difficulty by treating the OPU as a black box and by employing the REINFORCE gradient estimator, which allows us to calculate the gradient of the loss function with respect to the weights of the binarization encoder and to optimize these together with the parameters of kernel ridge regression using gradient-based optimization. Through our experimental campaign on a variety of tasks and datasets, we show that our method outperforms alternative unsupervised and supervised binarization techniques.
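The score-function trick is what lets the gradient bypass both the discrete sampling and the black box. A heavily simplified single-step sketch, where the encoder is linear, the "OPU" is simulated by a fixed random intensity projection, and the paper's variance-reduction baseline is omitted; every name and shape here is an illustrative assumption:

```python
import numpy as np

def reinforce_step(W, X, y, black_box, lr=0.05, lam=1e-2, rng=None):
    """One REINFORCE update of a linear Bernoulli encoder whose samples feed
    a non-differentiable black box (a stand-in for the OPU).

    The score-function estimator grad ~= loss * d log p(b | X) / dW lets the
    loss gradient flow to the encoder weights without differentiating
    through the sampling or the black box.
    """
    rng = rng or np.random.default_rng(0)
    p = 1.0 / (1.0 + np.exp(-(X @ W)))            # Bernoulli probabilities
    b = (rng.random(p.shape) < p).astype(float)   # sampled binary codes
    feats = black_box(b)                          # non-differentiable step
    # closed-form ridge regression on the projected features
    beta = np.linalg.solve(feats.T @ feats + lam * np.eye(feats.shape[1]),
                           feats.T @ y)
    loss = ((feats @ beta - y) ** 2).mean()
    dlogp = b - p                                 # d log Bern(b; p) / d logits
    gW = loss * (X.T @ dlogp)                     # score-function gradient
    return W - lr * gW, loss
```

In the full method, the ridge parameters and encoder weights are optimised jointly rather than re-solved in closed form at every step.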
Learning Inconsistent Preferences with Kernel Methods
We propose a probabilistic kernel approach for preferential learning from
pairwise duelling data using Gaussian Processes. Different from previous
methods, we do not impose a total order on the item space, hence can capture
more expressive latent preferential structures such as inconsistent preferences
and clusters of comparable items. Furthermore, we prove the universality of the
proposed kernels, i.e. that the corresponding reproducing kernel Hilbert Space
(RKHS) is dense in the space of skew-symmetric preference functions. To
conclude the paper, we provide an extensive set of numerical experiments on
simulated and real-world datasets showcasing the competitiveness of our
proposed method with the state of the art.
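One standard way to obtain kernels of this kind, offered here only as an illustrative sketch rather than the paper's exact construction, is to anti-symmetrise a base kernel over item pairs: k((a,b),(c,d)) = k0(a,c) k0(b,d) - k0(a,d) k0(b,c). Every function in the induced RKHS then satisfies f(a,b) = -f(b,a), so no total order on items is imposed:

```python
import numpy as np

def k0(x, z, ell=1.0):
    """Base RBF kernel on individual items."""
    return np.exp(-0.5 * np.sum((x - z) ** 2) / ell ** 2)

def pref_kernel(pair1, pair2):
    """Skew-symmetric pairwise-preference kernel built from a base kernel.

    Flipping either pair flips the sign, so the induced preference
    functions are skew-symmetric by construction. (An illustrative
    construction; the paper's kernels and universality proof are more
    general.)
    """
    (a, b), (c, d) = pair1, pair2
    return k0(a, c) * k0(b, d) - k0(a, d) * k0(b, c)
```

The kernel remains symmetric in its two pair arguments (as a valid kernel must be) while being skew-symmetric within each pair, which is exactly the structure needed to model inconsistent, non-transitive preferences.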
Infinity Learning: Learning Markov Chains from Aggregate Steady-State Observations
We consider the task of learning a parametric Continuous Time Markov Chain
(CTMC) sequence model without examples of sequences, where the training data
consists entirely of aggregate steady-state statistics. Making the problem
harder, we assume that the states we wish to predict are unobserved in the
training data. Specifically, given a parametric model over the transition rates
of a CTMC and some known transition rates, we wish to extrapolate its steady
state distribution to states that are unobserved. A technical roadblock to
learning a CTMC from its steady state has been that the chain rule used to
compute gradients does not work over the arbitrarily long sequences necessary
to reach steady state, from which the aggregate statistics are sampled. To
overcome this optimization challenge, we propose ∞-SGD, a principled stochastic
gradient descent method that uses randomly-stopped estimators to avoid infinite
sums required by the steady state computation, while learning even when only a
subset of the CTMC states can be observed. We apply ∞-SGD to a
real-world testbed and synthetic experiments showcasing its accuracy, ability
to extrapolate the steady state distribution to unobserved states under
unobserved conditions (heavy loads, when training under light loads), and
succeeding in difficult scenarios where even a tailor-made extension of
existing methods fails.
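The randomly-stopped idea can be illustrated with a "Russian roulette" estimator of an infinite series: truncate at a random point and reweight every included term by its inclusion probability, which keeps the estimate unbiased despite never evaluating the full sum. A generic sketch, not the paper's full estimator:

```python
import numpy as np

def randomly_stopped_sum(term, p=0.1, rng=None):
    """Unbiased single-sample estimate of S = sum_{m>=0} term(m).

    Stop after each term with probability p, so term m is included with
    probability P(N >= m) = (1-p)^m; dividing by that probability makes
    the expectation equal S (a 'Russian roulette' estimator). Requires
    term(m) to shrink fast enough relative to (1-p)^m for finite variance.
    """
    rng = rng or np.random.default_rng()
    est, m, survive = 0.0, 0, 1.0
    while True:
        est += term(m) / survive        # survive = P(N >= m) = (1-p)^m
        if rng.random() < p:            # stop with probability p
            break
        survive *= (1.0 - p)
        m += 1
    return est
```

Averaging many such single-sample estimates recovers the infinite sum, which is the mechanism that lets gradients of steady-state quantities be estimated without running the chain for arbitrarily many steps.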
A Bayesian conjugate gradient method (with Discussion)
A fundamental task in numerical computation is the solution of large linear
systems. The conjugate gradient method is an iterative method which offers
rapid convergence to the solution, particularly when an effective
preconditioner is employed. However, for more challenging systems a substantial
error can be present even after many iterations have been performed. The
estimates obtained in this case are of little value unless further information
can be provided about the numerical error. In this paper we propose a novel
statistical model for this numerical error set in a Bayesian framework. Our
approach is a strict generalisation of the conjugate gradient method, which is
recovered as the posterior mean for a particular choice of prior. The estimates
obtained are analysed with Krylov subspace methods and a contraction result for
the posterior is presented. The method is then analysed in a simulation study
as well as being applied to a challenging problem in medical imaging.
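For reference, the point estimate that the Bayesian method strictly generalises is the textbook conjugate gradient iteration for symmetric positive-definite systems; a minimal sketch:

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Textbook conjugate gradient for SPD A: iteratively minimises the
    A-norm of the error over a growing Krylov subspace. The Bayesian
    method recovers this iterate as a posterior mean for one choice of
    prior, while additionally quantifying the remaining numerical error."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x                       # residual
    p = r.copy()                        # search direction
    rs = r @ r
    for _ in range(max_iter or n):
        Ap = A @ p
        alpha = rs / (p @ Ap)           # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p       # A-conjugate next direction
        rs = rs_new
    return x
```

When the iteration is stopped early on a hard system, `x` carries exactly the kind of unquantified error that motivates placing a statistical model over the solution.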