Fully Scalable Gaussian Processes using Subspace Inducing Inputs
We introduce fully scalable Gaussian processes, an implementation scheme that
tackles the problem of handling a large number of training instances together
with high-dimensional input data. Our key idea is a representation trick over
the inducing variables called subspace inducing inputs. This is combined with
certain matrix-preconditioning based parametrizations of the variational
distributions that lead to simplified and numerically stable variational lower
bounds. Our illustrative applications are challenging extreme multi-label
classification problems, which carry the extra burden of a very large number of
class labels. We demonstrate the usefulness of our approach by reporting
predictive performance together with low computational times on datasets with
extremely large numbers of instances and input dimensions.
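A rough numpy sketch of the general idea, not the paper's exact scheme or bound: inducing inputs are parameterized as linear combinations of a few fixed basis directions, so the optimizer works with an M x R coefficient matrix instead of M x D free coordinates. The basis choice, kernel, and Nystrom-style approximation below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M, R = 2000, 5000, 50, 20          # instances, input dims, inducing points, subspace rank

X = rng.standard_normal((N, D))          # high-dimensional inputs

# A fixed basis spanning the subspace: here simply R training rows
# (top principal directions would be another natural choice).
B = X[rng.choice(N, R, replace=False)]   # (R, D)

# Inducing inputs live in span(B): only the M x R coefficients are free,
# instead of M x D coordinates.
C = 0.1 * rng.standard_normal((M, R))
Z = C @ B                                # (M, D) subspace inducing inputs

def rbf(A, E, ell):
    d2 = (A**2).sum(1)[:, None] + (E**2).sum(1)[None, :] - 2 * A @ E.T
    return np.exp(-0.5 * d2 / ell**2)

ell = np.sqrt(D)                         # crude lengthscale for standard-normal inputs
Kzz = rbf(Z, Z, ell) + 1e-6 * np.eye(M)
Kxz = rbf(X, Z, ell)

# Nystrom-style low-rank approximation K ~= Kxz Kzz^{-1} Kzx, the building
# block that sparse / variational GP bounds are assembled from.
L = np.linalg.cholesky(Kzz)
V = np.linalg.solve(L, Kxz.T)            # K_approx = V.T @ V, never formed explicitly
print("trace captured by the approximation:", (V**2).sum(), "of", float(N))
```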
Thoughts on Massively Scalable Gaussian Processes
We introduce a framework and early results for massively scalable Gaussian
processes (MSGP), significantly extending the KISS-GP approach of Wilson and
Nickisch (2015). The MSGP framework enables the use of Gaussian processes (GPs)
on billions of datapoints, without requiring distributed inference, or severe
assumptions. In particular, MSGP reduces the standard O(n^3) complexity of GP
learning and inference to O(n), and the standard O(n^2) complexity per test
point prediction to O(1). MSGP involves 1) decomposing covariance matrices as
Kronecker products of Toeplitz matrices approximated by circulant matrices.
This multi-level circulant approximation allows one to unify the orthogonal
computational benefits of fast Kronecker and Toeplitz approaches, and is
significantly faster than either approach in isolation; 2) local kernel
interpolation and inducing points to allow for arbitrarily located data inputs,
and test time predictions; 3) exploiting block-Toeplitz Toeplitz-block
structure (BTTB), which enables fast inference and learning when
multidimensional Kronecker structure is not present; and 4) projections of the
input space to flexibly model correlated inputs and high dimensional data. The
ability to handle many (m ≈ n) inducing points allows for near-exact
accuracy and large scale kernel learning. Comment: 25 pages, 9 figures.
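MSGP's gains come from structured matrix-vector multiplications. Below is a small numpy sketch of the textbook ingredient behind the Toeplitz/circulant part: multiplying a symmetric Toeplitz kernel matrix by a vector in O(n log n) via circulant embedding and the FFT. It is only one building block, not the multi-level circulant approximation described in the abstract, and the kernel and grid are arbitrary choices.

```python
import numpy as np
from scipy.linalg import toeplitz

def toeplitz_matvec(first_col, v):
    """Multiply the symmetric Toeplitz matrix defined by its first column with v,
    using circulant embedding and the FFT (O(n log n) instead of O(n^2))."""
    n = len(first_col)
    # Embed in a circulant matrix of size 2n-2 whose first column wraps around.
    c = np.concatenate([first_col, first_col[-2:0:-1]])
    v_pad = np.concatenate([v, np.zeros(n - 2)])
    prod = np.fft.irfft(np.fft.rfft(c) * np.fft.rfft(v_pad), n=len(c))
    return prod[:n]

# A kernel evaluated on a regular 1-D grid gives a Toeplitz covariance matrix.
grid = np.linspace(0, 1, 512)
first_col = np.exp(-0.5 * (grid - grid[0])**2 / 0.1**2)   # RBF kernel, lengthscale 0.1

v = np.random.default_rng(0).standard_normal(512)
fast = toeplitz_matvec(first_col, v)

dense = toeplitz(first_col) @ v            # O(n^2) reference
print(np.allclose(fast, dense))            # True
```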
Scaling Gaussian Process Regression with Derivatives
Gaussian processes (GPs) with derivatives are useful in many applications,
including Bayesian optimization, implicit surface reconstruction, and terrain
reconstruction. Fitting a GP to function values and derivatives at n points
in d dimensions requires linear solves and log determinants with an
n(d+1) x n(d+1) positive definite matrix -- leading to prohibitive
O(n^3 d^3) computations for standard direct methods. We propose
iterative solvers using fast matrix-vector multiplications
(MVMs), together with pivoted Cholesky preconditioning that cuts the iterations
to convergence by several orders of magnitude, allowing for fast kernel
learning and prediction. Our approaches, together with dimensionality
reduction, enable Bayesian optimization with derivatives to scale to
high-dimensional problems and large evaluation budgets. Comment: Appears at
Advances in Neural Information Processing Systems 32 (NIPS), 2018.
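To make the matrix in question concrete, here is a numpy/scipy sketch for a 1-D RBF kernel over function values and derivatives, solved with plain conjugate gradients through an MVM-only interface. The fast structured MVMs and the pivoted Cholesky preconditioner that the paper contributes are not reproduced, and the lengthscale and noise values are arbitrary.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n, ell, noise = 500, 0.3, 0.1
x = np.sort(rng.uniform(0, 4, n))

diff = x[:, None] - x[None, :]
k = np.exp(-0.5 * diff**2 / ell**2)
Kff = k                                        # cov(f(x_i), f(x_j))
Kfg = (diff / ell**2) * k                      # cov(f(x_i), f'(x_j))
Kgg = (1.0 / ell**2 - diff**2 / ell**4) * k    # cov(f'(x_i), f'(x_j))

# Joint covariance over n values and n derivatives: a 2n x 2n SPD matrix.
K = np.block([[Kff, Kfg], [Kfg.T, Kgg]]) + noise * np.eye(2 * n)

# Observations: values and derivatives of sin(2x).
y = np.concatenate([np.sin(2 * x), 2 * np.cos(2 * x)])

# Solve K alpha = y with CG, touching K only through matrix-vector products;
# scaling this up is exactly where fast MVMs and preconditioning matter.
A = LinearOperator(K.shape, matvec=lambda v: K @ v)
alpha, info = cg(A, y, maxiter=2000)
print("CG exit flag (0 = converged):", info)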
Variational Inference for Gaussian Process Models with Linear Complexity
Large-scale Gaussian process inference has long faced practical challenges
due to time and space complexity that is superlinear in dataset size. While
sparse variational Gaussian process models are capable of learning from
large-scale data, standard strategies for sparsifying the model can prevent the
approximation of complex functions. In this work, we propose a novel
variational Gaussian process model that decouples the representation of mean
and covariance functions in reproducing kernel Hilbert space. We show that this
new parametrization generalizes previous models. Furthermore, it yields a
variational inference problem that can be solved by stochastic gradient ascent
with time and space complexity that is only linear in the number of mean
function parameters, regardless of the choice of kernels, likelihoods, and
inducing points. This strategy makes the adoption of large-scale expressive
Gaussian process models possible. We run several experiments on regression
tasks and show that this decoupled approach greatly outperforms previous sparse
variational Gaussian process inference procedures.
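A toy 1-D numpy illustration of the decoupling idea only, not the paper's RKHS parameterization or ELBO: the posterior mean uses a large basis whose cost is linear in the number of weights, while the covariance correction uses a much smaller inducing set. All basis locations, weights, and the particular covariance form below are made-up placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(a, b, ell=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

Ma, Mb = 400, 20                          # large mean basis, small covariance basis
Za = np.linspace(0, 5, Ma)                # basis points for the mean
Zb = np.linspace(0, 5, Mb)                # inducing points for the covariance
a = 0.05 * rng.standard_normal(Ma)        # free mean weights: storage and updates are O(Ma)
B = 0.1 * np.eye(Mb)                      # small free covariance block (Mb x Mb)

Xs = np.linspace(0, 5, 7)                 # test points

# Decoupled predictions: mean and covariance use different bases.
mean = rbf(Xs, Za) @ a

Kbb = rbf(Zb, Zb) + 1e-8 * np.eye(Mb)
Ksb = rbf(Xs, Zb)
correction = np.linalg.inv(Kbb) - np.linalg.inv(Kbb + B)    # PSD, so variances stay >= 0
var = 1.0 - np.einsum('ij,jk,ik->i', Ksb, correction, Ksb)  # rbf(x, x) = 1
print(mean.round(3), var.round(3))
```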
Constant-Time Predictive Distributions for Gaussian Processes
One of the most compelling features of Gaussian process (GP) regression is
its ability to provide well-calibrated posterior distributions. Recent advances
in inducing point methods have sped up GP marginal likelihood and posterior
mean computations, leaving posterior covariance estimation and sampling as the
remaining computational bottlenecks. In this paper we address these
shortcomings by using the Lanczos algorithm to rapidly approximate the
predictive covariance matrix. Our approach, which we refer to as LOVE (LanczOs
Variance Estimates), substantially improves time and space complexity. In our
experiments, LOVE computes covariances up to 2,000 times faster and draws
samples 18,000 times faster than existing methods, all without sacrificing
accuracy. Comment: ICML 2018.
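LOVE is exposed in GPyTorch through a prediction setting; a minimal usage sketch follows, assuming the installed GPyTorch version still names it fast_pred_var, with a toy untrained model standing in for a fitted one.

```python
import torch
import gpytorch

class ToyGP(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

train_x = torch.linspace(0, 1, 1000)
train_y = torch.sin(6 * train_x) + 0.1 * torch.randn(1000)
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ToyGP(train_x, train_y, likelihood)

# ... fit hyperparameters by maximizing the marginal likelihood as usual ...

model.eval()
likelihood.eval()
test_x = torch.linspace(0, 1, 500)
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    # fast_pred_var turns on the LOVE cache, so repeated variance queries
    # and posterior sampling reuse the same precomputed decomposition.
    pred = likelihood(model(test_x))
    lower, upper = pred.confidence_region()
print(lower.shape, upper.shape)
```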
Product Kernel Interpolation for Scalable Gaussian Processes
Recent work shows that inference for Gaussian processes can be performed
efficiently using iterative methods that rely only on matrix-vector
multiplications (MVMs). Structured Kernel Interpolation (SKI) exploits these
techniques by deriving approximate kernels with very fast MVMs. Unfortunately,
such strategies suffer badly from the curse of dimensionality. We develop a new
technique for MVM based learning that exploits product kernel structure. We
demonstrate that this technique is broadly applicable, resulting in linear
rather than exponential runtime with dimension for SKI, as well as
state-of-the-art asymptotic complexity for multi-task GPs. Comment: Appears in Artificial Intelligence and Statistics (AISTATS) 21, 2018.
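The enabling identity is that an MVM with a Hadamard (element-wise) product of kernel matrices reduces to a handful of MVMs with one factor once the other factor is held in low-rank (e.g. Lanczos) form. A small numpy check of that identity, stripped of the SKI interpolation machinery; the matrices and rank below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 1000, 15

# One per-dimension kernel held in low-rank form (as a Lanczos step would give),
# the other accessed only through matrix-vector products.
A = rng.standard_normal((n, r))
K1 = A @ A.T                                   # rank-r factor
x = np.sort(rng.uniform(0, 1, n))
K2 = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 0.1**2)

v = rng.standard_normal(n)

# (K1 o K2) v  ==  sum_i a_i o (K2 (a_i o v)),  costing r MVMs with K2
# instead of one O(n^2) MVM with the full product kernel.
fast = sum(A[:, i] * (K2 @ (A[:, i] * v)) for i in range(r))

dense = (K1 * K2) @ v                          # brute-force reference
print(np.allclose(fast, dense))                # True
```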
Blitzkriging: Kronecker-structured Stochastic Gaussian Processes
We present Blitzkriging, a new approach to fast inference for Gaussian
processes, applicable to regression, optimisation and classification.
State-of-the-art (stochastic) inference for Gaussian processes on very large
datasets scales cubically in the number of 'inducing inputs', variables
introduced to factorise the model. Blitzkriging shares state-of-the-art scaling
with data, but reduces the scaling in the number of inducing points to
approximately linear. Further, in contrast to other methods, Blitzkriging: does
not force the data to conform to any particular structure (including
grid-like); reduces reliance on error-prone optimisation of inducing point
locations; and is able to learn rich (covariance) structure from the data. We
demonstrate the benefits of our approach on real data in regression,
time-series prediction and signal-interpolation experiments.
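The Kronecker-structured piece rests on a standard identity: an MVM with K1 ⊗ K2 is two small matrix products and never materializes the big matrix. A short numpy sketch, with sizes kept tiny so the brute-force check fits in memory; the grid and kernels are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(x, ell):
    return np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2)

# A kernel on a 2-D grid factorizes over dimensions: K = K1 kron K2.
x1, x2 = np.linspace(0, 1, 60), np.linspace(0, 1, 50)
K1, K2 = rbf(x1, 0.2), rbf(x2, 0.1)

v = rng.standard_normal(len(x1) * len(x2))

# (K1 kron K2) v  ==  vec(K1 V K2^T)  with V = v reshaped to (n1, n2).
fast = (K1 @ v.reshape(len(x1), len(x2)) @ K2.T).reshape(-1)

dense = np.kron(K1, K2) @ v                    # 3000 x 3000 brute-force reference
print(np.allclose(fast, dense))                # True
```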
GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration
Despite advances in scalable models, the inference tools used for Gaussian
processes (GPs) have yet to fully capitalize on developments in computing
hardware. We present an efficient and general approach to GP inference based on
Blackbox Matrix-Matrix multiplication (BBMM). BBMM inference uses a modified
batched version of the conjugate gradients algorithm to derive all terms for
training and inference in a single call. BBMM reduces the asymptotic complexity
of exact GP inference from O(n^3) to O(n^2). Adapting this algorithm to
scalable approximations and complex GP models simply requires a routine for
efficient matrix-matrix multiplication with the kernel and its derivative. In
addition, BBMM uses a specialized preconditioner to substantially speed up
convergence. In experiments we show that BBMM effectively uses GPU hardware to
dramatically accelerate both exact GP inference and scalable approximations.
Additionally, we provide GPyTorch, a software platform for scalable GP
inference via BBMM, built on PyTorch. Comment: NeurIPS 2018.
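A condensed torch sketch of the access pattern only: a conjugate gradients loop that handles a block of right-hand sides at once, so every step is a matrix-matrix product that maps well to GPUs. This is not GPyTorch's actual routine, which (per the abstract) also extracts the remaining training terms from the same call and applies a specialized preconditioner; the kernel, sizes, and tolerances here are arbitrary.

```python
import torch

def batched_cg(matmul, B, iters=500, tol=1e-6):
    """Solve A X = B for several right-hand sides at once, accessing A only
    through a matrix-matrix multiply `matmul` (the blackbox MVM/MMM pattern)."""
    X = torch.zeros_like(B)
    R = B.clone()                          # residuals, one column per RHS
    P = R.clone()
    rs = (R * R).sum(dim=0)                # squared residual norms per column
    for _ in range(iters):
        AP = matmul(P)
        alpha = rs / (P * AP).sum(dim=0)
        X = X + alpha * P
        R = R - alpha * AP
        rs_new = (R * R).sum(dim=0)
        if rs_new.max().sqrt() < tol:
            break
        P = R + (rs_new / rs) * P
        rs = rs_new
    return X

# Toy SPD system: an RBF kernel matrix plus noise.
x = torch.linspace(0, 1, 2000)
K = torch.exp(-0.5 * (x[:, None] - x[None, :])**2 / 0.1**2) + 0.1 * torch.eye(2000)
B = torch.randn(2000, 16)                  # 16 right-hand sides (targets, probe vectors, ...)

X = batched_cg(lambda M: K @ M, B)
print(torch.norm(K @ X - B) / torch.norm(B))   # relative residual
```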
Stochastic Subspace Descent
We present two stochastic descent algorithms that apply to unconstrained
optimization and are particularly efficient when the objective function is slow
to evaluate and gradients are not easily obtained, as in some PDE-constrained
optimization and machine learning problems. The basic algorithm projects the
gradient onto a random subspace at each iteration, similar to coordinate
descent but without restricting directional derivatives to be along the axes.
This algorithm was known previously, but we provide new analysis. We also extend
the popular SVRG method to this framework but without requiring that the
objective function be written as a finite sum. We provide proofs of convergence
for our methods under various convexity assumptions and show favorable results
when compared to gradient descent and BFGS on non-convex problems from the
machine learning and shape optimization literature. We also note that our
analysis gives a proof that the iterates of SVRG and several other popular
first-order stochastic methods, in their original formulation, converge almost
surely to the optimum; to our knowledge, prior to this work the iterates of
SVRG had only been known to converge in expectation. Comment: 34 pages, 7 figures, submitted on 4/1/19. Update: main document 24 pages, supplementary material 9 pages.
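A compact numpy sketch of the basic algorithm, under the assumption that only function evaluations are available, so directional derivatives along the random subspace are estimated by forward finite differences; the step size, subspace dimension, and test problem are illustrative choices, not the paper's settings.

```python
import numpy as np

def stochastic_subspace_descent(f, x0, k=5, step=0.1, fd_eps=1e-6, iters=1000, seed=0):
    """Gradient descent where, at every iteration, the gradient is replaced by its
    projection onto a random k-dimensional subspace, with the directional
    derivatives estimated by finite differences (no full gradient is ever formed)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    d = x.size
    for _ in range(iters):
        P, _ = np.linalg.qr(rng.standard_normal((d, k)))   # k orthonormal directions
        fx = f(x)
        g = np.array([(f(x + fd_eps * P[:, j]) - fx) / fd_eps for j in range(k)])
        x = x - step * (P @ g)                             # step along the projected gradient
    return x

# Smooth test problem: a 50-dimensional convex quadratic.
A = np.diag(np.linspace(1.0, 10.0, 50))
f = lambda x: 0.5 * x @ A @ x
x0 = np.ones(50)
x_end = stochastic_subspace_descent(f, x0)
print(f(x0), "->", f(x_end))
```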
Deep Latent-Variable Kernel Learning
Deep kernel learning (DKL) leverages the connection between Gaussian process
(GP) and neural networks (NN) to build an end-to-end, hybrid model. It combines
the capability of NN to learn rich representations under massive data and the
non-parametric property of GP to achieve automatic regularization that
incorporates a trade-off between model fit and model complexity. However, the
deterministic encoder may weaken the regularization provided by the subsequent
GP part, especially on small datasets, because the latent representation is
unconstrained. We
therefore present a complete deep latent-variable kernel learning (DLVKL) model
wherein the latent variables perform stochastic encoding for regularized
representation. We further enhance DLVKL in two respects: (i) an expressive
variational posterior based on a neural stochastic differential equation (NSDE)
to improve the approximation quality, and (ii) a hybrid prior that draws on
both the SDE prior and the posterior to arrive at a flexible trade-off.
Extensive experiments show that DLVKL-NSDE performs similarly to a
well-calibrated GP on small datasets, and outperforms existing deep GPs on
large datasets. Comment: 13 pages, 8 figures, preprint under review.
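A torch sketch of the stochastic-encoding idea only, with hypothetical layer sizes and helper names; the GP layer, NSDE posterior, and ELBO from the paper are not reproduced. The encoder outputs a distribution over latent codes, a reparameterized sample of which the kernel is evaluated on, so the representation handed to the GP is regularized rather than a free deterministic feature.

```python
import torch
import torch.nn as nn

class StochasticEncoder(nn.Module):
    """Maps inputs to a distribution over latent codes (mean and log-variance),
    then draws a reparameterized sample for the GP kernel to consume."""
    def __init__(self, in_dim, latent_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, latent_dim)
        self.log_var = nn.Linear(hidden, latent_dim)

    def forward(self, x):
        h = self.net(x)
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)   # reparameterization
        # KL to a standard normal prior regularizes the latent representation.
        kl = 0.5 * (mu**2 + log_var.exp() - log_var - 1).sum(dim=-1).mean()
        return z, kl

def rbf_gram(z, lengthscale=1.0):
    d2 = torch.cdist(z, z)**2
    return torch.exp(-0.5 * d2 / lengthscale**2)

x = torch.randn(128, 20)
encoder = StochasticEncoder(in_dim=20)
z, kl = encoder(x)
K = rbf_gram(z) + 1e-4 * torch.eye(len(z))   # kernel matrix a GP layer would consume
print(K.shape, kl.item())
```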