Predictive distributions quantify uncertainties ignored by point estimates.
This paper introduces The Neural Testbed: an open-source benchmark for
controlled and principled evaluation of agents that generate such predictions.
Crucially, the testbed assesses agents not only on the quality of their
marginal predictions per input, but also on their joint predictions across many
inputs. We evaluate a range of agents using a simple neural network data
generating process. Our results indicate that some popular Bayesian deep
learning agents do not fare well with joint predictions, even when they can
produce accurate marginal predictions. We also show that the quality of joint
predictions drives performance in downstream decision tasks. We find these
results are robust across choice a wide range of generative models, and
highlight the practical importance of joint predictions to the community