Accelerating the discovery of novel and more effective therapeutics is an
important pharmaceutical problem in which deep learning is playing an
increasingly significant role. However, real-world drug discovery tasks are
often characterized by a scarcity of labeled data and significant covariate
shift, a setting that poses a challenge to
standard deep learning methods. In this paper, we present Q-SAVI, a
probabilistic model able to address these challenges by encoding explicit prior
knowledge of the data-generating process into a prior distribution over
functions, presenting researchers with a transparent and probabilistically
principled way to encode data-driven modeling preferences. Building on a novel,
gold-standard bioactivity dataset that facilitates a meaningful comparison of
models in an extrapolative regime, we explore different approaches to inducing
data shift and construct a challenging evaluation setup. We then demonstrate
that using Q-SAVI to integrate contextualized prior knowledge of drug-like
chemical space into the modeling process affords substantial gains in
predictive accuracy and calibration, outperforming a broad range of
state-of-the-art self-supervised pre-training and domain adaptation techniques.

Comment: Published in the Proceedings of the 40th International Conference on
Machine Learning (ICML 2023).