On the reversed bias-variance tradeoff in deep ensembles

Abstract

Deep ensembles aggregate the predictions of diverse neural networks to improve generalisation and quantify uncertainty. Here, we investigate their behaviour when the ensemble members' parameter count is increased, a practice typically associated with better performance for single models. We show that, under practical assumptions in the overparametrised regime far into the double descent curve, not only does the ensemble test loss degrade, but common out-of-distribution detection and calibration metrics suffer as well. Reminiscent of deep double descent, we observe this phenomenon not only when increasing a single member's capacity but also as the training budget grows, suggesting that deep ensembles can benefit from early stopping. This sheds light on the success and failure modes of deep ensembles and suggests that averaging finite-width models performs better than the neural tangent kernel limit for these metrics.
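
The following is a minimal sketch (not the paper's code) of the deep ensemble construction the abstract refers to: several small networks are trained from different random initialisations, their softmax outputs are averaged, and predictive entropy is used as a simple uncertainty score of the kind underlying out-of-distribution detection and calibration metrics. The toy data, model widths, and training settings are hypothetical choices for illustration only.

```python
# Minimal deep-ensemble sketch: average member predictions,
# use predictive entropy as an uncertainty score (assumed setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_mlp(in_dim=2, width=64, n_classes=2):
    # Small MLP; `width` stands in for the member's capacity.
    return nn.Sequential(nn.Linear(in_dim, width), nn.ReLU(),
                         nn.Linear(width, n_classes))

def train_member(seed, x, y, epochs=200, lr=1e-2):
    torch.manual_seed(seed)           # ensemble diversity via random init
    model = make_mlp()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
    return model

# Toy two-class data (stand-in for a real training set).
torch.manual_seed(0)
x = torch.randn(256, 2)
y = (x[:, 0] + x[:, 1] > 0).long()

ensemble = [train_member(s, x, y) for s in range(5)]

with torch.no_grad():
    x_test = torch.randn(64, 2)
    # Ensemble prediction: mean of the members' softmax outputs.
    probs = torch.stack([F.softmax(m(x_test), dim=-1) for m in ensemble]).mean(0)
    # Predictive entropy as a simple uncertainty / OOD score.
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
    print(probs.argmax(-1)[:5], entropy[:5])
```

Under the paper's framing, one would then grow `width` (or the number of epochs) for each member and track how the ensemble's test loss, calibration, and OOD scores evolve.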
