Semantic segmentation has made significant progress in recent years thanks to
deep neural networks, but the common objective of generating a single
segmentation output that accurately matches the image's content may not be
suitable for safety-critical domains such as medical diagnostics and autonomous
driving. Instead, multiple possible correct segmentation maps may be required
to reflect the true distribution of annotation maps. In this context,
stochastic semantic segmentation methods must learn to predict conditional
distributions of labels given the image, but this is challenging due to the
typically multimodal distributions, high-dimensional output spaces, and limited
annotation data. To address these challenges, we propose a conditional
categorical diffusion model (CCDM) for semantic segmentation based on Denoising
Diffusion Probabilistic Models. Our model is conditioned to the input image,
enabling it to generate multiple segmentation label maps that account for the
aleatoric uncertainty arising from divergent ground truth annotations. Our
experimental results show that CCDM achieves state-of-the-art performance on
LIDC, a stochastic semantic segmentation dataset, and outperforms established
baselines on the classical segmentation dataset Cityscapes.Comment: Code available at
https://github.com/LarsDoorenbos/ccdm-stochastic-segmentatio