Deep generative models have achieved conspicuous progress in realistic image
synthesis with multifarious conditional inputs, while generating diverse yet
high-fidelity images remains a grand challenge in conditional image generation.
This paper presents a versatile framework for conditional image generation
which incorporates the inductive bias of CNNs and powerful sequence modeling of
auto-regression that naturally leads to diverse image generation. Instead of
independently quantizing the features of multiple domains as in prior research,
we design an integrated quantization scheme with a variational regularizer that
mingles the feature discretization in multiple domains, and markedly boosts the
auto-regressive modeling performance. Notably, the variational regularizer
enables to regularize feature distributions in incomparable latent spaces by
penalizing the intra-domain variations of distributions. In addition, we design
a Gumbel sampling strategy that allows to incorporate distribution uncertainty
into the auto-regressive training procedure. The Gumbel sampling substantially
mitigates the exposure bias that often incurs misalignment between the training
and inference stages and severely impairs the inference performance. Extensive
experiments over multiple conditional image generation tasks show that our
method achieves superior diverse image generation performance qualitatively and
quantitatively as compared with the state-of-the-art.Comment: Accepted to ECCV 2022 as Oral Presentatio