2 research outputs found
Rethinking Spatially-Adaptive Normalization
Spatially-adaptive normalization (SPADE) has recently been remarkably successful in
conditional semantic image synthesis: it modulates the normalized activations
with spatially-varying transformations learned from semantic layouts,
preventing the semantic information from being washed away. Despite its
impressive performance, a more thorough understanding of its true advantages
is still needed, not least to reduce the significant computation and parameter
overhead these new structures introduce. In this paper, from a
return-on-investment point of view, we present a deep analysis of the
effectiveness of SPADE and observe that its advantages come mainly from its
semantic awareness rather than its spatial adaptiveness. Motivated by this
observation, we propose class-adaptive normalization (CLADE), a lightweight
variant that is not adaptive to spatial positions or layouts. Thanks to this
design, CLADE greatly reduces the computation cost while still preserving the
semantic information during generation. Extensive experiments on multiple
challenging datasets demonstrate that CLADE achieves fidelity on par with
SPADE at a much lower overhead. Taking the generator for the ADE20K dataset as
an example, the extra parameter and computation costs introduced by CLADE are
only 4.57% and 0.07%, whereas those of SPADE are 39.21% and 234.73%,
respectively.
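For intuition, here is a minimal PyTorch-style sketch of what such a class-adaptive modulation layer could look like, assuming each semantic class owns one learned scale/shift vector that is gathered over the label map; the module and argument names are illustrative, not taken from the official CLADE implementation.

```python
import torch
import torch.nn as nn

class ClassAdaptiveNorm(nn.Module):
    """Sketch of a CLADE-style layer: parameter-free normalization followed
    by a per-class (not per-pixel) learned scale and shift."""

    def __init__(self, num_features: int, num_classes: int):
        super().__init__()
        # Parameter-free normalization of the incoming activations.
        self.norm = nn.BatchNorm2d(num_features, affine=False)
        # One (gamma, beta) vector per semantic class: the only extra
        # parameters, independent of the spatial resolution.
        self.class_gamma = nn.Embedding(num_classes, num_features)
        self.class_beta = nn.Embedding(num_classes, num_features)
        nn.init.ones_(self.class_gamma.weight)
        nn.init.zeros_(self.class_beta.weight)

    def forward(self, x: torch.Tensor, label_map: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) activations; label_map: (B, H, W) integer class ids,
        # assumed already resized (nearest-neighbour) to the feature resolution.
        normalized = self.norm(x)
        # Look up per-class parameters and broadcast them over the layout:
        # (B, H, W, C) -> (B, C, H, W).
        gamma = self.class_gamma(label_map).permute(0, 3, 1, 2)
        beta = self.class_beta(label_map).permute(0, 3, 1, 2)
        return gamma * normalized + beta
```

Because the lookup depends only on each pixel's class id, the extra parameters and computation grow with the number of classes and channels rather than with the spatial resolution, which is where the saving over spatially-varying modulation would come from.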
You Only Need Adversarial Supervision for Semantic Image Synthesis
Despite their recent successes, GAN models for semantic image synthesis still
suffer from poor image quality when trained with only adversarial supervision.
Historically, adding a VGG-based perceptual loss has helped to overcome this
issue, significantly improving synthesis quality, but at the same time it has
limited the progress of GAN models for semantic image synthesis.
In this work, we propose a novel, simplified GAN model, which needs only
adversarial supervision to achieve high quality results. We re-design the
discriminator as a semantic segmentation network, directly using the given
semantic label maps as the ground truth for training. By providing stronger
supervision to the discriminator as well as to the generator through spatially-
and semantically-aware discriminator feedback, we are able to synthesize images
of higher fidelity with better alignment to their input label maps, making the
use of the perceptual loss superfluous. Moreover, we enable high-quality
multi-modal image synthesis through global and local sampling of a 3D noise
tensor injected into the generator, which allows complete or partial image
change. We show that images synthesized by our model are more diverse and
follow the color and texture distributions of real images more closely. We
achieve an average improvement in FID and mIoU over the state of the art
across different datasets using only adversarial supervision.
Comment: Published at ICLR 2021 (Main Conference). Code repository: https://github.com/boschresearch/OASI
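To illustrate the idea of a segmentation-based discriminator, below is a minimal PyTorch-style sketch of per-pixel adversarial losses with N semantic classes plus one extra "fake" class; the function names are ours, and practical details such as class-balancing weights and any additional regularization are omitted.

```python
import torch
import torch.nn.functional as F

def segmentation_d_loss(d_logits_real: torch.Tensor,
                        d_logits_fake: torch.Tensor,
                        label_map: torch.Tensor) -> torch.Tensor:
    # d_logits_*: (B, N+1, H, W) per-pixel class scores from the discriminator,
    # where index N is reserved for the "fake" class.
    # label_map: (B, H, W) ground-truth semantic classes in [0, N-1].
    n_classes = d_logits_real.shape[1] - 1
    # Real images: per-pixel cross-entropy against the ground-truth label map.
    loss_real = F.cross_entropy(d_logits_real, label_map)
    # Fake images: every pixel should be classified as the extra "fake" class.
    fake_target = torch.full_like(label_map, n_classes)
    loss_fake = F.cross_entropy(d_logits_fake, fake_target)
    return loss_real + loss_fake

def segmentation_g_loss(d_logits_fake: torch.Tensor,
                        label_map: torch.Tensor) -> torch.Tensor:
    # The generator tries to make the discriminator assign each generated
    # pixel its intended semantic class instead of "fake".
    return F.cross_entropy(d_logits_fake, label_map)
```

In this formulation the discriminator's feedback is both spatially and semantically resolved, which is the property the abstract credits for making the perceptual loss superfluous.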