Generative models have made huge progress in photorealistic image synthesis in
recent years. To enable humans to steer the image generation process and
customize the output, many works explore the interpretable dimensions of the
latent space in GANs. Existing methods edit the attributes of the output image
such as orientation or color scheme by varying the latent code along certain
directions. However, these methods usually require additional human annotations
for each pretrained model, and they mostly focus on editing global attributes.
In this work, we propose a self-supervised approach to improve the spatial
steerability of GANs without searching for steerable directions in the latent
space or requiring extra annotations. Specifically, we encode randomly sampled
Gaussian heatmaps into the intermediate layers of generative models as a
spatial inductive bias. While the GAN model is trained from scratch, these
heatmaps are aligned with the emerging attention of the GAN's discriminator
in a self-supervised manner. During inference,
human users can intuitively interact with the spatial heatmaps to edit the
output image, such as varying the scene layout or moving objects in the scene.
Extensive experiments show that the proposed method not only enables spatial
editing over human faces, animal faces, outdoor scenes, and complicated indoor
scenes, but also brings improvement in synthesis quality.

Comment: This manuscript is a journal extension of our previous conference
work (arXiv:2112.00718), submitted to TPAM
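
To make the idea of randomly sampled Gaussian heatmaps concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names `render_heatmap` and `sample_heatmap`, the normalized [0, 1] coordinate convention, the isotropic Gaussian form, and the `sigma_range` default are all assumptions made for illustration. It only shows how a heatmap could be sampled and how moving its center, as a user might do at inference time, changes the map. How the heatmaps are encoded into the generator and aligned with the discriminator's attention is not covered.

```python
import math
import random


def render_heatmap(height, width, center, sigma):
    """Render an isotropic 2D Gaussian on a height x width grid.

    `center` is (cx, cy) in normalized [0, 1] coordinates (an assumed
    convention); pixel (i, j) is evaluated at its own center point.
    """
    cx, cy = center
    heatmap = []
    for i in range(height):
        y = (i + 0.5) / height
        row = []
        for j in range(width):
            x = (j + 0.5) / width
            dist_sq = (x - cx) ** 2 + (y - cy) ** 2
            row.append(math.exp(-dist_sq / (2.0 * sigma ** 2)))
        heatmap.append(row)
    return heatmap


def sample_heatmap(height, width, rng, sigma_range=(0.1, 0.3)):
    """Sample a random center and width, then render the heatmap.

    `sigma_range` is a hypothetical choice, not a value from the paper.
    """
    center = (rng.random(), rng.random())
    sigma = rng.uniform(*sigma_range)
    return render_heatmap(height, width, center, sigma), center, sigma


if __name__ == "__main__":
    rng = random.Random(0)
    heatmap, center, sigma = sample_heatmap(8, 8, rng)
    # "Editing" in this sketch means re-rendering with a shifted center.
    new_center = (min(center[0] + 0.25, 1.0), center[1])
    moved = render_heatmap(8, 8, new_center, sigma)
    print("center:", center, "->", new_center)
```

In this sketch, spatial editing amounts to changing `center` (or `sigma`) and re-rendering the map; in the described method, the generator would then be conditioned on the modified heatmap.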