Training Variational Autoencoders (VAEs) to generate realistic imagery
requires a loss function that reflects human perception of image similarity. We
propose such a loss function based on Watson's perceptual model, which computes
a weighted distance in frequency space and accounts for luminance and contrast
masking. We extend the model to color images, increase its robustness to
translation by using the Fourier Transform, remove artifacts due to splitting
the image into blocks, and make it differentiable. In experiments, VAEs trained
with the new loss function generated realistic, high-quality image samples.
Compared to using the Euclidean distance and the Structural Similarity Index,
the images were less blurry; compared to deep neural network based losses, the
new approach required fewer computational resources and generated images with
fewer artifacts.

Comment: Published at the 34th Conference on Neural Information Processing
Systems (NeurIPS 2020)
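
As a rough illustration of the idea of a weighted distance in frequency space, the following minimal sketch compares the Fourier amplitude spectra of two grayscale images. It is not the loss proposed here: the function name `weighted_frequency_distance`, the default weighting, and the exponent `p` are assumptions made for the example, and luminance masking, contrast masking, color handling, and block processing are all omitted.

```python
import numpy as np


def weighted_frequency_distance(x, y, weights=None, p=2.0):
    """Weighted l_p distance between the FFT amplitude spectra of two images.

    Hypothetical helper for illustration only; it does not reproduce
    Watson's perceptual model or the loss described in the abstract.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.ndim != 2 or x.shape != y.shape:
        raise ValueError("expected two 2-D arrays of identical shape")

    # Amplitude spectra: the magnitude of the 2-D FFT discards phase,
    # so a circular shift of the image leaves it unchanged.
    amp_x = np.abs(np.fft.fft2(x, norm="ortho"))
    amp_y = np.abs(np.fft.fft2(y, norm="ortho"))

    if weights is None:
        # Assumed weighting: emphasize low spatial frequencies.
        fy = np.fft.fftfreq(x.shape[0])[:, None]
        fx = np.fft.fftfreq(x.shape[1])[None, :]
        weights = 1.0 / (1.0 + np.hypot(fx, fy))

    diff = weights * np.abs(amp_x - amp_y)
    return float(np.sum(diff ** p) ** (1.0 / p))


rng = np.random.default_rng(0)
image = rng.random((8, 8))
shifted = np.roll(image, 3, axis=1)  # circular translation by 3 pixels
other = rng.random((8, 8))

print(weighted_frequency_distance(image, shifted))  # close to zero
print(weighted_frequency_distance(image, other))    # clearly positive
```

Because only amplitudes are compared, the circularly shifted copy is at (numerically) zero distance from the original, which hints at why a Fourier-based representation can make a loss more robust to translation.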