Perceptual metrics, like the Fr\'echet Inception Distance (FID), are widely
used to assess the similarity between synthetically generated and ground truth
(real) images. The key idea behind these metrics is to compute errors in a deep
feature space that captures perceptually and semantically rich image features.
Despite their popularity, the effect that different deep features and their
design choices have on a perceptual metric has not been well studied. In this
work, we perform a causal analysis linking differences in semantic attributes
and distortions between face image distributions to Fr\'echet distances (FD)
using several popular deep feature spaces. A key component of our analysis is
the creation of synthetic counterfactual faces using deep face generators. Our
experiments show that the FD is heavily influenced by its feature space's
training dataset and objective function. For example, the FD computed using
features extracted from ImageNet-trained models heavily emphasizes hats over
regions like the eyes and mouth. Moreover, the FD computed using features from a
face gender classifier emphasizes hair length more than distances in an identity
(recognition) feature space do. Finally, we evaluate several popular face
generation models across
feature spaces and find that StyleGAN2 consistently ranks higher than other
face generators, except with respect to identity (recognition) features. This
suggests the need for considering multiple feature spaces when evaluating
generative models and using feature spaces that are tuned to nuances of the
domain of interest.

Comment: Code and dataset to be released soon.
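
For readers unfamiliar with the quantity discussed above, the following is a minimal sketch of a Fr\'echet distance between two sets of deep features, assuming each set is summarized by a Gaussian (mean and covariance), as is standard for FID. The function name `frechet_distance` and the array layout (one row per image, one column per feature dimension) are illustrative assumptions, not the authors' released code. The feature extractor itself (ImageNet-trained, gender classifier, identity network, and so on) is outside the sketch; swapping it changes only the inputs.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (num_images, feature_dim).
    Hypothetical helper for illustration only.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances. Tiny negative or imaginary parts
    # caused by round-off are discarded.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

Under these assumptions, two identical feature sets give a distance of approximately zero, and shifting every feature by a constant vector adds the squared norm of that vector, since the covariances are unchanged.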