A good metric, one that enables reliable comparison between solutions, is
essential for any well-defined task. Unlike most vision tasks, which have
per-sample ground truth, image synthesis tasks aim to generate unseen data
and are therefore usually evaluated through a distributional distance between
a set of real samples and a set of generated samples. This study presents
an empirical investigation into the evaluation of synthesis performance, with
generative adversarial networks (GANs) as a representative of generative
models. In particular, we conduct in-depth analyses of several factors, including
how to represent a data point in the representation space, how to calculate a
fair distance using selected samples, and how many instances to use from each
set. Extensive experiments conducted on multiple datasets and settings reveal
several important findings. Firstly, models spanning both CNN-based and
ViT-based architectures serve as reliable and robust feature extractors for
measurement evaluation. Secondly, Centered Kernel Alignment (CKA) provides a
better comparison across different extractors and across the hierarchical
layers within a single model. Finally, CKA is more sample-efficient and shows
better agreement with human judgment when characterizing the similarity between
the internal correlation structures of two sample sets. These findings
contribute to the development of a
new measurement system, which enables a consistent and reliable re-evaluation
of current state-of-the-art generative models.
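
To make the distributional-distance setup above concrete, the following is a
minimal sketch of the conventional Fréchet distance computation (the basis of
FID, one of the metrics this kind of study revisits): each set of extracted
features is summarized by a Gaussian, and the Fréchet distance between the two
Gaussians is reported. The function name is illustrative, and we assume the
features arrive as [n, d] NumPy arrays produced by some pretrained extractor.

    import numpy as np
    from scipy.linalg import sqrtm

    def frechet_distance(real_feats, fake_feats):
        """Frechet distance between Gaussians fitted to two [n, d] feature sets."""
        mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
        cov_r = np.cov(real_feats, rowvar=False)
        cov_f = np.cov(fake_feats, rowvar=False)
        # Matrix square root of the covariance product; small imaginary
        # residue from numerical error is discarded.
        covmean = sqrtm(cov_r @ cov_f)
        if np.iscomplexobj(covmean):
            covmean = covmean.real
        diff = mu_r - mu_f
        return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))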
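
The measure the study favors, Centered Kernel Alignment, has a simple closed
form for the linear kernel (Kornblith et al., 2019). The sketch below is only
illustrative: the abstract does not specify which kernel the paper adopts, nor
how unpaired real and generated samples are matched or aggregated, so we assume
two equal-sized [n, d] feature matrices. Linear CKA lies in [0, 1], with 1
indicating identical correlation structure up to rotation and isotropic scaling.

    import numpy as np

    def linear_cka(X, Y):
        """Linear CKA between feature matrices X [n, d1] and Y [n, d2]."""
        # Center each feature dimension.
        X = X - X.mean(axis=0, keepdims=True)
        Y = Y - Y.mean(axis=0, keepdims=True)
        # ||Y^T X||_F^2, normalized by the self-similarity of each set.
        numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
        denominator = (np.linalg.norm(X.T @ X, ord="fro")
                       * np.linalg.norm(Y.T @ Y, ord="fro"))
        return float(numerator / denominator)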