Revisiting the Evaluation of Image Synthesis with GANs
A good metric, which promises a reliable comparison between solutions, is
essential for any well-defined task. Unlike most vision tasks that have
per-sample ground-truth, image synthesis tasks target generating unseen data
and hence are usually evaluated through a distributional distance between one
set of real samples and another set of generated samples. This study presents
an empirical investigation into the evaluation of synthesis performance, with
generative adversarial networks (GANs) as a representative of generative
models. In particular, we make in-depth analyses of various factors, including
how to represent a data point in the representation space, how to calculate a
fair distance using selected samples, and how many instances to use from each
set. Extensive experiments conducted on multiple datasets and settings reveal
several important findings. Firstly, models with both CNN-based and ViT-based
architectures serve as reliable and robust feature
extractors for measurement evaluation. Secondly, Centered Kernel Alignment
(CKA) provides a better comparison across various extractors and hierarchical
layers in one model. Finally, CKA is more sample-efficient and enjoys better
agreement with human judgment in characterizing the similarity between two
internal data correlations. These findings contribute to the development of a
new measurement system, which enables a consistent and reliable re-evaluation
of current state-of-the-art generative models. Comment: NeurIPS 2023 Datasets and Benchmarks Track
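As a concrete illustration of the measurement described above, the sketch below computes linear Centered Kernel Alignment (CKA) between a set of real-sample features and a set of generated-sample features. This is a minimal sketch, not the paper's released code; the feature extractor, feature dimension, and random inputs are placeholder assumptions.

```python
# Minimal sketch: linear CKA between real and generated feature sets.
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two feature matrices of shape (n_samples, dim)."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature dimension
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

# Hypothetical usage: features extracted from real and synthesized images
# (e.g. by a CNN- or ViT-based extractor, which the paper studies).
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(1000, 768))  # stand-in for real-image features
fake_feats = rng.normal(size=(1000, 768))  # stand-in for GAN-sample features
print(f"CKA similarity: {linear_cka(real_feats, fake_feats):.4f}")
```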
Improving GANs with A Dynamic Discriminator
The discriminator plays a vital role in training generative adversarial networks
(GANs) by distinguishing real from synthesized samples. While the real data
distribution remains the same, the synthesis distribution keeps varying as the
generator evolves, which in turn changes the bi-classification task posed to
the discriminator. We argue that a discriminator whose capacity is adjusted on
the fly can better accommodate such a time-varying task. A comprehensive
empirical study confirms that the proposed
training strategy, termed DynamicD, improves synthesis performance
without incurring any additional computation cost or training objectives. Two
capacity adjusting schemes are developed for training GANs under different data
regimes: i) given a sufficient amount of training data, the discriminator
benefits from a progressively increased learning capacity, and ii) when the
training data is limited, gradually decreasing the layer width mitigates the
over-fitting issue of the discriminator. Experiments on both 2D and 3D-aware
image synthesis tasks conducted on a range of datasets substantiate the
generalizability of our DynamicD as well as its substantial improvement over
the baselines. Furthermore, DynamicD is synergistic with other
discriminator-improving approaches (including data augmentation, regularizers,
and pre-training), and brings further performance gains when combined with
them for learning GANs. Comment: To appear in NeurIPS 2022
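A minimal PyTorch sketch of the core idea follows: a schedule that grows or shrinks the fraction of active discriminator channels over training, matching the two data regimes described above. The schedule endpoints, the channel-masking mechanism, and all module names are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a DynamicD-style discriminator with on-the-fly capacity adjustment.
import torch
import torch.nn as nn

def width_schedule(step: int, total_steps: int, increase: bool) -> float:
    """Fraction of channels active at a given training step (assumed 0.5..1.0 range).

    increase=True  -> grow capacity (sufficient-data regime).
    increase=False -> shrink capacity (limited-data regime).
    """
    frac = step / max(total_steps, 1)
    return 0.5 + 0.5 * frac if increase else 1.0 - 0.5 * frac

class DynamicDiscriminator(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(3, channels, 3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.head = nn.Linear(channels, 1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x: torch.Tensor, width: float) -> torch.Tensor:
        # Mask out trailing channels so only a `width` fraction is active;
        # a full implementation would instead slice the layer weights themselves.
        k = max(1, int(self.conv1.out_channels * width))
        mask = torch.zeros(1, self.conv1.out_channels, 1, 1, device=x.device)
        mask[:, :k] = 1.0
        h = self.act(self.conv1(x)) * mask
        h = self.act(self.conv2(h)) * mask
        return self.head(h.mean(dim=(2, 3)))  # global average pool -> real/fake logit

# Hypothetical usage inside a GAN training loop (limited-data regime).
disc = DynamicDiscriminator()
images = torch.randn(8, 3, 64, 64)
logits = disc(images, width=width_schedule(step=5000, total_steps=10000, increase=False))
```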
SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models
The development of text-to-video (T2V) generation, i.e., generating videos from
a given text prompt, has advanced significantly in recent years. However, relying
solely on text prompts often results in ambiguous frame composition due to
spatial uncertainty. The research community thus leverages dense structure
signals, e.g., per-frame depth/edge sequences, to enhance controllability,
but collecting such signals increases the burden at inference time. In this work,
we present SparseCtrl to enable flexible structure control with temporally
sparse signals, requiring only one or a few inputs, as shown in Figure 1. It
incorporates an additional condition encoder to process these sparse signals
while leaving the pre-trained T2V model untouched. The proposed approach is
compatible with various modalities, including sketches, depth maps, and RGB
images, providing more practical control for video generation and promoting
applications such as storyboarding, depth rendering, keyframe animation, and
interpolation. Extensive experiments demonstrate the generalization of
SparseCtrl on both original and personalized T2V generators. Codes and models
will be publicly available at https://guoyww.github.io/projects/SparseCtrl. Comment: Project page: https://guoyww.github.io/projects/SparseCtrl
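The mechanism can be sketched as a small auxiliary PyTorch module: an extra condition encoder consumes the temporally sparse control frames together with a binary frame mask and produces features to be added to a frozen, pre-trained T2V backbone. The module names, channel sizes, and the way the mask is injected are assumptions for illustration, not the released implementation.

```python
# Sketch of a SparseCtrl-style condition encoder for temporally sparse controls.
import torch
import torch.nn as nn

class SparseConditionEncoder(nn.Module):
    def __init__(self, cond_channels: int = 3, feat_channels: int = 320):
        super().__init__()
        # Input: condition frames concatenated with a per-frame binary mask
        # marking which frames actually carry a control signal.
        self.encoder = nn.Sequential(
            nn.Conv2d(cond_channels + 1, 64, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, feat_channels, 3, padding=1),
        )

    def forward(self, cond: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # cond: (frames, C, H, W); mask: (frames, 1, H, W), 1 where a control is given.
        return self.encoder(torch.cat([cond * mask, mask], dim=1))

# Hypothetical usage: condition only the first of 16 frames (keyframe animation).
frames, channels, size = 16, 3, 64
cond = torch.zeros(frames, channels, size, size)
mask = torch.zeros(frames, 1, size, size)
cond[0] = torch.randn(channels, size, size)  # e.g. an RGB keyframe or depth map
mask[0] = 1.0
residual = SparseConditionEncoder()(cond, mask)
# `residual` would be added to the corresponding features of the frozen T2V model,
# which itself stays untouched, as described in the abstract.
```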