5 research outputs found
Continuous Conditional Generative Adversarial Networks for Image Generation: Novel Losses and Label Input Mechanisms
This work proposes the continuous conditional generative adversarial network
(CcGAN), the first generative model for image generation conditional on
continuous, scalar conditions (termed regression labels). Existing conditional
GANs (cGANs) are mainly designed for categorical conditions (eg, class labels);
conditioning on regression labels is mathematically distinct and raises two
fundamental problems:(P1) Since there may be very few (even zero) real images
for some regression labels, minimizing existing empirical versions of cGAN
losses (aka empirical cGAN losses) often fails in practice;(P2) Since
regression labels are scalar and infinitely many, conventional label input
methods are not applicable. The proposed CcGAN solves the above problems,
respectively, by (S1) reformulating existing empirical cGAN losses to be
appropriate for the continuous scenario; and (S2) proposing a naive label input
(NLI) method and an improved label input (ILI) method to incorporate regression
labels into the generator and the discriminator. The reformulation in (S1)
leads to two novel empirical discriminator losses, termed the hard vicinal
discriminator loss (HVDL) and the soft vicinal discriminator loss (SVDL)
respectively, and a novel empirical generator loss. The error bounds of a
discriminator trained with HVDL and SVDL are derived under mild assumptions in
this work. Two new benchmark datasets (RC-49 and Cell-200) and a novel
evaluation metric (Sliding Fr\'echet Inception Distance) are also proposed for
this continuous scenario. Our experiments on the Circular 2-D Gaussians, RC-49,
UTKFace, Cell-200, and Steering Angle datasets show that CcGAN is able to
generate diverse, high-quality samples from the image distribution conditional
on a given regression label. Moreover, in these experiments, CcGAN
substantially outperforms cGAN both visually and quantitatively
Generation of realistic human behaviour
As the use of computers and robots in our everyday lives increases so does the need for better interaction with these devices. Human-computer interaction relies on the ability to understand and generate human behavioural signals such as speech, facial expressions and motion. This thesis deals with the synthesis and evaluation of such signals, focusing not only on their intelligibility but also on their realism. Since these signals are often correlated, it is common for methods to drive the generation of one signal using another. The thesis begins by tackling the problem of speech-driven facial animation and proposing models capable of producing realistic animations from a single image and an audio clip. The goal of these models is to produce a video of a target person, whose lips move in accordance with the driving audio. Particular focus is also placed on a) generating spontaneous expression such as blinks, b) achieving audio-visual synchrony and c) transferring or producing natural head motion. The second problem addressed in this thesis is that of video-driven speech reconstruction, which aims at converting a silent video into waveforms containing speech. The method proposed for solving this problem is capable of generating intelligible and accurate speech for both seen and unseen speakers. The spoken content is correctly captured thanks to a perceptual loss, which uses features from pre-trained speech-driven animation models. The ability of the video-to-speech model to run in real-time allows its use in hearing assistive devices and telecommunications. The final work proposed in this thesis is a generic domain translation system, that can be used for any translation problem including those mapping across different modalities. The framework is made up of two networks performing translations in opposite directions and can be successfully applied to solve diverse sets of translation problems, including speech-driven animation and video-driven speech reconstruction.Open Acces