A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony in Talking Head Generation
Animating still face images with deep generative models using a speech input
signal is an active research topic that has recently seen significant
progress. However, much of the effort has gone into lip syncing and rendering
quality, while the generation of natural head motion, let alone the
audio-visual
correlation between head motion and speech, has often been neglected. In this
work, we propose a multi-scale audio-visual synchrony loss and a multi-scale
autoregressive GAN to better handle short- and long-term correlations between
speech and the dynamics of the head and lips. In particular, we train a stack
of syncer models on multimodal input pyramids and use these models as guidance
in a multi-scale generator network to produce audio-aligned motion unfolding
over diverse time scales. Our generator operates in the facial landmark domain,
which is a standard low-dimensional head representation. The experiments show
significant improvements over the state of the art in head motion dynamics
quality and in multi-scale audio-visual synchrony both in the landmark domain
and in the image domain.
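
For intuition, a minimal PyTorch-style sketch of such a multi-scale synchrony
loss might look as follows; the Syncer class, its architecture, and all names
and dimensions are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

class Syncer(torch.nn.Module):
    # Embeds paired audio and motion windows into a shared space
    # (hypothetical stand-in for one pretrained syncer of the stack).
    def __init__(self, audio_dim, motion_dim, embed_dim=128):
        super().__init__()
        self.audio_net = torch.nn.GRU(audio_dim, embed_dim, batch_first=True)
        self.motion_net = torch.nn.GRU(motion_dim, embed_dim, batch_first=True)

    def forward(self, audio, motion):
        _, ha = self.audio_net(audio)    # final hidden state: (1, B, D)
        _, hm = self.motion_net(motion)
        return F.normalize(ha[-1], dim=-1), F.normalize(hm[-1], dim=-1)

def multiscale_sync_loss(syncers, audio, motion):
    # audio: (B, T, audio_dim), motion: (B, T, motion_dim).
    # Level k of the pyramid halves the temporal resolution k times,
    # so each syncer judges synchrony at its own time scale.
    loss = 0.0
    for level, syncer in enumerate(syncers):
        a, m = audio, motion
        if level > 0:
            k = 2 ** level
            a = F.avg_pool1d(a.transpose(1, 2), k).transpose(1, 2)
            m = F.avg_pool1d(m.transpose(1, 2), k).transpose(1, 2)
        ea, em = syncer(a, m)
        # Penalize low cosine similarity between the two embeddings.
        loss = loss + (1.0 - (ea * em).sum(dim=-1)).mean()
    return loss

Summing the per-scale losses lets the generator be guided simultaneously at
every temporal resolution of the pyramid.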
SocialInteractionGAN: Multi-person Interaction Sequence Generation
Prediction of human actions in social interactions has important applications
in the design of social robots or artificial avatars. In this paper, we model
human interaction generation as a discrete multi-sequence generation problem
and present SocialInteractionGAN, a novel adversarial architecture for
conditional interaction generation. Our model builds on a recurrent
encoder-decoder generator network and a dual-stream discriminator. This
architecture allows the discriminator to jointly assess the realism of
interactions and that of individual action sequences. Within each stream, a
recurrent network operating on short subsequences provides local assessments
of the output signal, better guiding the ongoing generation. Crucially,
contextual information on interacting participants is shared among agents and
reinjected into both the generation and discriminator evaluation processes.
We show that the proposed SocialInteractionGAN succeeds in producing highly
realistic action sequences of interacting people, comparing favorably to a
diversity of recurrent and convolutional discriminator baselines. Evaluations
are conducted using modified Inception Score and Fréchet Inception Distance
metrics that we specifically design for discrete sequential data. The
distribution of generated sequences is shown to closely approach that of real
data. In particular, our model properly learns the dynamics of interaction
sequences while exploiting the full range of actions.
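
As a purely illustrative sketch, such a dual-stream recurrent discriminator
could be organized along these lines in PyTorch; class and variable names,
the mean-pooling of agent context, and all dimensions are assumptions rather
than the published design.

import torch

class DualStreamDiscriminator(torch.nn.Module):
    def __init__(self, n_actions, hidden=128):
        super().__init__()
        self.embed = torch.nn.Embedding(n_actions, hidden)
        self.indiv_rnn = torch.nn.GRU(hidden, hidden, batch_first=True)
        self.inter_rnn = torch.nn.GRU(hidden, hidden, batch_first=True)
        self.indiv_head = torch.nn.Linear(hidden, 1)
        self.inter_head = torch.nn.Linear(hidden, 1)

    def forward(self, actions):
        # actions: (B, N, T) integer action ids for N interacting agents.
        B, N, T = actions.shape
        x = self.embed(actions)                    # (B, N, T, H)
        # Individual stream: per-step realism scores for each agent's own
        # sequence (the "local assessments" on short subsequences).
        h_ind, _ = self.indiv_rnn(x.reshape(B * N, T, -1))
        indiv_scores = self.indiv_head(h_ind)      # (B*N, T, 1)
        # Interaction stream: context pooled over agents at every step,
        # so realism is also judged at the level of the whole interaction.
        h_int, _ = self.inter_rnn(x.mean(dim=1))   # (B, T, H)
        inter_scores = self.inter_head(h_int)      # (B, T, 1)
        return indiv_scores, inter_scores

Scoring every time step, rather than only the full sequence, is what provides
the local feedback described above.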
Autoregressive GAN for Semantic Unconditional Head Motion Generation
We address the task of unconditional head motion generation to animate still
human faces in a low-dimensional semantic space. Deviating from talking head
generation conditioned on audio, which seldom puts emphasis on realistic head
motion, we devise a GAN-based architecture that yields rich head motion
sequences while avoiding known pitfalls of GANs. Namely, the autoregressive
generation of incremental outputs ensures smooth trajectories, while a
multi-scale discriminator on input pairs drives generation toward better
handling of high- and low-frequency signals and less mode collapse. We
demonstrate experimentally the relevance of the proposed architecture and
compare it with models that have shown state-of-the-art performance on
similar tasks.
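
A minimal sketch of the autoregressive incremental-output idea, under assumed
names and dimensions (this is not the paper's code), could read:

import torch

class IncrementalGenerator(torch.nn.Module):
    def __init__(self, pose_dim=6, noise_dim=32, hidden=128):
        super().__init__()
        self.noise_dim = noise_dim
        self.hidden = hidden
        self.rnn = torch.nn.GRUCell(pose_dim + noise_dim, hidden)
        self.delta_head = torch.nn.Linear(hidden, pose_dim)

    def forward(self, pose0, steps):
        # pose0: (B, pose_dim) initial head pose (e.g. rotation + translation).
        B = pose0.shape[0]
        h = pose0.new_zeros(B, self.hidden)
        pose, poses = pose0, []
        for _ in range(steps):
            z = torch.randn(B, self.noise_dim, device=pose0.device)
            h = self.rnn(torch.cat([pose, z], dim=-1), h)
            pose = pose + self.delta_head(h)   # integrate a small increment
            poses.append(pose)
        return torch.stack(poses, dim=1)       # (B, steps, pose_dim)

Predicting a small displacement at each step and integrating it keeps the
generated trajectories smooth by construction.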