Audio Input Generates Continuous Frames to Synthesize Facial Video Using Generative Adversarial Networks
This paper presents a simple method for generating speech videos from audio: given a piece of audio, we generate a video of the target face speaking it. We propose a Generative Adversarial Network (GAN) conditioned on short clips of speech audio, with a Convolutional Gated Recurrent Unit (GRU) in both the generator and the discriminator. The model is trained on short audio segments and the video frames spanning the same duration: we cut the audio and extract the face from the corresponding frames. We design a simple encoder and compare the frames generated by the GAN with and without the GRU. Using the GRU yields temporally coherent frames, and the results show that short audio clips can produce relatively realistic output.
Comment: 5 pages, 5 figures
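The core idea above, a recurrent hidden state shared across frames so that consecutive outputs stay temporally coherent, can be made concrete with a small sketch. The ConvGRU cell below is the standard convolutional-gating formulation; the audio feature dimension, spatial resolutions, and the AudioToFrames wrapper are illustrative assumptions, not the paper's released architecture.

```python
# Minimal sketch, assuming per-frame audio feature vectors as input.
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU: gates computed with 2D convolutions."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)  # update/reset
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)       # candidate state
        self.hid_ch = hid_ch

    def forward(self, x, h):
        if h is None:
            h = torch.zeros(x.size(0), self.hid_ch, x.size(2), x.size(3), device=x.device)
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, 1)
        n = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * n

class AudioToFrames(nn.Module):
    """Maps a short audio clip (as per-frame feature vectors) to video frames."""
    def __init__(self, audio_dim=128, hid_ch=64):
        super().__init__()
        self.project = nn.Linear(audio_dim, hid_ch * 8 * 8)  # lift audio to a spatial map
        self.gru = ConvGRUCell(hid_ch, hid_ch)
        self.decode = nn.Sequential(                         # upsample 8x8 -> 64x64 RGB
            nn.Upsample(scale_factor=2), nn.Conv2d(hid_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(16, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, audio_feats):          # audio_feats: (B, T, audio_dim)
        h, frames = None, []
        for t in range(audio_feats.size(1)):
            x = self.project(audio_feats[:, t]).view(-1, self.gru.hid_ch, 8, 8)
            h = self.gru(x, h)               # hidden state carries temporal coherence
            frames.append(self.decode(h))
        return torch.stack(frames, 1)        # (B, T, 3, 64, 64)

frames = AudioToFrames()(torch.randn(2, 10, 128))  # 2 clips, 10 frames each
```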
Dance In the Wild: Monocular Human Animation with Neural Dynamic Appearance Synthesis
Synthesizing dynamic appearances of humans in motion plays a central role in applications such as AR/VR and video editing. While many recent methods have been proposed to tackle this problem, handling loose garments with complex textures and highly dynamic motion remains challenging. In this paper, we propose a video-based appearance synthesis method that tackles such challenges and demonstrates high-quality results for in-the-wild videos that have not been shown before. Specifically, we adapt a StyleGAN-based architecture to the task of person-specific, video-based motion retargeting. We introduce a novel motion signature that is used to modulate the generator weights to capture dynamic appearance changes, as well as to regularize the single-frame pose estimates for improved temporal coherency. We evaluate our method on a set of challenging videos and show that our approach achieves state-of-the-art performance both qualitatively and quantitatively.
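To illustrate what "modulating the generator weights" means in a StyleGAN-style generator, here is a hedged sketch of StyleGAN2-style modulated convolution driven by a conditioning vector. The motion-signature MLP over a window of past 2D poses is an assumption about the conditioning input, and all layer sizes are illustrative; this is not the paper's implementation.

```python
# Sketch of weight modulation/demodulation, per sample in the batch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, style_dim, k=3, eps=1e-8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.to_scale = nn.Linear(style_dim, in_ch)  # style -> per-input-channel scale
        self.pad, self.eps = k // 2, eps

    def forward(self, x, style):                     # x: (B,C,H,W), style: (B,style_dim)
        b, c, h, w = x.shape
        s = self.to_scale(style).view(b, 1, c, 1, 1) + 1.0
        w_mod = self.weight.unsqueeze(0) * s         # modulate: (B,O,C,k,k)
        demod = torch.rsqrt(w_mod.pow(2).sum(dim=(2, 3, 4), keepdim=True) + self.eps)
        w_mod = w_mod * demod                        # demodulate to unit fan-in norm
        # Grouped-conv trick: fold batch into groups so each sample gets its own kernel.
        x = x.reshape(1, b * c, h, w)
        w_mod = w_mod.reshape(b * self.weight.size(0), c, *self.weight.shape[2:])
        out = F.conv2d(x, w_mod, padding=self.pad, groups=b)
        return out.reshape(b, -1, h, w)

# Assumed form of a motion signature: an MLP over a window of past 2D poses.
motion_sig = nn.Sequential(nn.Flatten(), nn.Linear(5 * 17 * 2, 512), nn.ReLU(),
                           nn.Linear(512, 512))
poses = torch.randn(4, 5, 17, 2)                     # 4 samples, 5-frame pose window
feat = ModulatedConv2d(64, 64, 512)(torch.randn(4, 64, 32, 32), motion_sig(poses))
```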
C2F-FWN: Coarse-to-Fine Flow Warping Network for Spatial-Temporal Consistent Motion Transfer
Human video motion transfer (HVMT) aims to synthesize videos in which one
person imitates the actions of another. Although existing GAN-based HVMT methods have
achieved great success, they either fail to preserve appearance details due to
the loss of spatial consistency between synthesized and exemplary images, or
generate incoherent video results due to the lack of temporal consistency among
video frames. In this paper, we propose Coarse-to-Fine Flow Warping Network
(C2F-FWN) for spatial-temporal consistent HVMT. Particularly, C2F-FWN utilizes
coarse-to-fine flow warping and Layout-Constrained Deformable Convolution
(LC-DConv) to improve spatial consistency, and employs Flow Temporal
Consistency (FTC) Loss to enhance temporal consistency. In addition, provided
with multi-source appearance inputs, C2F-FWN can support appearance attribute
editing with great flexibility and efficiency. Besides public datasets, we also
collected a large-scale HVMT dataset named SoloDance for evaluation. Extensive
experiments conducted on our SoloDance dataset and the iPER dataset show that
our approach outperforms state-of-the-art HVMT methods in terms of both spatial and
temporal consistency. Source code and the SoloDance dataset are available at
https://github.com/wswdx/C2F-FWN.
Comment: This work is accepted by AAAI 2021
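For readers unfamiliar with flow warping, the sketch below shows the two generic building blocks named above, written in PyTorch: backward warping of an image with a dense flow field via grid_sample, and a simple flow-based temporal consistency penalty. Both are assumed textbook formulations for illustration; the actual C2F-FWN modules and losses live in the linked repository.

```python
# Minimal sketch of flow warping and a temporal consistency loss.
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (B,C,H,W) with flow (B,2,H,W) given in pixels:
    each output pixel samples img at its own location plus the flow vector."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow  # sample coords
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (h - 1) - 1.0
    return F.grid_sample(img, torch.stack((grid_x, grid_y), dim=-1),
                         align_corners=True)

def temporal_consistency_loss(cur, prev, flow):
    """Flow-based temporal consistency: the previous output, warped to the
    current frame with the backward flow, should match the current output."""
    return F.l1_loss(cur, warp(prev, flow))

cur, prev = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
loss = temporal_consistency_loss(cur, prev, torch.zeros(1, 2, 64, 64))
```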
Single-Image 3D Human Digitization with Shape-Guided Diffusion
We present an approach to generate a 360-degree view of a person with a
consistent, high-resolution appearance from a single input image. NeRF and its
variants typically require videos or images from different viewpoints. Most
existing approaches taking monocular input either rely on ground-truth 3D scans
for supervision or lack 3D consistency. While recent 3D generative models show
promise of 3D consistent human digitization, these approaches do not generalize
well to diverse clothing appearances, and the results lack photorealism. Unlike
existing work, we utilize high-capacity 2D diffusion models pretrained for
general image synthesis tasks as an appearance prior of clothed humans. To
achieve better 3D consistency while retaining the input identity, we
progressively synthesize multiple views of the human in the input image by
inpainting missing regions with shape-guided diffusion conditioned on
silhouette and surface normal. We then fuse these synthesized multi-view images
via inverse rendering to obtain a fully textured high-resolution 3D mesh of the
given person. Experiments show that our approach outperforms prior methods and
achieves photorealistic 360-degree synthesis of a wide range of clothed humans
with complex textures from a single image.
Comment: SIGGRAPH Asia 2023. Project website: https://human-sgd.github.io
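The progressive synthesis loop described above can be outlined in a few lines. Everything in this sketch is a placeholder: render_guidance and inpaint are hypothetical stand-ins for a mesh rasterizer (e.g., PyTorch3D or nvdiffrast) and for the shape-guided diffusion inpainter, so the code only demonstrates the control flow of sweeping the camera, masking not-yet-observed regions, and filling them conditioned on silhouette and normals.

```python
# Control-flow sketch only; the two helpers are hypothetical stubs.
import torch

def render_guidance(azimuth_deg):
    """Hypothetical rasterizer: silhouette, surface normals, and a mask of
    already-covered pixels for the body shape at this view (random stand-ins)."""
    sil = (torch.rand(1, 1, 256, 256) > 0.5).float()
    normals = torch.rand(1, 3, 256, 256) * 2 - 1
    covered = (torch.rand(1, 1, 256, 256) > 0.7).float()
    return sil, normals, covered

def inpaint(rgb, mask, sil, normals):
    """Hypothetical shape-guided diffusion inpainter: fills mask==1 pixels
    conditioned on silhouette and normals (identity stand-in here)."""
    return torch.where(mask.bool(), normals * 0.5 + 0.5, rgb)

canvas = torch.zeros(1, 3, 256, 256)          # appearance accumulated so far
views = []
for azimuth in range(0, 360, 45):             # progressively sweep around the person
    sil, normals, covered = render_guidance(azimuth)
    missing = sil * (1 - covered)             # inside the body but not yet observed
    views.append(inpaint(canvas, missing, sil, normals))
# The synthesized views are then fused via inverse rendering into a textured mesh.
```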