Deep Person Generation: A Survey from the Perspective of Face, Pose and Cloth Synthesis
Deep person generation has attracted extensive research attention due to its
wide applications in virtual agents, video conferencing, online shopping and
art/movie production. With the advancement of deep learning, visual appearances
(face, pose, cloth) of a person image can be easily generated or manipulated on
demand. In this survey, we first summarize the scope of person generation, and
then systematically review recent progress and technical trends in deep person
generation, covering three major tasks: talking-head generation (face),
pose-guided person generation (pose) and garment-oriented person generation
(cloth). More than two hundred papers are covered for a thorough overview, and
milestone works are highlighted to mark the major technical breakthroughs.
Based on these fundamental tasks, a number of applications are investigated,
e.g., virtual fitting, digital humans, and generative data augmentation. We
hope this survey sheds light on the future prospects of deep person generation
and provides a helpful foundation for future applications of digital humans.
Animating Through Warping: an Efficient Method for High-Quality Facial Expression Animation
Advances in deep neural networks have considerably improved the art of
animating a still image without operating in the 3D domain. However, prior
methods can only animate small images (typically no larger than 512x512) due to
memory limitations, training difficulty, and the lack of high-resolution (HD)
training datasets, which significantly reduces their potential for applications
in movie production and interactive systems. Motivated by the idea that HD
images can be generated by adding high-frequency residuals to low-resolution
results produced by a neural network, we propose a novel framework, Animating
Through Warping (ATW), that enables efficient animation of HD images.
Specifically, the proposed framework consists of two modules: a novel two-stage
neural-network generator and a novel post-processing module, ResWarp. The
generator only needs to be trained on small images, yet inference can run on an
image of any size. During inference, an HD input image is decomposed into a
low-resolution component (128x128) and its corresponding high-frequency
residuals. The generator predicts the low-resolution result as well as the
motion field that warps the input face to the desired status (e.g., expression
categories or action units). Finally, the ResWarp module warps the residuals
according to the motion field and adds the warped residuals to the naively
up-sampled low-resolution result to produce the final HD output. Experiments
show the effectiveness and efficiency of our method in generating
high-resolution animations. Our framework successfully animates a 4K facial
image, which has never been achieved by prior neural models. In addition, our
method generally preserves the temporal coherency of the generated animations.
Source code will be made publicly available.
Comment: 18 pages, 13 figures, Accepted to ACM Multimedia 202
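The ResWarp step described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: average pooling stands in for the decomposition, nearest-neighbour sampling stands in for the warp, and the motion field is assumed to already be at HD resolution. All function names here are hypothetical.

```python
import numpy as np

def downsample(img, factor):
    # Average-pool decomposition into the low-resolution component (assumption).
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample(img, factor):
    # Naive nearest-neighbour up-sampling of the low-resolution result.
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

def warp(img, flow):
    # Backward warp with nearest-neighbour sampling; flow holds per-pixel (dy, dx).
    h, w = img.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    return img[src_y, src_x]

def res_warp(hd_image, lr_result, flow, factor):
    # ResWarp: warp the high-frequency residual and add it to the
    # naively up-sampled low-resolution result from the generator.
    residual = hd_image - upsample(downsample(hd_image, factor), factor)
    return upsample(lr_result, factor) + warp(residual, flow)
```

With an identity (zero) motion field and the low-resolution component itself as `lr_result`, the output reconstructs the HD input exactly, which is the sanity check implied by the residual decomposition.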
The GAN that warped: semantic attribute editing with unpaired data
Deep neural networks have recently been used to edit images with great success, particularly for faces. However, they are often limited to working at a restricted range of resolutions. Moreover, many methods are so flexible that face edits can result in an unwanted loss of identity. This work proposes to learn how to perform semantic image edits through the application of smooth warp fields. Previous approaches that used warping for semantic edits required paired data, i.e., example images of the same subject with different semantic attributes. In contrast, we employ recent advances in Generative Adversarial Networks that allow our model to be trained with unpaired data. We demonstrate face editing at very high resolutions (4K images) with a single forward pass of a deep network operating at a lower resolution. We also show that our edits are substantially better at preserving the subject's identity. The robustness of our approach is demonstrated by plausible image editing results on the CUB-200 birds dataset. To our knowledge this has not been previously accomplished, due to the challenging nature of the dataset.
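One way to read the "single forward pass at a lower resolution" claim is that a warp field predicted at low resolution is upsampled, and its displacements rescaled, before being applied to the full-resolution image. The NumPy sketch below illustrates only that rescaling step, under the assumption that displacements are measured in pixels; nearest-neighbour upsampling is chosen for brevity and is not taken from the paper.

```python
import numpy as np

def upscale_warp_field(flow_lr, factor):
    # Upsample a low-resolution warp field (H, W, 2) to full resolution.
    # Displacements are in pixels, so they must be rescaled by the same factor.
    flow_hd = np.repeat(np.repeat(flow_lr, factor, axis=0), factor, axis=1)
    return flow_hd * factor
```

A 1.5-pixel shift at low resolution becomes a 6-pixel shift after a 4x upscale, keeping the edit geometrically consistent across resolutions.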
Facial expression animation through action units transfer in latent space
Automatic animation synthesis has attracted much attention from the community. As most existing methods handle a small number of discrete expressions rather than continuous ones, the integrity and realism of their facial expressions are often compromised. In addition, easy manipulation with simple inputs and unsupervised processing, although important for automatic facial expression animation applications, has received relatively little attention. To address these issues, we propose an unsupervised, continuous, automatic facial expression animation approach based on action unit (AU) transfer in the latent space of generative adversarial networks. The expression descriptor, depicted as an AU vector, is transferred into the input image without labeled image pairs, expression annotations, or further network training. We also propose a new approach to quickly generate an input image's latent code and to cluster the boundaries of different AU attributes from their latent codes. Two latent-code operators, vector addition and continuous interpolation, are leveraged to simulate facial expression animation in alignment with these boundaries in the latent space. Experiments show that the proposed approach is effective for facial expression translation and animation synthesis.
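The two latent-code operators named above admit a very simple sketch. The AU direction vector here is a hypothetical placeholder for what the paper's boundary-clustering step would produce; decoding the codes back to images is out of scope.

```python
import numpy as np

def transfer_au(latent, au_direction, strength=1.0):
    # Vector addition: shift a latent code along an AU attribute direction.
    return latent + strength * au_direction

def interpolate_au(latent, au_direction, n_frames):
    # Continuous interpolation: the intermediate codes, once decoded,
    # would yield a smooth expression animation.
    return [latent + t * au_direction for t in np.linspace(0.0, 1.0, n_frames)]
```

The interpolated sequence starts at the original code and ends at the fully transferred one, with evenly spaced expression intensities in between.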
Generation of realistic human behaviour
As the use of computers and robots in our everyday lives increases, so does the need for better interaction with these devices. Human-computer interaction relies on the ability to understand and generate human behavioural signals such as speech, facial expressions and motion. This thesis deals with the synthesis and evaluation of such signals, focusing not only on their intelligibility but also on their realism. Since these signals are often correlated, it is common for methods to drive the generation of one signal using another. The thesis begins by tackling the problem of speech-driven facial animation and proposing models capable of producing realistic animations from a single image and an audio clip. The goal of these models is to produce a video of a target person whose lips move in accordance with the driving audio. Particular focus is also placed on a) generating spontaneous expressions such as blinks, b) achieving audio-visual synchrony and c) transferring or producing natural head motion. The second problem addressed in this thesis is that of video-driven speech reconstruction, which aims at converting a silent video into waveforms containing speech. The method proposed for solving this problem is capable of generating intelligible and accurate speech for both seen and unseen speakers. The spoken content is correctly captured thanks to a perceptual loss, which uses features from pre-trained speech-driven animation models. The ability of the video-to-speech model to run in real time allows its use in hearing assistive devices and telecommunications. The final work proposed in this thesis is a generic domain translation system that can be used for any translation problem, including those mapping across different modalities.
The framework is made up of two networks performing translations in opposite directions and can be successfully applied to diverse translation problems, including speech-driven animation and video-driven speech reconstruction.