38 research outputs found

    EgoFace: Egocentric Face Performance Capture and Videorealistic Reenactment

    No full text
    Face performance capture and reenactment techniques use multiple cameras and sensors, positioned at a distance from the face or mounted on heavy wearable devices. This limits their applications in mobile and outdoor environments. We present EgoFace, a radically new lightweight setup for face performance capture and front-view videorealistic reenactment using a single egocentric RGB camera. Our lightweight setup allows operation in uncontrolled environments and lends itself to telepresence applications such as video-conferencing from dynamic environments. The input image is projected into a low-dimensional latent space of facial expression parameters. Through careful adversarial training of the parameter-space synthetic rendering, a videorealistic animation is produced. Our problem is challenging as the human visual system is sensitive to the smallest face irregularities that could occur in the final results. This sensitivity is even stronger for video results. Our solution is trained in a pre-processing stage in a supervised manner without manual annotations. EgoFace captures a wide variety of facial expressions, including mouth movements and asymmetrical expressions. It works under varying illumination, backgrounds, and movements, handles people of different ethnicities, and can operate in real time.
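    A minimal sketch of the two-stage pipeline described above: an encoder maps the egocentric frame to low-dimensional expression parameters, and a generator decodes them into a front-view frame. The layer sizes and modules below are illustrative assumptions, not the authors' exact architecture, and the adversarial training of the generator is omitted.

```python
# Hedged sketch of an EgoFace-style pipeline (assumed sizes, not the paper's exact network).
import torch
import torch.nn as nn

class ExpressionEncoder(nn.Module):
    """Maps an egocentric RGB frame to low-dimensional expression parameters."""
    def __init__(self, n_params=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, n_params),
        )
    def forward(self, x):
        return self.net(x)

class FrontViewGenerator(nn.Module):
    """Decodes expression parameters into a front-view portrait frame."""
    def __init__(self, n_params=64, size=128):
        super().__init__()
        self.size = size
        self.net = nn.Sequential(
            nn.Linear(n_params, 256), nn.ReLU(),
            nn.Linear(256, 3 * size * size), nn.Tanh(),
        )
    def forward(self, p):
        return self.net(p).view(-1, 3, self.size, self.size)

encoder, generator = ExpressionEncoder(), FrontViewGenerator()
ego_frame = torch.rand(1, 3, 128, 128)   # egocentric input frame
params = encoder(ego_frame)              # latent expression parameters
front_view = generator(params)           # front-view reenactment frame
print(params.shape, front_view.shape)
```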

    EventNeRF: Neural Radiance Fields from a Single Colour Event Camera

    Get PDF
    Asynchronously operating event cameras find many applications due to their high dynamic range, no motion blur, low latency and low data bandwidth. The field has seen remarkable progress during the last few years, and existing event-based 3D reconstruction approaches recover sparse point clouds of the scene. However, such sparsity is a limiting factor in many cases, especially in computer vision and graphics, that has not been addressed satisfactorily so far. Accordingly, this paper proposes the first approach for 3D-consistent, dense and photorealistic novel view synthesis using just a single colour event stream as input. At the core of our method is a neural radiance field trained entirely in a self-supervised manner from events while preserving the original resolution of the colour event channels. Next, our ray sampling strategy is tailored to events and allows for data-efficient training. At test, our method produces results in the RGB space at unprecedented quality. We evaluate our method qualitatively and quantitatively on several challenging synthetic and real scenes and show that it produces significantly denser and more visually appealing renderings than the existing methods. We also demonstrate robustness in challenging scenarios with fast motion and under low lighting conditions. We will release our dataset and our source code to facilitate the research field, see https://4dqv.mpi-inf.mpg.de/EventNeRF/.
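    The self-supervised signal in this setting comes from the event stream itself: events encode log-brightness changes, so renderings at two timestamps can be constrained by the accumulated event polarities in between. The snippet below is a small sketch of that idea; the contrast threshold and the squared-error loss form are assumptions for illustration, not necessarily the paper's exact objective.

```python
# Sketch of an event-based supervision loss for a radiance field (assumed loss form).
import numpy as np

def event_supervision_loss(render_t0, render_t1, event_sum, contrast_threshold=0.25, eps=1e-5):
    """
    render_t0, render_t1 : rendered pixel intensities (H, W) at two timestamps
    event_sum            : signed event count per pixel between t0 and t1
    """
    predicted_change = np.log(render_t1 + eps) - np.log(render_t0 + eps)
    target_change = contrast_threshold * event_sum
    return np.mean((predicted_change - target_change) ** 2)

# Toy usage: brightness increased by exactly one event step everywhere.
h, w = 4, 4
r0 = np.full((h, w), 0.5)
r1 = r0 * np.exp(0.25 * np.ones((h, w)))
print(event_supervision_loss(r0, r1, np.ones((h, w))))  # close to 0
```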

    Neural Voice Puppetry: Audio-driven Facial Reenactment

    Get PDF
    We present Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis. Given an audio sequence of a source person or digital assistant, we generate a photo-realistic output video of a target person that is in sync with the audio of the source input. This audio-driven facial reenactment is driven by a deep neural network that employs a latent 3D face model space. Through the underlying 3D representation, the model inherently learns temporal stability while we leverage neural rendering to generate photo-realistic output frames. Our approach generalizes across different people, allowing us to synthesize videos of a target actor with the voice of any unknown source actor or even synthetic voices that can be generated utilizing standard text-to-speech approaches. Neural Voice Puppetry has a variety of use-cases, including audio-driven video avatars, video dubbing, and text-driven video synthesis of a talking head. We demonstrate the capabilities of our method in a series of audio- and text-based puppetry examples. Our method is not only more general than existing works, since it is agnostic to the input person, but it also shows superior visual and lip-sync quality compared to photo-realistic audio- and video-driven reenactment techniques.
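    The core mapping described above goes from audio features to expression coefficients in a latent 3D face model space, which a neural renderer then turns into frames. Below is a hedged sketch of the audio-to-expression step only; the feature window size, dimensions, and layers are illustrative assumptions.

```python
# Sketch of an audio-to-expression regressor (assumed sizes, renderer omitted).
import torch
import torch.nn as nn

class Audio2Expression(nn.Module):
    """Regresses a window of audio features to 3D face expression coefficients."""
    def __init__(self, audio_dim=29, window=16, n_expr=76):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(audio_dim * window, 256), nn.ReLU(),
            nn.Linear(256, n_expr),
        )
    def forward(self, audio_window):      # (B, window, audio_dim)
        return self.net(audio_window)     # (B, n_expr)

audio = torch.randn(1, 16, 29)            # e.g. per-frame speech features
expr = Audio2Expression()(audio)           # expression coefficients for one frame
print(expr.shape)
```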

    StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN

    Get PDF
    Generative adversarial models (GANs) continue to produce advances in terms of the visual quality of still images, as well as the learning of temporal correlations. However, few works manage to combine these two interesting capabilities for the synthesis of video content: most methods require an extensive training dataset in order to learn temporal correlations, while being rather limited in the resolution and visual quality of their output frames. In this paper, we present a novel approach to the video synthesis problem that helps to greatly improve visual quality and drastically reduce the amount of training data and resources necessary for generating video content. Our formulation separates the spatial domain, in which individual frames are synthesized, from the temporal domain, in which motion is generated. For the spatial domain we make use of a pre-trained StyleGAN network, the latent space of which allows control over the appearance of the objects it was trained for. The expressive power of this model allows us to embed our training videos in the StyleGAN latent space. Our temporal architecture is then trained not on sequences of RGB frames, but on sequences of StyleGAN latent codes. The advantageous properties of the StyleGAN space simplify the discovery of temporal correlations. We demonstrate that it suffices to train our temporal architecture on only 10 minutes of footage of one subject for about 6 hours. After training, our model can not only generate new portrait videos for the training subject, but also for any random subject which can be embedded in the StyleGAN space.
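    The separation described above can be illustrated with a temporal model that operates purely on latent-code sequences: frames are first embedded into a pretrained StyleGAN latent space (not shown here), and a sequence model is trained to predict the next code. The GRU, latent width, and next-code regression loss below are illustrative assumptions, not the paper's exact temporal architecture.

```python
# Sketch of a temporal model over StyleGAN latent codes (assumed architecture).
import torch
import torch.nn as nn

class LatentSequenceModel(nn.Module):
    """Autoregressively predicts the next StyleGAN latent code from past codes."""
    def __init__(self, latent_dim=512, hidden=512):
        super().__init__()
        self.rnn = nn.GRU(latent_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, latent_dim)
    def forward(self, latent_seq):         # (B, T, latent_dim)
        h, _ = self.rnn(latent_seq)
        return self.head(h)                # one prediction per time step

codes = torch.randn(1, 30, 512)            # 30 embedded video frames
pred = LatentSequenceModel()(codes)
loss = torch.mean((pred[:, :-1] - codes[:, 1:]) ** 2)   # next-code regression
print(loss.item())
```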

    f-SfT

    Get PDF

    gCoRF: Generative Compositional Radiance Fields

    Get PDF
    3D generative models of objects enable photorealistic image synthesis with 3D control. Existing methods model the scene as a global scene representation, ignoring the compositional aspect of the scene. Compositional reasoning can enable a wide variety of editing applications, in addition to enabling generalizable 3D reasoning. In this paper, we present a compositional generative model, where each semantic part of the object is represented as an independent 3D representation learned from only in-the-wild 2D data. We start with a global generative model (GAN) and learn to decompose it into different semantic parts using supervision from 2D segmentation masks. We then learn to composite independently sampled parts in order to create coherent global scenes. Different parts can be independently sampled while keeping the rest of the object fixed. We evaluate our method on a wide variety of objects and parts and demonstrate editing applications.
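    Compositing independently sampled part radiance fields into one scene can be done by summing per-part densities and blending colours by relative density. The sketch below shows this standard compositing rule as an assumption for illustration; it is not necessarily the paper's exact formulation.

```python
# Sketch of density-weighted compositing of per-part radiance fields (assumed rule).
import numpy as np

def composite_parts(densities, colours, eps=1e-8):
    """
    densities : (P, N)    per-part density at N sample points
    colours   : (P, N, 3) per-part RGB at the same points
    returns the combined density (N,) and colour (N, 3)
    """
    total = densities.sum(axis=0)
    weights = densities / (total[None, :] + eps)
    colour = (weights[..., None] * colours).sum(axis=0)
    return total, colour

dens = np.random.rand(3, 5)        # e.g. three semantic parts, five samples
cols = np.random.rand(3, 5, 3)
sigma, rgb = composite_parts(dens, cols)
print(sigma.shape, rgb.shape)
```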

    Learning Complete 3D Morphable Face Models from Images and Videos

    Get PDF
    Most 3D face reconstruction methods rely on 3D morphable models, which disentangle the space of facial deformations into identity geometry, expressions and skin reflectance. These models are typically learned from a limited number of 3D scans and thus do not generalize well across different identities and expressions. We present the first approach to learn complete 3D models of face identity geometry, albedo and expression just from images and videos. The virtually endless collection of such data, in combination with our self-supervised learning-based approach, allows for learning face models that generalize beyond the span of existing approaches. Our network design and loss functions ensure a disentangled parameterization of not only identity and albedo, but also, for the first time, an expression basis. Our method also allows for in-the-wild monocular reconstruction at test time. We show that our learned models better generalize and lead to higher quality image-based reconstructions than existing approaches.
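    A compact sketch of how a morphable face model with the disentangled factors named above (identity geometry, expression, albedo) is evaluated once learned: each factor contributes linearly on top of a mean. The basis sizes and random placeholder bases below are assumptions; only the linear-model form is standard.

```python
# Sketch of evaluating a linear 3D morphable face model (placeholder bases).
import numpy as np

n_verts, n_id, n_expr, n_alb = 5000, 80, 64, 80
mean_shape   = np.zeros(3 * n_verts)
id_basis     = np.random.randn(3 * n_verts, n_id) * 1e-3
expr_basis   = np.random.randn(3 * n_verts, n_expr) * 1e-3
mean_albedo  = np.full(3 * n_verts, 0.5)
albedo_basis = np.random.randn(3 * n_verts, n_alb) * 1e-3

def evaluate_face(alpha_id, alpha_expr, alpha_alb):
    """Returns per-vertex positions and albedo for the given coefficients."""
    geometry = mean_shape + id_basis @ alpha_id + expr_basis @ alpha_expr
    albedo   = mean_albedo + albedo_basis @ alpha_alb
    return geometry.reshape(n_verts, 3), albedo.reshape(n_verts, 3)

verts, alb = evaluate_face(np.zeros(n_id), np.zeros(n_expr), np.zeros(n_alb))
print(verts.shape, alb.shape)
```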

    XNect: Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera

    No full text
    We present a real-time approach for multi-person 3D motion capture at over 30 fps using a single RGB camera. It operates in generic scenes and is robust to difficult occlusions both by other people and objects. Our method operates in subsequent stages. The first stage is a convolutional neural network (CNN) that estimates 2D and 3D pose features along with identity assignments for all visible joints of all individuals. We contribute a new architecture for this CNN, called SelecSLS Net, that uses novel selective long and short range skip connections to improve the information flow, allowing for a drastically faster network without compromising accuracy. In the second stage, a fully-connected neural network turns the possibly partial (on account of occlusion) 2D pose and 3D pose features for each subject into a complete 3D pose estimate per individual. The third stage applies space-time skeletal model fitting to the predicted 2D and 3D pose per subject to further reconcile the 2D and 3D pose, and to enforce temporal coherence. Our method returns the full skeletal pose in joint angles for each subject. This is a further key distinction from previous work, which neither extracted global body positions nor joint angle results of a coherent skeleton in real time for multi-person scenes. The proposed system runs on consumer hardware at a previously unseen speed of more than 30 fps given 512x320 images as input, while achieving state-of-the-art accuracy, which we demonstrate on a range of challenging real-world scenes.
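    A schematic sketch of the three-stage data flow described above (per-frame CNN features, per-person lifting to a full 3D pose, then temporal skeletal fitting). The function bodies are placeholders with assumed joint counts; only the stage structure mirrors the abstract, and SelecSLS Net itself is not implemented here.

```python
# Schematic three-stage pipeline sketch (placeholder stage implementations).
import numpy as np

def stage1_cnn_features(image):
    """Would run SelecSLS Net; returns dummy 2D/3D pose features per detected person."""
    n_people, n_joints = 2, 25
    return [{"pose2d": np.zeros((n_joints, 2)),
             "feat3d": np.zeros((n_joints, 3))} for _ in range(n_people)]

def stage2_lift(person_feat):
    """Would run the fully-connected network; completes a possibly partial pose."""
    return np.zeros((25, 3))                        # full 3D joint positions

def stage3_skeleton_fit(pose3d_sequence):
    """Would fit a kinematic skeleton over time; returns joint angles per frame."""
    return [np.zeros(63) for _ in pose3d_sequence]  # e.g. 21 joints x 3 angles

frames = [np.zeros((320, 512, 3)) for _ in range(4)]          # 512x320 inputs
per_frame = [[stage2_lift(p) for p in stage1_cnn_features(f)] for f in frames]
angles = stage3_skeleton_fit([poses[0] for poses in per_frame])  # first subject
print(len(angles), angles[0].shape)
```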

    StyleRig: Rigging StyleGAN for 3D Control Over Portrait Images

    Get PDF