2,614 research outputs found

    Text-based Editing of Talking-head Video

    No full text
    Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts). Our method automatically annotates an input talking-head video with phonemes, visemes, 3D face pose and geometry, reflectance, expression and scene illumination per frame. To edit a video, the user has to only edit the transcript, and an optimization strategy then chooses segments of the input corpus as base material. The annotated parameters corresponding to the selected segments are seamlessly stitched together and used to produce an intermediate video representation in which the lower half of the face is rendered with a parametric face model. Finally, a recurrent video generation network transforms this representation to a photorealistic video that matches the edited transcript. We demonstrate a large variety of edits, such as the addition, removal, and alteration of words, as well as convincing language translation and full sentence synthesis

    Capture, Learning, and Synthesis of 3D Speaking Styles

    Full text link
    Audio-driven 3D facial animation has been widely explored, but achieving realistic, human-like performance is still unsolved. This is due to the lack of available 3D datasets, models, and standard evaluation metrics. To address this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio from 12 speakers. We then train a neural network on our dataset that factors identity from facial motion. The learned model, VOCA (Voice Operated Character Animation) takes any speech signal as input - even speech in languages other than English - and realistically animates a wide range of adult faces. Conditioning on subject labels during training allows the model to learn a variety of realistic speaking styles. VOCA also provides animator controls to alter speaking style, identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball rotations) during animation. To our knowledge, VOCA is the only realistic 3D facial animation model that is readily applicable to unseen subjects without retargeting. This makes VOCA suitable for tasks like in-game video, virtual reality avatars, or any scenario in which the speaker, speech, or language is not known in advance. We make the dataset and model available for research purposes at http://voca.is.tue.mpg.de.Comment: To appear in CVPR 201

    FML: Face Model Learning from Videos

    Full text link
    Monocular image-based 3D reconstruction of faces is a long-standing problem in computer vision. Since image data is a 2D projection of a 3D face, the resulting depth ambiguity makes the problem ill-posed. Most existing methods rely on data-driven priors that are built from limited 3D face scans. In contrast, we propose multi-frame video-based self-supervised training of a deep network that (i) learns a face identity model both in shape and appearance while (ii) jointly learning to reconstruct 3D faces. Our face model is learned using only corpora of in-the-wild video clips collected from the Internet. This virtually endless source of training data enables learning of a highly general 3D face model. In order to achieve this, we propose a novel multi-frame consistency loss that ensures consistent shape and appearance across multiple frames of a subject's face, thus minimizing depth ambiguity. At test time we can use an arbitrary number of frames, so that we can perform both monocular as well as multi-frame reconstruction.Comment: CVPR 2019 (Oral). Video: https://www.youtube.com/watch?v=SG2BwxCw0lQ, Project Page: https://gvv.mpi-inf.mpg.de/projects/FML19

    Cross domain Image Transformation and Generation by Deep Learning

    Get PDF
    Compared with single domain learning, cross-domain learning is more challenging due to the large domain variation. In addition, cross-domain image synthesis is more difficult than other cross learning problems, including, for example, correlation analysis, indexing, and retrieval, because it needs to learn complex function which contains image details for photo-realism. This work investigates cross-domain image synthesis in two common and challenging tasks, i.e., image-to-image and non-image-to-image transfer/synthesis.The image-to-image transfer is investigated in Chapter 2, where we develop a method for transformation between face images and sketch images while preserving the identity. Different from existing works that conduct domain transfer in a one-pass manner, we design a recurrent bidirectional transformation network (r-BTN), which allows bidirectional domain transfer in an integrated framework. More importantly, it could perceptually compose partial inputs from two domains to simultaneously synthesize face and sketch images with consistent identity. Most existing works could well synthesize images from patches that cover at least 70% of the original image. The proposed r-BTN could yield appealing results from patches that cover less than 10% because of the recursive estimation of the missing region in an incremental manner. Extensive experiments have been conducted to demonstrate the superior performance of r-BTN as compared to existing solutions.Chapter 3 targets at image transformation/synthesis from non-image sources, i.e., generating talking face based on the audio input. Existing works either do not consider temporal dependency thus yielding abrupt facial/lip movement or are limited to the generation for a specific person thus lacking generalization capacity. A novel conditional recurrent generation network which incorporates image and audio features in the recurrent unit for temporal dependency is proposed such that smooth transition can be achieved for lip and facial movements. To achieve image- and video-realism, we adopt a pair of spatial-temporal discriminators. Accurate lip synchronization is essential to the success of talking face video generation where we construct a lip-reading discriminator to boost the accuracy of lip synchronization. Extensive experiments demonstrate the superiority of our framework over the state-of-the-arts in terms of visual quality, lip sync accuracy, and smooth transition regarding lip and facial movement

    Automatic 3D Facial Performance Acquisition and Animation using Monocular Videos

    Get PDF
    Facial performance capture and animation is an essential component of many applications such as movies, video games, and virtual environments. Video-based facial performance capture is particularly appealing as it offers the lowest cost and the potential use of legacy sources and uncontrolled videos. However, it is also challenging because of complex facial movements at different scales, ambiguity caused by the loss of depth information, and a lack of discernible features on most facial regions. Unknown lighting conditions and camera parameters further complicate the problem. This dissertation explores the video-based 3D facial performance capture systems that use a single video camera, overcome the challenges aforementioned, and produce accurate and robust reconstruction results. We first develop a novel automatic facial feature detection/tracking algorithm that accurately locates important facial features across the entire video sequence, which are then used for 3D pose and facial shape reconstruction. The key idea is to combine the respective powers of local detection, spatial priors for facial feature locations, Active Appearance Models (AAMs), and temporal coherence for facial feature detection. The algorithm runs in realtime and is robust to large pose and expression variations and occlusions. We then present an automatic high-fidelity facial performance capture system that works on monocular videos. It uses the detected facial features along with multilinear facial models to reconstruct 3D head poses and large-scale facial deformation, and uses per-pixel shading cues to add fine-scale surface details such as emerging or disappearing wrinkles and folds. We iterate the reconstruction procedure on large-scale facial geometry and fine-scale facial details to improve the accuracy of facial reconstruction. We further improve the accuracy and efficiency of the large-scale facial performance capture by introducing a local binary feature based 2D feature regression and a convolutional neural network based pose and expression regression, and complement it with an efficient 3D eye gaze tracker to achieve realtime 3D eye gaze animation. We have tested our systems on various monocular videos, demonstrating the accuracy and robustness under a variety of uncontrolled lighting conditions and overcoming significant shape differences across individuals

    Flow-Attention-based Spatio-Temporal Aggregation Network for 3D Mask Detection

    Full text link
    Anti-spoofing detection has become a necessity for face recognition systems due to the security threat posed by spoofing attacks. Despite great success in traditional attacks, most deep-learning-based methods perform poorly in 3D masks, which can highly simulate real faces in appearance and structure, suffering generalizability insufficiency while focusing only on the spatial domain with single frame input. This has been mitigated by the recent introduction of a biomedical technology called rPPG (remote photoplethysmography). However, rPPG-based methods are sensitive to noisy interference and require at least one second (> 25 frames) of observation time, which induces high computational overhead. To address these challenges, we propose a novel 3D mask detection framework, called FASTEN (Flow-Attention-based Spatio-Temporal aggrEgation Network). We tailor the network for focusing more on fine-grained details in large movements, which can eliminate redundant spatio-temporal feature interference and quickly capture splicing traces of 3D masks in fewer frames. Our proposed network contains three key modules: 1) a facial optical flow network to obtain non-RGB inter-frame flow information; 2) flow attention to assign different significance to each frame; 3) spatio-temporal aggregation to aggregate high-level spatial features and temporal transition features. Through extensive experiments, FASTEN only requires five frames of input and outperforms eight competitors for both intra-dataset and cross-dataset evaluations in terms of multiple detection metrics. Moreover, FASTEN has been deployed in real-world mobile devices for practical 3D mask detection.Comment: 13 pages, 5 figures. Accepted to NeurIPS 202

    Visibility Constrained Generative Model for Depth-based 3D Facial Pose Tracking

    Full text link
    In this paper, we propose a generative framework that unifies depth-based 3D facial pose tracking and face model adaptation on-the-fly, in the unconstrained scenarios with heavy occlusions and arbitrary facial expression variations. Specifically, we introduce a statistical 3D morphable model that flexibly describes the distribution of points on the surface of the face model, with an efficient switchable online adaptation that gradually captures the identity of the tracked subject and rapidly constructs a suitable face model when the subject changes. Moreover, unlike prior art that employed ICP-based facial pose estimation, to improve robustness to occlusions, we propose a ray visibility constraint that regularizes the pose based on the face model's visibility with respect to the input point cloud. Ablation studies and experimental results on Biwi and ICT-3DHP datasets demonstrate that the proposed framework is effective and outperforms completing state-of-the-art depth-based methods
    corecore