Improving Self-Supervised Single View Depth Estimation by Masking Occlusion
Single view depth estimation models can be trained from video footage using a
self-supervised end-to-end approach with view synthesis as the supervisory
signal. This is achieved with a framework that predicts depth and camera
motion, with a loss based on reconstructing a target video frame from
temporally adjacent frames. In this context, occlusion relates to parts of a
scene that can be observed in the target frame but not in a frame used for
image reconstruction. Since the image reconstruction is based on sampling from
the adjacent frame, and occluded areas by definition cannot be sampled,
reconstructed occluded areas corrupt the supervisory signal. In previous
work (arXiv:1806.01260), occlusion is handled based on reconstruction error: at
each pixel location, only the reconstruction with the lowest error is included
in the loss. The current study aims to determine whether the performance of
depth estimation models can be improved by ignoring, during training, only
those regions that are affected by occlusion.
In this work we introduce an occlusion mask that can be used during training
to specifically ignore regions that cannot be reconstructed due to occlusion.
The occlusion mask is based entirely on predicted depth information. We
introduce two novel loss formulations which incorporate the occlusion mask. The
method and implementation of arXiv:1806.01260 serve as the foundation for our
modifications as well as the baseline in our experiments. We demonstrate that
(i) incorporating the occlusion mask in the loss function improves the
performance of single-image depth prediction models on the KITTI benchmark,
and (ii) loss functions that select among reconstructions based on error are
able to ignore some of the reprojection error caused by object motion.
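As a rough illustration of the two ideas above, the per-pixel
minimum-reprojection selection of arXiv:1806.01260 and a mask-based
alternative can be sketched as follows (PyTorch). The tensor shapes, the L1
photometric error, and the function names are assumptions for the sketch,
not the authors' implementation.

    import torch

    def min_reprojection_loss(target, reconstructions):
        """target: (B, 3, H, W); reconstructions: list of (B, 3, H, W)
        frames warped from temporally adjacent views."""
        # Per-pixel photometric error for each candidate reconstruction.
        errors = torch.stack(
            [torch.abs(r - target).mean(dim=1) for r in reconstructions],
            dim=1)  # (B, num_sources, H, W)
        # Keeping only the lowest error per pixel discounts occlusion,
        # since an occluded pixel is often visible in another source frame.
        min_error, _ = errors.min(dim=1)  # (B, H, W)
        return min_error.mean()

    def occlusion_masked_loss(target, reconstructions, occlusion_masks):
        """Sketch of the mask-based idea: a depth-derived binary mask
        (1 = reconstructable) zeroes out occluded pixels instead."""
        total, weight = 0.0, 0.0
        for recon, mask in zip(reconstructions, occlusion_masks):
            err = torch.abs(recon - target).mean(dim=1)  # (B, H, W)
            total = total + (err * mask).sum()
            weight = weight + mask.sum().clamp(min=1.0)
        return total / weight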
Multiple View Geometry For Video Analysis And Post-production
Multiple view geometry is the foundation of an important class of computer vision techniques for simultaneous recovery of camera motion and scene structure from a set of images. There are numerous important applications in this area. Examples include video post-production, scene reconstruction, registration, surveillance, tracking, and segmentation. In video post-production, which is the topic being addressed in this dissertation, computer analysis of the motion of the camera can replace the currently used manual methods for correctly aligning an artificially inserted object in a scene. However, existing single view methods typically require multiple vanishing points, and therefore would fail when only one vanishing point is available. In addition, current multiple view techniques, making use of either epipolar geometry or the trifocal tensor, do not fully exploit the properties of constant or known camera motion. Finally, there does not exist a general solution to the problem of synchronization of N video sequences of distinct general scenes captured by cameras undergoing similar ego-motions, which is the necessary step for video post-production among different input videos. This dissertation proposes several advancements that overcome these limitations. These advancements are used to develop an efficient framework for video analysis and post-production with multiple cameras. In the first part of the dissertation, novel inter-image constraints are introduced that are particularly useful for scenes where minimal information is available. This result extends the current state-of-the-art in single view geometry techniques to situations where only one vanishing point is available. The property of constant or known camera motion is also exploited in this dissertation for applications such as calibration of a network of cameras in video surveillance systems, and Euclidean reconstruction from turn-table image sequences in the presence of zoom and focus. We then propose a new framework for the estimation and alignment of camera motions, including both simple (panning, tracking and zooming) and complex (e.g. hand-held) camera motions. The accuracy of these results is demonstrated by applying our approach to video post-production applications such as video cut-and-paste and shadow synthesis. As realistic image-based rendering problems, these applications require extreme accuracy in the estimation of camera geometry, the position and the orientation of the light source, and the photometric properties of the resulting cast shadows. In each case, the theoretical results are fully supported and illustrated by both numerical simulations and thorough experimentation on real data.
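For orientation, the classical two-view motion-and-structure recovery that
this line of work builds on can be sketched with OpenCV as below. This is a
generic baseline, not the dissertation's single-vanishing-point or
constant-motion method; pts1/pts2 (matched pixel coordinates, Nx2 float
arrays) and K (the camera intrinsics) are assumed inputs.

    import numpy as np
    import cv2

    def two_view_reconstruction(pts1, pts2, K):
        # Essential matrix from matched points, with RANSAC for outliers.
        E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
        # Relative camera rotation R and translation t (up to scale).
        _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
        # Triangulate scene points from the two projection matrices.
        P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
        P2 = K @ np.hstack([R, t])
        X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
        return R, t, (X_h[:3] / X_h[3]).T  # Euclidean 3D points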
Video Primal Sketch: A Unified Middle-Level Representation for Video
This paper presents a middle-level video representation named Video Primal
Sketch (VPS), which integrates two regimes of models: i) sparse coding model
using static or moving primitives to explicitly represent moving corners,
lines, feature points, etc., ii) FRAME/MRF model reproducing feature
statistics extracted from input video to implicitly represent textured motion,
such as water and fire. The feature statistics include histograms of
spatio-temporal filters and velocity distributions. This paper makes three
contributions to the literature: i) Learning a dictionary of video primitives
using parametric generative models; ii) Proposing the Spatio-Temporal FRAME
(ST-FRAME) and Motion-Appearance FRAME (MA-FRAME) models for modeling and
synthesizing textured motion; and iii) Developing a parsimonious hybrid model
for generic video representation. Given an input video, VPS selects the proper
models automatically for different motion patterns and is compatible with
high-level action representations. In the experiments, we synthesize a number
of textured motions; reconstruct real videos using the VPS; report a series of
human perception experiments to verify the quality of reconstructed videos;
demonstrate how the VPS changes over the scale transition in videos; and
present the close connection between VPS and high-level action models.
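One ingredient of the FRAME-style statistics mentioned above, histograms of
spatio-temporal filter responses pooled over a video volume, can be sketched
as follows. The random filter bank and histogram range are stand-in
assumptions; the paper learns its primitives and also uses velocity
distributions.

    import numpy as np
    from scipy.ndimage import convolve

    def st_filter_histograms(video, filters, bins=15, rng=(-1.0, 1.0)):
        """video: (T, H, W) grayscale in [0, 1]; filters: 3D kernels."""
        hists = []
        for k in filters:
            response = convolve(video, k, mode="nearest")
            h, _ = np.histogram(response, bins=bins, range=rng, density=True)
            hists.append(h)  # one marginal statistic per filter
        return np.stack(hists)  # (num_filters, bins) feature statistics

    # Tiny usage example with a random video and random 3x5x5 kernels.
    rand = np.random.default_rng(0)
    video = rand.random((16, 64, 64))
    filters = [rand.standard_normal((3, 5, 5)) * 0.05 for _ in range(4)]
    stats = st_filter_histograms(video, filters)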
Text-based Editing of Talking-head Video
Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts). Our method automatically annotates an input talking-head video with phonemes, visemes, 3D face pose and geometry, reflectance, expression and scene illumination per frame. To edit a video, the user only has to edit the transcript, and an optimization strategy then chooses segments of the input corpus as base material. The annotated parameters corresponding to the selected segments are seamlessly stitched together and used to produce an intermediate video representation in which the lower half of the face is rendered with a parametric face model. Finally, a recurrent video generation network transforms this representation to a photorealistic video that matches the edited transcript. We demonstrate a large variety of edits, such as the addition, removal, and alteration of words, as well as convincing language translation and full sentence synthesis.
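The segment-selection step can be illustrated, in a heavily simplified form,
as matching the edited viseme sequence against the annotated corpus. The real
system optimizes for seamless pose and expression continuity; this toy sketch
only finds exact matching runs, and the per-frame labels are made up.

    from difflib import SequenceMatcher

    def select_segments(corpus_visemes, target_visemes):
        """Return (corpus_start, target_start, length) matching blocks."""
        matcher = SequenceMatcher(a=corpus_visemes, b=target_visemes)
        return [(m.a, m.b, m.size)
                for m in matcher.get_matching_blocks() if m.size]

    corpus = list("AABOFFUOMAAB")   # stand-in per-frame viseme labels
    target = list("OFFUO")
    print(select_segments(corpus, target))  # [(3, 0, 5)]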
Frame Interpolation with Multi-Scale Deep Loss Functions and Generative Adversarial Networks
Frame interpolation attempts to synthesise frames given one or more
consecutive video frames. In recent years, deep learning approaches, and
notably convolutional neural networks, have succeeded at tackling low- and
high-level computer vision problems including frame interpolation. These
techniques often tackle two problems, namely algorithm efficiency and
reconstruction quality. In this paper, we present a multi-scale generative
adversarial network for frame interpolation (FIGAN). To maximise the
efficiency of our network, we propose a novel multi-scale residual estimation
module where the predicted flow and synthesised frame are constructed in a
coarse-to-fine fashion. To improve the quality of synthesised intermediate
video frames, our network is jointly supervised at different levels with a
perceptual loss function that consists of an adversarial and two content
losses. We evaluate the proposed approach using a collection of 60 fps videos
from YouTube-8M. Our results improve the state-of-the-art accuracy and provide
subjective visual quality comparable to the best performing interpolation
method at a 47x faster runtime.
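A loss of the kind described, an adversarial term plus two content terms,
might look like the following PyTorch sketch. The weights, the L1 distances,
and the pretrained feature extractor (e.g. a VGG slice) are assumptions, not
the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def perceptual_loss(pred, target, disc_logits, feat_net,
                        w_pix=1.0, w_feat=0.1, w_adv=0.01):
        # Content loss 1: pixel-space reconstruction.
        pixel = F.l1_loss(pred, target)
        # Content loss 2: distance in a pretrained feature space.
        feat = F.l1_loss(feat_net(pred), feat_net(target))
        # Adversarial (non-saturating) generator loss on the logits.
        adv = F.binary_cross_entropy_with_logits(
            disc_logits, torch.ones_like(disc_logits))
        return w_pix * pixel + w_feat * feat + w_adv * adv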
HeadOn: Real-time Reenactment of Human Portrait Videos
We propose HeadOn, the first real-time source-to-target reenactment approach
for complete human portrait videos that enables transfer of torso and head
motion, face expression, and eye gaze. Given a short RGB-D video of the target
actor, we automatically construct a personalized geometry proxy that embeds a
parametric head, eye, and kinematic torso model. A novel real-time reenactment
algorithm employs this proxy to photo-realistically map the captured motion
from the source actor to the target actor. On top of the coarse geometric
proxy, we propose a video-based rendering technique that composites the
modified target portrait video via view- and pose-dependent texturing, and
creates photo-realistic imagery of the target actor under novel torso and head
poses, facial expressions, and gaze directions. To this end, we propose a
robust tracking of the face and torso of the source actor. We extensively
evaluate our approach and show significant improvements in enabling much
greater flexibility in creating realistic reenacted output videos.Comment: Video: https://www.youtube.com/watch?v=7Dg49wv2c_g Presented at
Siggraph'1
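The view- and pose-dependent texturing step rests on a standard
image-based-rendering idea: textures captured from viewpoints close to the
novel view receive higher blend weights. A minimal sketch, with a Gaussian
angular weighting that is an assumption rather than HeadOn's exact scheme:

    import numpy as np

    def view_blend_weights(novel_dir, capture_dirs, sigma=0.3):
        """novel_dir: (3,) unit vector; capture_dirs: (N, 3) unit vectors
        of the viewpoints the target-actor textures were captured from."""
        cos = capture_dirs @ novel_dir          # angular proximity per view
        w = np.exp((cos - 1.0) / (sigma ** 2))  # peaks at identical direction
        return w / w.sum()                      # convex blend weights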
Multi-Scale Video Frame-Synthesis Network with Transitive Consistency Loss
Traditional approaches to interpolate/extrapolate frames in a video sequence
require accurate pixel correspondences between images, e.g., using optical
flow. Their results depend on the accuracy of optical flow estimation, and can
exhibit heavy artifacts when flow estimation fails. Recently, methods using
auto-encoders have shown impressive progress; however, they are usually trained
for specific interpolation/extrapolation settings and lack flexibility. To
reduce these limitations, we propose a unified network that parameterizes the
position of the frame of interest and can therefore infer
interpolated/extrapolated frames within the same framework. To achieve this, we
introduce a transitive consistency loss to better regularize the network. We
adopt a multi-scale structure for the network so that the parameters can be
shared across multi-layers. Our approach avoids expensive global optimization
of optical flow methods, and is efficient and flexible for video
interpolation/extrapolation applications. Experimental results have shown that
our method performs favorably against state-of-the-art methods.
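The transitive consistency idea can be sketched as round trips through a
synthesized frame: interpolating a midpoint and then extrapolating back
should recover the inputs. The net(frames, t) interface and the particular t
values below are assumptions about the parameterization, not the paper's
code.

    import torch
    import torch.nn.functional as F

    def transitive_consistency_loss(net, f0, f1):
        mid = net(torch.cat([f0, f1], dim=1), t=0.5)       # interpolate
        f1_back = net(torch.cat([f0, mid], dim=1), t=2.0)  # extrapolate fwd
        f0_back = net(torch.cat([mid, f1], dim=1), t=-1.0) # extrapolate back
        # Round trips through the synthesized frame should recover inputs.
        return F.l1_loss(f1_back, f1) + F.l1_loss(f0_back, f0)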
DeepFaceFlow: In-the-wild Dense 3D Facial Motion Estimation
Dense 3D facial motion capture from only monocular in-the-wild pairs of RGB
images is a highly challenging problem with numerous applications, ranging from
facial expression recognition to facial reenactment. In this work, we propose
DeepFaceFlow, a robust, fast, and highly-accurate framework for the dense
estimation of 3D non-rigid facial flow between pairs of monocular images. Our
DeepFaceFlow framework was trained and tested on two very large-scale facial
video datasets, one of them of our own collection and annotation, with the
aid of an occlusion-aware and 3D-based loss function. We conduct comprehensive
experiments probing different aspects of our approach and demonstrating its
improved performance against state-of-the-art flow and 3D reconstruction
methods. Furthermore, we incorporate our framework in a full-head
state-of-the-art facial video synthesis method and demonstrate the ability of
our method to better represent and capture the facial dynamics, resulting in
highly realistic facial video synthesis. Given registered pairs of images,
our framework generates 3D flow maps at ~60 fps.
Comment: To be published in the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2020
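An occlusion-aware flow loss of the kind mentioned can be sketched as an
endpoint error averaged only over visible pixels. The mask convention
(1 = visible) and the L2 endpoint error are assumptions, not the paper's
exact loss.

    import torch

    def occlusion_aware_epe(pred_flow, gt_flow, visibility):
        """pred_flow, gt_flow: (B, 3, H, W) 3D flow maps;
        visibility: (B, 1, H, W) binary mask, 1 = visible."""
        # Per-pixel endpoint error of the predicted 3D flow.
        epe = torch.norm(pred_flow - gt_flow, dim=1, keepdim=True)
        # Average only over visible pixels, ignoring occluded regions.
        return (epe * visibility).sum() / visibility.sum().clamp(min=1.0)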
Dynamic Facial Expression Generation on Hilbert Hypersphere with Conditional Wasserstein Generative Adversarial Nets
In this work, we propose a novel approach for generating videos of the six
basic facial expressions given a neutral face image. We propose to exploit the
face geometry by modeling the facial landmarks motion as curves encoded as
points on a hypersphere. By proposing a conditional version of manifold-valued
Wasserstein generative adversarial network (GAN) for motion generation on the
hypersphere, we learn the distribution of facial expression dynamics of
different classes, from which we synthesize new facial expression motions. The
resulting motions can be transformed to sequences of landmarks and then to
image sequences by editing the texture information using another conditional
Generative Adversarial Network. To the best of our knowledge, this is the first
work that explores manifold-valued representations with GAN to address the
problem of dynamic facial expression generation. We evaluate our proposed
approach both quantitatively and qualitatively on two public datasets:
Oulu-CASIA and MUG Facial Expression. Our experimental results demonstrate the
effectiveness of our approach in generating realistic videos with continuous
motion, realistic appearance and identity preservation. We also show the
efficiency of our framework for dynamic facial expression generation, dynamic
facial expression transfer and data augmentation for training improved emotion
recognition models.
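The curve-to-hypersphere encoding alluded to above follows common practice
for landmark trajectories: the square-root-velocity function of a curve, once
normalized, has unit L2 norm and therefore lies on a hypersphere. A minimal
sketch under those (assumed) conventions:

    import numpy as np

    def curve_to_hypersphere(landmarks):
        """landmarks: (T, 2K) stacked 2D landmark coordinates over T frames."""
        velocity = np.diff(landmarks, axis=0)                 # (T-1, 2K)
        speed = np.linalg.norm(velocity, axis=1, keepdims=True)
        srvf = velocity / np.sqrt(np.maximum(speed, 1e-8))    # SRV transform
        q = srvf / np.linalg.norm(srvf)                       # unit L2 norm
        return q  # a point on the hypersphere S^((T-1)*2K - 1)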
Constructing Human Motion Manifold with Sequential Networks
This paper presents a novel recurrent neural network-based method to
construct a latent motion manifold that can represent a wide range of human
motions in a long sequence. We introduce several new components to increase the
spatial and temporal coverage in motion space while retaining the details of
motion capture data. These include new regularization terms for the motion
manifold, combination of two complementary decoders for predicting joint
rotations and joint velocities, and the addition of the forward kinematics
layer to consider both joint rotation and position errors. In addition, we
propose a set of loss terms that improve the overall quality of the motion
manifold from various aspects, such as the capability of reconstructing not
only the motion but also the latent manifold vector, and the naturalness of the
motion through adversarial loss. These components contribute to creating a
compact and versatile motion manifold that allows for creating new motions by
performing random sampling and algebraic operations, such as interpolation and
analogy, in the latent motion manifold.
Comment: 11 pages. To be published in Computer Graphics Forum
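The forward kinematics layer mentioned above can be sketched as accumulating
per-joint rotations down the skeleton so that joint positions, and hence
position errors, become differentiable functions of the rotations. The chain
representation below (rotation matrices, parent indices) is an assumed
convention, not the paper's exact layer.

    import numpy as np

    def forward_kinematics(rotations, offsets, parents):
        """rotations: (J, 3, 3) local joint rotations; offsets: (J, 3) bone
        offsets in the parent frame; parents: parent index per joint
        (-1 = root)."""
        J = len(parents)
        glob_rot = np.zeros((J, 3, 3))
        positions = np.zeros((J, 3))
        for j in range(J):
            p = parents[j]
            if p < 0:  # root joint: its frame is the world frame
                glob_rot[j] = rotations[j]
                positions[j] = offsets[j]
            else:      # accumulate parent transform down the chain
                glob_rot[j] = glob_rot[p] @ rotations[j]
                positions[j] = positions[p] + glob_rot[p] @ offsets[j]
        return positions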