EventNeRF: Neural Radiance Fields from a Single Colour Event Camera
Asynchronously operating event cameras find many applications due to their
high dynamic range, absence of motion blur, low latency, and low data bandwidth. The
field has seen remarkable progress during the last few years, and existing
event-based 3D reconstruction approaches recover sparse point clouds of the
scene. However, such sparsity is a limiting factor in many cases, especially in
computer vision and graphics, and it has not been addressed satisfactorily so
far. Accordingly, this paper proposes the first approach for 3D-consistent,
dense and photorealistic novel view synthesis using just a single colour event
stream as input. At the core of our method is a neural radiance field trained
entirely in a self-supervised manner from events while preserving the original
resolution of the colour event channels. Next, our ray sampling strategy is
tailored to events and allows for data-efficient training. At test time, our method
produces results in the RGB space at unprecedented quality. We evaluate our
method qualitatively and quantitatively on several challenging synthetic and
real scenes and show that it produces significantly denser and more visually
appealing renderings than the existing methods. We also demonstrate robustness
in challenging scenarios with fast motion and under low lighting conditions. We
will release our dataset and our source code to facilitate the research field,
see https://4dqv.mpi-inf.mpg.de/EventNeRF/. Comment: 18 pages, 18 figures, 3 tables.
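Event cameras do not record absolute intensities; each pixel fires a signed event whenever its log-brightness changes by a contrast threshold. A radiance field can therefore be supervised by comparing the change in rendered log-intensity between two timestamps against the change implied by the accumulated events. The sketch below illustrates such an event-based rendering loss in general terms; the threshold value, the function names, and the use of a simple squared error are assumptions, not the paper's exact formulation.

```python
import numpy as np

C = 0.25  # assumed event contrast threshold (log-intensity change per event)

def event_loss(rendered_t0, rendered_t1, event_counts):
    """Supervise a radiance field from events: the rendered log-brightness
    change between timestamps t0 and t1 should match the change implied by
    the signed event counts accumulated over (t0, t1]."""
    eps = 1e-6  # guard against log(0)
    predicted = np.log(rendered_t1 + eps) - np.log(rendered_t0 + eps)
    observed = C * event_counts  # each event encodes a +/-C log-step
    return np.mean((predicted - observed) ** 2)

# Toy check: a static view with no events should incur (near) zero loss.
img = np.random.rand(4, 4) + 0.1
print(event_loss(img, img, np.zeros((4, 4))))  # ~0.0
```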
Gradient-based 2D-to-3D Conversion for Soccer Videos
Widespread adoption of 3D videos and technologies is hindered by the lack of high-quality 3D content. One promising solution to address this problem is to use automated 2D-to-3D conversion. However, current conversion methods, while general, produce low-quality results with artifacts that are not acceptable to many viewers. We address this problem by showing how to construct a high-quality, domain-specific conversion method for soccer videos. We propose a novel, data-driven method that generates stereoscopic frames by transferring depth information from similar frames in a database of 3D stereoscopic videos. Creating a database of 3D stereoscopic videos with accurate depth is, however, very difficult. One of the key findings in this paper is showing that computer-generated content in current sports computer games can be used to generate a high-quality 3D video reference database for 2D-to-3D conversion methods. Once we retrieve similar 3D video frames, our technique transfers depth gradients to the target frame while respecting object boundaries. It then computes depth maps from the gradients and generates the output stereoscopic video. We implement our method and validate it by conducting user studies that evaluate depth perception and visual comfort of the converted 3D videos. We show that our method produces high-quality 3D videos that are almost indistinguishable from videos shot by stereo cameras. In addition, our method significantly outperforms the current state-of-the-art method. For example, up to 20% improvement in the perceived depth is achieved by our method, which translates to improving the mean opinion score from Good to Excellent. Funding: Qatar Computing Research Institute-CSAIL Partnership; National Science Foundation (U.S.) (Grant IIS-1111415).
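The last technical step, computing a depth map from transferred gradients, is classically posed as a Poisson problem: find the depth whose gradients best match the transferred ones. Below is a minimal sketch using Jacobi iterations with replicated-edge (Neumann-style) boundaries; the paper does not specify its solver, so this is an illustration of the standard technique, not their implementation.

```python
import numpy as np

def depth_from_gradients(gx, gy, iters=2000):
    """Recover a depth map whose gradients match (gx, gy) by solving the
    Poisson equation laplace(d) = div(g) with Jacobi iterations.
    gx, gy are forward differences; boundaries are handled in Neumann
    fashion by replicating edge values."""
    div = gx.copy()                 # divergence of the gradient field
    div[:, 1:] -= gx[:, :-1]
    div += gy
    div[1:, :] -= gy[:-1, :]

    d = np.zeros_like(gx)
    for _ in range(iters):
        padded = np.pad(d, 1, mode="edge")
        neighbours = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                      padded[1:-1, :-2] + padded[1:-1, 2:])
        d = (neighbours - div) / 4.0
    return d                        # defined up to an additive constant

# Toy check: gradients of a horizontal ramp integrate back to the ramp.
gt = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))
gx = np.zeros_like(gt); gx[:, :-1] = gt[:, 1:] - gt[:, :-1]
gy = np.zeros_like(gt); gy[:-1, :] = gt[1:, :] - gt[:-1, :]
rec = depth_from_gradients(gx, gy)
print(np.abs((rec - rec.mean()) - (gt - gt.mean())).max())  # small residual
```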
GVP: Generative Volumetric Primitives
Advances in 3D-aware generative models have pushed the boundary of image
synthesis with explicit camera control. To achieve high-resolution image
synthesis, several attempts have been made to design efficient generators, such
as hybrid architectures with both 3D and 2D components. However, such a design
compromises multiview consistency, and the design of a pure 3D generator with
high resolution is still an open problem. In this work, we present Generative
Volumetric Primitives (GVP), the first pure 3D generative model that can sample
and render 512-resolution images in real-time. GVP jointly models a number of
volumetric primitives and their spatial information, both of which can be
efficiently generated via a 2D convolutional network. The mixture of these
primitives naturally captures the sparsity and correspondence in the 3D volume.
The training of such a generator with a high degree of freedom is made possible
through a knowledge distillation technique. Experiments on several datasets
demonstrate superior efficiency and 3D consistency of GVP over the
state-of-the-art. Comment: https://vcai.mpi-inf.mpg.de/projects/GVP/index.htm
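The key representational idea is a set of volumetric primitives whose payloads and placements can both be laid out on a 2D grid, and hence produced by a 2D convolutional generator. A minimal sketch of querying such a mixture, assuming axis-aligned box primitives with nearest-neighbour payload lookup; the real generator, blending, and renderer are more involved.

```python
import numpy as np

class Primitive:
    """An axis-aligned volumetric primitive: a small RGBA voxel grid
    placed in the scene at `center` with half-extent `scale`."""
    def __init__(self, center, scale, voxels):
        self.center = np.asarray(center, dtype=float)  # (3,)
        self.scale = float(scale)                      # half side length
        self.voxels = voxels                           # (R, R, R, 4) RGBA

def query(primitives, points):
    """Evaluate the mixture at 3D query points by summing the payloads of
    all primitives containing each point (nearest-neighbour lookup)."""
    out = np.zeros((len(points), 4))
    for p in primitives:
        local = (points - p.center) / p.scale          # map into [-1, 1]^3
        inside = np.all(np.abs(local) <= 1.0, axis=1)
        r = p.voxels.shape[0]
        idx = np.clip(((local[inside] + 1) / 2 * r).astype(int), 0, r - 1)
        out[inside] += p.voxels[idx[:, 0], idx[:, 1], idx[:, 2]]
    return out

# Toy usage: one red primitive at the origin; the sparsity of the mixture
# means empty space costs nothing to represent.
prim = Primitive([0, 0, 0], 0.5, np.tile([1, 0, 0, 1.0], (8, 8, 8, 1)))
pts = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
print(query([prim], pts))  # first point hits the primitive, second misses
```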
FML: Face Model Learning from Videos
Monocular image-based 3D reconstruction of faces is a long-standing problem
in computer vision. Since image data is a 2D projection of a 3D face, the
resulting depth ambiguity makes the problem ill-posed. Most existing methods
rely on data-driven priors that are built from limited 3D face scans. In
contrast, we propose multi-frame video-based self-supervised training of a deep
network that (i) learns a face identity model both in shape and appearance
while (ii) jointly learning to reconstruct 3D faces. Our face model is learned
using only corpora of in-the-wild video clips collected from the Internet. This
virtually endless source of training data enables learning of a highly general
3D face model. In order to achieve this, we propose a novel multi-frame
consistency loss that ensures consistent shape and appearance across multiple
frames of a subject's face, thus minimizing depth ambiguity. At test time we
can use an arbitrary number of frames, so that we can perform both monocular as
well as multi-frame reconstruction. Comment: CVPR 2019 (Oral). Video: https://www.youtube.com/watch?v=SG2BwxCw0lQ,
Project Page: https://gvv.mpi-inf.mpg.de/projects/FML19
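The multi-frame consistency idea is that all frames of one clip share a single identity (shape and appearance) code, while pose and expression vary per frame. A minimal sketch of such a consistency objective, assuming per-frame identity predictions penalized for deviating from their clip mean; the names and the exact loss form are assumptions, not the paper's formulation.

```python
import numpy as np

def multi_frame_consistency_loss(per_frame_identity_codes):
    """Penalize per-frame identity estimates for deviating from the clip
    mean: identity (shape + appearance) must be shared across frames,
    while per-frame pose and expression remain free.

    per_frame_identity_codes: (F, D) identity code predicted per frame.
    """
    mean_code = per_frame_identity_codes.mean(axis=0, keepdims=True)
    return np.mean(np.sum((per_frame_identity_codes - mean_code) ** 2, axis=1))

# Toy check: identical codes incur no penalty; diverging codes do.
codes = np.random.randn(4, 16)
print(multi_frame_consistency_loss(np.tile(codes[:1], (4, 1))))  # 0.0
print(multi_frame_consistency_loss(codes))                       # > 0
```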
An Implicit Parametric Morphable Dental Model
3D morphable models of the human body capture variations among subjects and
are useful in reconstruction and editing applications. Current dental models
use an explicit mesh scene representation and model only the teeth, ignoring
the gum. In this work, we present the first parametric 3D morphable dental
model for both teeth and gum. Our model uses an implicit scene representation
and is learned from rigidly aligned scans. It is based on a component-wise
representation for each tooth and the gum, together with a learnable latent
code for each of these components. It also learns a template shape, thus enabling
several applications such as segmentation, interpolation, and tooth
replacement. Our reconstruction quality is on par with the most advanced global
implicit representations while enabling novel applications. Project page:
https://vcai.mpi-inf.mpg.de/projects/DMM
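A component-wise implicit model can be composed by evaluating one signed-distance decoder per component (each tooth and the gum, conditioned on its own latent code) and taking the minimum, i.e. the union of shapes. A minimal sketch with sphere stand-ins for the learned decoders; the decoders, latent codes, and composition rule shown here are illustrative assumptions.

```python
import numpy as np

def make_sphere_decoder(center, radius):
    """Stand-in for a learned per-component SDF decoder f(x, latent_code);
    here the 'latent code' just fixes a sphere's center and radius."""
    center = np.asarray(center, dtype=float)
    def decoder(points):
        return np.linalg.norm(points - center, axis=1) - radius
    return decoder

def composed_sdf(decoders, points):
    """Union of all components: the minimum of the per-component SDFs.
    The surface of the full dental model is the zero level set."""
    return np.min([d(points) for d in decoders], axis=0)

# Toy usage: two 'teeth' and a 'gum', queried at a few points. Swapping a
# decoder corresponds to applications like tooth replacement.
components = [make_sphere_decoder([-1, 0, 0], 0.4),   # tooth 1
              make_sphere_decoder([+1, 0, 0], 0.4),   # tooth 2
              make_sphere_decoder([0, -1, 0], 0.8)]   # gum
pts = np.array([[-1.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
print(composed_sdf(components, pts))  # inside tooth 1 (<0), far outside (>0)
```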
VideoForensicsHQ: Detecting High-quality Manipulated Face Videos
There are concerns that new approaches to the synthesis of high quality face
videos may be misused to manipulate videos with malicious intent. The research
community therefore developed methods for the detection of modified footage and
assembled benchmark datasets for this task. In this paper, we examine how the
performance of forgery detectors depends on the presence of artefacts that the
human eye can see. We introduce a new benchmark dataset of unprecedented
quality for face video forgery detection. It allows us to demonstrate that existing
detection techniques have difficulties detecting fakes that reliably fool the
human eye. We thus introduce a new family of detectors that examine
combinations of spatial and temporal features and outperform existing
approaches both in terms of detection accuracy and generalization. Comment: ICME 2021 camera-ready.
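A detector of this family can be sketched as a per-frame spatial feature extractor whose outputs are combined with temporal difference features before classification. The extractor, features, and classifier below are placeholder assumptions, not the paper's architecture.

```python
import numpy as np

def spatial_features(frame):
    """Placeholder spatial descriptor: mean and variance of the frame.
    A real detector would use a learned CNN embedding here."""
    return np.array([frame.mean(), frame.var()])

def spatio_temporal_features(video):
    """Combine per-frame spatial features with temporal difference
    features (statistics of frame-to-frame residuals)."""
    spatial = np.array([spatial_features(f) for f in video])   # (F, 2)
    residuals = np.diff(video, axis=0)                         # (F-1, H, W)
    temporal = np.array([[r.mean(), np.abs(r).mean()] for r in residuals])
    # Aggregate over time into one clip-level descriptor.
    return np.concatenate([spatial.mean(axis=0), temporal.mean(axis=0)])

def score(video, weights, bias=0.0):
    """Linear classifier on the combined features; > 0 means 'fake'."""
    return float(spatio_temporal_features(video) @ weights + bias)

# Toy usage: a random clip scored with random classifier weights.
clip = np.random.rand(8, 16, 16)
print(score(clip, weights=np.ones(4)))
```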
LiveHand: Real-time and Photorealistic Neural Hand Rendering
The human hand is the main medium through which we interact with our
surroundings. Hence, its digitization is of utmost importance, with direct
applications in VR/AR, gaming, and media production, amongst other areas. While
there are several works for modeling the geometry and articulations of hands,
little attention has been dedicated to capturing photo-realistic appearance. In
addition, for applications in extended reality and gaming, real-time rendering
is critical. In this work, we present the first neural-implicit approach to
photo-realistically render hands in real-time. This is a challenging problem as
hands are textured and undergo strong articulations with various pose-dependent
effects. However, we show that this can be achieved through our carefully
designed method. This includes training on a low-resolution rendering of a
neural radiance field, together with a 3D-consistent super-resolution module
and mesh-guided space canonicalization and sampling. In addition, we show that
the novel application of a perceptual loss in image space is critical for
achieving photorealism. We show rendering results for several identities, and
demonstrate that our method captures pose- and view-dependent appearance
effects. We also show a live demo of our method where we photo-realistically
render the human hand in real-time for the first time in the literature. We ablate
all our design choices and show that our design optimizes for both photorealism
and rendering speed. Our code will be released to encourage further research in
this area. Comment: 11 pages, 8 figures.
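Training couples a photometric loss on the low-resolution radiance-field rendering with a perceptual loss on the super-resolved image. A minimal sketch of such a combined objective, with a hand-rolled gradient-feature stand-in for a real perceptual (e.g. deep-feature) loss; the weighting and feature choice are assumptions.

```python
import numpy as np

def perceptual_features(img):
    """Stand-in for a pretrained feature extractor: image gradients.
    A real perceptual loss compares deep-network feature maps instead."""
    return img[:, 1:] - img[:, :-1], img[1:, :] - img[:-1, :]

def perceptual_loss(pred, target):
    pgx, pgy = perceptual_features(pred)
    tgx, tgy = perceptual_features(target)
    return np.mean((pgx - tgx) ** 2) + np.mean((pgy - tgy) ** 2)

def total_loss(lowres_render, lowres_gt, sr_output, highres_gt, w_perc=0.1):
    """Photometric loss on the low-resolution NeRF rendering plus a
    perceptual loss on the super-resolved image (assumed weight w_perc)."""
    photometric = np.mean((lowres_render - lowres_gt) ** 2)
    return photometric + w_perc * perceptual_loss(sr_output, highres_gt)

# Toy usage with random stand-in images: 16x16 low-res renderings and
# 64x64 super-resolved outputs against their ground truths.
lr, hr = np.random.rand(2, 16, 16), np.random.rand(2, 64, 64)
print(total_loss(lr[0], lr[1], hr[0], hr[1]))
```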