Unmasking Communication Partners: A Low-Cost AI Solution for Digitally Removing Head-Mounted Displays in VR-Based Telepresence
Face-to-face conversation in Virtual Reality (VR) is a challenge when
participants wear head-mounted displays (HMDs): a significant portion of a
participant's face is hidden, and facial expressions are difficult to perceive.
Past research has shown that high-fidelity face reconstruction with personal
avatars in VR is possible under laboratory conditions with high-cost hardware.
In this paper, we propose one of the first low-cost systems for this task,
using only free, open-source software and affordable hardware. Our approach is
to track the user's face underneath the HMD with a Convolutional Neural
Network (CNN) and to generate corresponding expressions with Generative
Adversarial Networks (GANs) that produce RGBD images of the person's face. We
use commodity hardware with low-cost extensions such as 3D-printed mounts and
miniature cameras. Our approach learns end-to-end without manual intervention,
runs in real time, and can be trained and executed on an ordinary gaming
computer. We report evaluation results showing that our low-cost system does
not achieve the same fidelity as research prototypes using high-end hardware
and closed-source software, but it is capable of creating individual facial
avatars with person-specific characteristics in movements and expressions.
Comment: 9 pages, IEEE 3rd International Conference on Artificial Intelligence & Virtual Reality
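As a rough illustration of the pipeline described above (a CNN mapping HMD-mounted camera crops to an expression code, and a GAN generator producing RGBD face images), here is a minimal PyTorch sketch. The module names (ExpressionCNN, RGBDGenerator), layer sizes, and the 64x64 crop resolution are assumptions for illustration, not the system's actual architecture.

```python
# Minimal sketch of a CNN-to-GAN-generator pipeline for HMD face
# reconstruction; all names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class ExpressionCNN(nn.Module):
    """Encodes a grayscale HMD-camera crop into a small expression code."""
    def __init__(self, code_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(), # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, code_dim),
        )
    def forward(self, x):
        return self.net(x)

class RGBDGenerator(nn.Module):
    """GAN generator mapping an expression code to a 4-channel RGBD face."""
    def __init__(self, code_dim=64):
        super().__init__()
        self.fc = nn.Linear(code_dim, 128 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose2d(32, 4, 4, stride=2, padding=1), nn.Tanh(),    # 32 -> 64
        )
    def forward(self, code):
        h = self.fc(code).view(-1, 128, 8, 8)
        return self.net(h)

# Inference: miniature-camera crop in, RGBD face image out.
cnn, gen = ExpressionCNN(), RGBDGenerator()
crop = torch.randn(1, 1, 64, 64)   # stand-in for a camera frame
rgbd = gen(cnn(crop))              # (1, 4, 64, 64): RGB + depth
```

In such a setup the generator would be trained adversarially against a discriminator on captured RGBD footage of the user; only the CNN and generator need to run at inference time.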
AvatarStudio: Text-driven Editing of 3D Dynamic Human Head Avatars
Capturing and editing full head performances enables the creation of virtual
characters for applications such as extended reality and media production. The
past few years have witnessed a steep rise in the photorealism of human head
avatars. Such avatars can be controlled through different input data
modalities, including RGB, audio, depth, IMUs, and others. While these data
modalities provide effective means of control, they mostly focus on editing
head movements such as facial expressions, head pose, and/or camera viewpoint.
In this paper, we propose AvatarStudio, a text-based method for editing the
appearance of a dynamic full head avatar. Our approach builds on existing work
that captures dynamic performances of human heads using a neural radiance
field (NeRF), and edits this representation with a text-to-image diffusion
model. Specifically, we introduce an optimization strategy for incorporating
multiple keyframes, representing different camera viewpoints and timestamps of
a video performance, into a single diffusion model. Using this personalized
diffusion model, we edit the dynamic NeRF by introducing view-and-time-aware
Score Distillation Sampling (VT-SDS) following a model-based guidance
approach. Our method edits the full head in a canonical space, then propagates
these edits to the remaining time steps via a pretrained deformation network.
We evaluate our method visually and numerically via a user study, and results
show that it outperforms existing approaches. Our experiments validate the
design choices of our method and highlight that our edits are genuine and
personalized, as well as 3D- and time-consistent.
Comment: 17 pages, 17 figures. Project page: https://vcai.mpi-inf.mpg.de/projects/AvatarStudio
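For readers unfamiliar with Score Distillation Sampling, the sketch below shows one common way an SDS-style update is implemented in PyTorch, extended with the view and time conditioning that VT-SDS implies. The callables render_nerf and diffusion_eps (and their signatures) are hypothetical stand-ins for the dynamic-NeRF renderer and the personalized diffusion model; the weighting and timestep range are common choices in the SDS literature, not necessarily those of the paper.

```python
# Hedged sketch of one view/time-conditioned SDS optimization step.
# `render_nerf` and `diffusion_eps` are hypothetical stand-ins.
import torch

def sds_step(render_nerf, diffusion_eps, view, time,
             alphas_cumprod, optimizer):
    img = render_nerf(view, time)                   # differentiable render
    t = torch.randint(20, 980, (1,))                # random diffusion step
    a = alphas_cumprod[t].view(1, 1, 1, 1)
    eps = torch.randn_like(img)
    noisy = a.sqrt() * img + (1 - a).sqrt() * eps   # forward diffusion
    with torch.no_grad():                           # score network is frozen
        eps_pred = diffusion_eps(noisy, t, view, time)
    w = 1 - a                                       # common weighting choice
    grad = w * (eps_pred - eps)                     # SDS gradient direction
    # Standard trick: a surrogate loss whose gradient w.r.t. the render
    # equals `grad`, so backprop pushes it into the NeRF parameters.
    loss = (grad.detach() * img).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```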
Video-driven Neural Physically-based Facial Asset for Production
Production-level workflows for producing convincing 3D dynamic human faces
have long relied on an assortment of labor-intensive tools for geometry and
texture generation, motion capture and rigging, and expression synthesis.
Recent neural approaches automate individual components, but the corresponding
latent representations cannot provide artists with explicit controls as in
conventional tools. In this paper, we present a new learning-based,
video-driven approach for generating dynamic facial geometries with
high-quality physically-based assets. For data collection, we construct a
hybrid multiview-photometric capture stage, coupled with ultra-fast video
cameras, to obtain raw 3D facial assets. We then model facial expression,
geometry, and physically-based textures using separate VAEs, imposing a global
MLP-based expression mapping across the latent spaces of the respective
networks to preserve characteristics across attributes. We also model the
delta information as wrinkle maps for the physically-based textures, achieving
high-quality 4K dynamic textures. We demonstrate our approach in high-fidelity
performer-specific facial capture and cross-identity facial motion
retargeting. In addition, our multi-VAE-based neural asset, along with fast
adaptation schemes, can be deployed to handle in-the-wild videos. Furthermore,
we demonstrate the utility of our explicit facial disentangling strategy by
providing various promising physically-based editing results with high
realism. Comprehensive experiments show that our technique provides higher
accuracy and visual fidelity than previous video-driven facial reconstruction
and animation methods.
Comment: For project page, see https://sites.google.com/view/npfa/
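The multi-VAE layout with a shared expression mapping can be pictured with a small PyTorch sketch, given below. Dimensions and the module names (VAE, ExpressionMap) are illustrative assumptions; the wrinkle-map branch and the paper's actual encoders are omitted.

```python
# Minimal sketch of separate per-attribute VAEs tied together by one global
# MLP expression mapping; all sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Tiny VAE over flattened features of one attribute (geometry/texture)."""
    def __init__(self, dim, z=32):
        super().__init__()
        self.enc = nn.Linear(dim, 2 * z)          # predicts mean and log-var
        self.dec = nn.Linear(z, dim)
    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        zz = mu + (0.5 * logvar).exp() * torch.randn_like(mu)
        return self.dec(zz), mu, logvar

class ExpressionMap(nn.Module):
    """One global MLP maps an expression code into each attribute's latent
    space, keeping attributes consistent under a shared expression."""
    def __init__(self, expr_dim=16, z=32, n_attributes=2):
        super().__init__()
        self.n = n_attributes
        self.mlp = nn.Sequential(
            nn.Linear(expr_dim, 64), nn.ReLU(),
            nn.Linear(64, n_attributes * z),
        )
    def forward(self, expr):
        return self.mlp(expr).chunk(self.n, dim=-1)   # one latent per attribute

geom_vae, tex_vae, emap = VAE(300), VAE(300), ExpressionMap()
z_geom, z_tex = emap(torch.randn(1, 16))       # expression -> both latents
geometry, texture = geom_vae.dec(z_geom), tex_vae.dec(z_tex)
```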
Generative RGB-D face completion for head-mounted display removal
Head-mounted displays (HMDs) are an essential display device for the observation of virtual reality (VR) environments. However, HMDs obstruct external capturing methods from recording the user's upper face. This severely impacts social VR applications, such as teleconferencing, which commonly rely on external RGB-D sensors to capture a volumetric representation of the user. In this paper, we introduce an HMD removal framework based on generative adversarial networks (GANs), capable of jointly filling in missing color and depth data in RGB-D face images. Our framework includes an RGB-based identity loss function for identity preservation and several components aimed at surface reproduction. Our results demonstrate that our framework is able to remove HMDs from synthetic RGB-D face images while preserving the subject's identity.
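A hedged sketch of what such a joint RGB-D completion objective can look like in PyTorch is given below. The generator and discriminator interfaces, the face_embedder (standing in for a pretrained face-recognition network used for the identity loss), and the loss weights are all assumptions for illustration, not the paper's exact components.

```python
# Hedged sketch of a GAN-based RGB-D inpainting loss with an RGB identity
# term; `generator`, `discriminator`, and `face_embedder` are hypothetical.
import torch
import torch.nn.functional as F

def completion_loss(generator, discriminator, face_embedder, rgbd, mask):
    """rgbd: (B, 4, H, W) ground truth; mask: 1 where the HMD occludes."""
    masked = rgbd * (1 - mask)                      # zero out occluded region
    fake = generator(torch.cat([masked, mask], 1))  # inpaint RGB + depth jointly
    # Reconstruction on all four channels (color + depth surface).
    l_rec = F.l1_loss(fake, rgbd)
    # Adversarial term (non-saturating GAN loss, generator side).
    l_adv = F.softplus(-discriminator(fake)).mean()
    # RGB-based identity loss: completed face should embed like the real one.
    emb_fake = face_embedder(fake[:, :3])
    emb_real = face_embedder(rgbd[:, :3])
    l_id = 1 - F.cosine_similarity(emb_fake, emb_real).mean()
    return l_rec + 0.1 * l_adv + 0.5 * l_id         # weights are assumptions
```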
Head-mounted displays (HMDs) are an essential display device for the observation of virtual reality (VR) environments. However, HMDs obstruct external capturing methods from recording the user's upper face. This severely impacts social VR applications, such as teleconferencing, which commonly rely on external RGB-D sensors to capture a volumetric representation of the user. In this paper, we introduce an HMD removal framework based on generative adversarial networks (GANs), capable of jointly filling in missing color and depth data in RGB-D face images. Our framework includes an RGB-based identity loss function for identity preservation and several components aimed at surface reproduction. Our results demonstrate that our framework is able to remove HMDs from synthetic RGB-D face images while preserving the subject's identity