AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction
In this work, we present a multimodal solution to the problem of 4D face
reconstruction from monocular videos. 3D face reconstruction from 2D images is
an under-constrained problem due to the ambiguity of depth. State-of-the-art
methods try to solve this problem by leveraging visual information from a
single image or video, whereas 3D mesh animation approaches rely more on audio.
However, in most cases (e.g. AR/VR applications), videos include both visual
and speech information. We propose AVFace that incorporates both modalities and
accurately reconstructs the 4D facial and lip motion of any speaker, without
requiring any 3D ground truth for training. A coarse stage estimates the
per-frame parameters of a 3D morphable model, followed by a lip refinement, and
then a fine stage recovers facial geometric details. Because transformer-based
modules capture temporal information from both audio and video, our method
remains robust when either modality is insufficient (e.g. face occlusions).
Extensive qualitative and quantitative evaluation demonstrates the superiority
of our method over the current state-of-the-art.
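A minimal sketch (assumptions: PyTorch, the feature sizes, and a FLAME-style parameter head; not the authors' code) of how such a coarse stage could fuse per-frame audio and visual features with a transformer before regressing 3DMM parameters:

```python
import torch
import torch.nn as nn

class CoarseAVStage(nn.Module):
    """Hypothetical coarse stage: audio-visual fusion -> per-frame 3DMM params."""
    def __init__(self, d_audio=768, d_visual=512, d_model=256, n_params=156):
        super().__init__()
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.visual_proj = nn.Linear(d_visual, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=4)
        # Regress shape/expression/pose coefficients for every frame.
        self.head = nn.Linear(d_model, n_params)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, T, d_audio), resampled to the video frame rate;
        # visual_feats: (B, T, d_visual). Summing the projected streams keeps
        # both usable even when one is weak (e.g. an occluded face).
        tokens = self.audio_proj(audio_feats) + self.visual_proj(visual_feats)
        fused = self.temporal(tokens)   # temporal context across frames
        return self.head(fused)         # (B, T, n_params)
```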
Reconstruction and Synthesis of Human-Scene Interaction
In this thesis, we argue that the 3D scene is vital for understanding, reconstructing, and synthesizing human motion. We present several approaches that take the scene into account when reconstructing and synthesizing Human-Scene Interaction (HSI). We first observe that state-of-the-art pose estimation methods ignore the 3D scene and hence reconstruct poses that are inconsistent with it. We address this by proposing a pose estimation method that explicitly takes the 3D scene into account. We call our method PROX for Proximal Relationships with Object eXclusion. We leverage the data generated using PROX and build a method to automatically place 3D scans of clothed people in scenes. The core novelty of our method is encoding the proximal relationships between the human and the scene in a novel HSI model, called POSA for Pose with prOximitieS and contActs. POSA, however, is limited to static HSI. We therefore propose a real-time method for synthesizing dynamic HSI, which we call SAMP for Scene-Aware Motion Prediction. SAMP enables virtual humans to navigate cluttered indoor scenes and naturally interact with objects. Data-driven kinematic models like SAMP can produce high-quality motion when applied in environments similar to those seen during training. When applied to new scenarios, however, kinematic models can struggle to generate realistic behaviors that respect scene constraints. In contrast, we present InterPhys, which uses adversarial imitation learning and reinforcement learning to train physically simulated characters that perform scene-interaction tasks in a physical and life-like manner.
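The scene-awareness idea behind PROX can be illustrated with two simple penalty terms; the sketch below (a rough PyTorch assumption, not the thesis code) penalizes body vertices that penetrate a precomputed scene signed distance field and encourages annotated contact vertices to touch nearby surfaces:

```python
import torch

def scene_losses(sdf_values, contact_idx, rho=0.02):
    # sdf_values: (N,) signed distance of each posed body vertex to the
    # scene (negative = inside geometry), queried from a scene SDF.
    # contact_idx: indices of vertices likely in contact (e.g. feet, hips).
    penetration = torch.relu(-sdf_values).sum()  # push vertices out of objects
    contact = sdf_values[contact_idx].abs().clamp(max=rho).sum()  # pull contacts in
    return penetration, contact
```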
FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion
Speech-driven 3D facial animation synthesis has been a challenging task in both
industry and research. Recent methods mostly focus on deterministic deep
learning approaches, meaning that given a speech input, the output is always
the same. In reality, however, the non-verbal facial cues that reside
throughout the face are non-deterministic in nature. In addition, the majority
of approaches focus on 3D vertex-based datasets, and methods compatible with
existing facial animation pipelines using rigged characters are scarce. To
address these issues, we present FaceDiffuser, a non-deterministic deep
learning model for generating speech-driven facial animations that is trained
with both 3D vertex-based and blendshape-based datasets. Our method is based on
the diffusion technique and uses the pre-trained large speech representation
model HuBERT to encode the audio input. To the best of our knowledge, we are
the first to employ the diffusion method for the task of speech-driven 3D
facial animation synthesis. We have run extensive objective and subjective
analyses and show that our approach achieves results better than or comparable
to the state-of-the-art methods. We also introduce a new in-house dataset based
on a blendshape-based rigged character. We recommend watching the accompanying
supplementary video. The code and the dataset will be publicly available.
Comment: Pre-print of the paper accepted at ACM SIGGRAPH MIG 2023
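An illustrative sketch (the tensor shapes, denoiser module, and DDPM-style noise schedule are assumptions, not the released code) of one conditional diffusion training step with pre-extracted HuBERT features:

```python
import torch
import torch.nn.functional as F

def diffusion_step(denoiser, motion, audio_feats, alphas_cumprod):
    # motion: (B, T, D) vertex offsets or blendshape weights;
    # audio_feats: (B, T, C) HuBERT features aligned to the animation frames.
    B = motion.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=motion.device)
    a_bar = alphas_cumprod[t].view(B, 1, 1)
    noise = torch.randn_like(motion)
    noisy = a_bar.sqrt() * motion + (1 - a_bar).sqrt() * noise  # forward process
    pred = denoiser(noisy, t, audio_feats)  # audio-conditioned denoiser
    return F.mse_loss(pred, noise)          # standard epsilon-prediction loss
```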
Enabling Neuromorphic Computing for Artificial Intelligence with Hardware-Software Co-Design
In the last decade, neuromorphic computing has been revitalized by the emergence of novel nano-devices and hardware-software co-design approaches. With the fast advancement of algorithms for today’s artificial intelligence (AI) applications, deep neural networks (DNNs) have become the mainstream technology. Enabling neuromorphic designs that compute DNNs with high speed and energy efficiency has become a new research trend. In this chapter, we summarize recent advances in neuromorphic computing hardware and system designs built on non-volatile resistive random-access memory (ReRAM) devices. More specifically, we discuss ReRAM-based neuromorphic computing hardware and system implementations, hardware-software co-design approaches for quantized and sparse DNNs, and architecture designs.
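As a rough illustration of why ReRAM suits DNN workloads, the sketch below (NumPy, with an illustrative differential conductance mapping) models the analog matrix-vector multiply a crossbar performs: weights are stored as conductances, input voltages drive the rows, and Kirchhoff's current law sums the column currents:

```python
import numpy as np

def crossbar_mvm(weights, inputs, g_max=1e-4):
    # Map signed weights onto a pair of positive conductance arrays (G+, G-).
    scale = g_max / np.abs(weights).max()
    g_pos = np.clip(weights, 0, None) * scale
    g_neg = np.clip(-weights, 0, None) * scale
    # Column current I_j = sum_i V_i * G_ij (Ohm's law + current summation),
    # computed for both arrays and subtracted to recover signed weights.
    i_out = inputs @ g_pos - inputs @ g_neg
    return i_out / scale  # de-scale back to the digital domain

y = crossbar_mvm(np.random.randn(64, 32), np.random.randn(64))
```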
Subtle Signals: Video-based Detection of Infant Non-nutritive Sucking as a Neurodevelopmental Cue
Non-nutritive sucking (NNS), which refers to the act of sucking on a
pacifier, finger, or similar object without nutrient intake, plays a crucial
role in assessing healthy early development. In the case of preterm infants,
NNS behavior is a key component in determining their readiness for feeding. In
older infants, the characteristics of NNS behavior offer valuable insights into
neural and motor development. Additionally, NNS activity has been proposed as a
potential safeguard against sudden infant death syndrome (SIDS). However, the
clinical application of NNS assessment is currently hindered by labor-intensive
and subjective finger-in-mouth evaluations. Consequently, researchers often
resort to expensive pressure transducers for objective NNS signal measurement.
To enhance the accessibility and reliability of NNS signal monitoring for both
clinicians and researchers, we introduce a vision-based algorithm designed for
non-contact detection of NNS activity using baby monitor footage in natural
settings. Our approach involves a comprehensive exploration of optical flow and
temporal convolutional networks, enabling the detection and amplification of
subtle infant-sucking signals. We successfully classify short video clips of
uniform length into NNS and non-NNS periods. Furthermore, we investigate manual
and learning-based techniques to piece together local classification results,
facilitating the segmentation of longer mixed-activity videos into NNS and
non-NNS segments of varying duration. Our research introduces two novel
datasets of annotated infant videos, including one sourced from our clinical
study featuring 19 infant subjects and 183 hours of overnight baby monitor
footage.
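A hedged sketch of the clip-level pipeline described above, assuming OpenCV's Farneback optical flow as a stand-in for the authors' flow method and a hypothetical small temporal convolutional network:

```python
import cv2
import numpy as np
import torch
import torch.nn as nn

def flow_signal(frames):
    # frames: list of grayscale uint8 images; returns a (T-1, 2) series of
    # mean per-frame flow vectors summarizing subtle mouth/jaw motion.
    sig = []
    for prev, nxt in zip(frames, frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        sig.append(flow.reshape(-1, 2).mean(axis=0))
    return np.stack(sig)

class TCNClassifier(nn.Module):
    """Dilated temporal convolutions over the flow signal: NNS vs. non-NNS."""
    def __init__(self, in_ch=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, hidden, 5, padding=4, dilation=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 5, padding=8, dilation=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(hidden, 2))

    def forward(self, x):   # x: (B, 2, T-1) transposed flow signal
        return self.net(x)  # logits for the two classes
```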
Semantify: Simplifying the Control of 3D Morphable Models using CLIP
We present Semantify: a self-supervised method that utilizes the semantic
power of the CLIP language-vision foundation model to simplify the control of 3D
morphable models. Given a parametric model, training data is created by
randomly sampling the model's parameters, creating various shapes and rendering
them. The similarity between the output images and a set of word descriptors is
calculated in CLIP's latent space. Our key idea is first to choose a small set
of semantically meaningful and disentangled descriptors that characterize the
3DMM, and then learn a non-linear mapping from scores across this set to the
parametric coefficients of the given 3DMM. The non-linear mapping is defined by
training a neural network without a human-in-the-loop. We present results on
numerous 3DMMs: body shape models, face shape and expression models, as well as
animal shapes. We demonstrate how our method defines a simple slider interface
for intuitive modeling, and show how the mapping can be used to instantly fit a
3D parametric body shape to in-the-wild images.
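A minimal sketch of the core idea, assuming the open-source CLIP package; the descriptor list, renderer, and mapper sizes here are placeholders (the real method selects a small, semantically disentangled descriptor set per 3DMM):

```python
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
descriptors = ["muscular", "petite", "broad shoulders", "long torso"]
with torch.no_grad():
    text_feats = model.encode_text(clip.tokenize(descriptors).to(device))
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

def descriptor_scores(image):
    # image: PIL render of a randomly sampled 3DMM shape.
    img = preprocess(image).unsqueeze(0).to(device)
    with torch.no_grad():
        f = model.encode_image(img)
        f = f / f.norm(dim=-1, keepdim=True)
    return (f @ text_feats.T).squeeze(0)  # one similarity per descriptor

# Non-linear mapping from descriptor scores to 3DMM coefficients
# (e.g. 10 body-shape betas), trained on the rendered samples.
mapper = torch.nn.Sequential(
    torch.nn.Linear(len(descriptors), 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 10))
```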
PhoMoH: Implicit Photorealistic 3D Models of Human Heads
We present PhoMoH, a neural network methodology to construct generative
models of photo-realistic 3D geometry and appearance of human heads including
hair, beards, an oral cavity, and clothing. In contrast to prior work, PhoMoH
models the human head using neural fields, thus supporting complex topology.
Instead of learning a head model from scratch, we propose to augment an
existing expressive head model with new features. Concretely, we learn a highly
detailed geometry network layered on top of a mid-resolution head model
together with a detailed, local geometry-aware, and disentangled color field.
Our proposed architecture allows us to learn photo-realistic human head models
from relatively little data. The learned generative geometry and appearance
networks can be sampled individually and enable the creation of diverse and
realistic human heads. Extensive experiments validate our method qualitatively
and across different metrics.
Comment: To be published at the International Conference on 3D Vision 2024
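A rough sketch of the layering idea under stated assumptions (module names, feature sizes, and the distance-field base model are illustrative): a detail network refines the distance predicted by a frozen mid-resolution head model, and a separate color head is conditioned on local geometry features:

```python
import torch
import torch.nn as nn

class DetailField(nn.Module):
    def __init__(self, d_feat=64):
        super().__init__()
        self.delta = nn.Sequential(
            nn.Linear(3 + d_feat, 128), nn.Softplus(),
            nn.Linear(128, 1))                 # residual signed distance
        self.color = nn.Sequential(
            nn.Linear(3 + d_feat, 128), nn.Softplus(),
            nn.Linear(128, 3), nn.Sigmoid())   # geometry-aware RGB

    def forward(self, x, base_dist, feat):
        # x: (N, 3) query points; base_dist: (N, 1) from the frozen base
        # head model; feat: (N, d_feat) local geometry features.
        h = torch.cat([x, feat], dim=-1)
        return base_dist + self.delta(h), self.color(h)
```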
Training and Tuning Generative Neural Radiance Fields for Attribute-Conditional 3D-Aware Face Generation
Generative Neural Radiance Fields (GNeRF) based 3D-aware GANs have
demonstrated remarkable capabilities in generating high-quality images while
maintaining strong 3D consistency. Notably, significant advancements have been
made in the domain of face generation. However, most existing models prioritize
view consistency over disentanglement, resulting in limited semantic/attribute
control during generation. To address this limitation, we propose a conditional
GNeRF model incorporating specific attribute labels as input to enhance the
controllability and disentanglement abilities of 3D-aware generative models.
Our approach builds upon a pre-trained 3D-aware face model, and we introduce a
Training as Init and Optimizing for Tuning (TRIOT) method: we train a
conditional normalizing flow module to enable facial attribute editing, then
optimize the latent vector to further improve attribute-editing precision. Our
extensive experiments demonstrate that our model produces high-quality edits
with superior view consistency while preserving non-target regions. Code is
available at https://github.com/zhangqianhui/TT-GNeRF.
Comment: 13 pages
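A hedged sketch of the TRIOT pattern (the generator, flow module, and masked preservation loss are assumptions, not the repository code): the flow's output initializes a per-image latent optimization that sharpens the edit while penalizing changes outside the target region:

```python
import torch

def triot_tune(generator, flow, w, attrs, image, mask, steps=100, lr=0.01):
    # Training-as-init: start from the conditional flow's edited latent.
    w_edit = flow(w, attrs).detach().requires_grad_(True)
    opt = torch.optim.Adam([w_edit], lr=lr)
    # Optimizing-for-tuning: refine the latent against the source image.
    for _ in range(steps):
        rendered = generator(w_edit)
        # mask is 1 on non-target regions, so the loss preserves them.
        loss = ((rendered - image) * mask).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w_edit
```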