AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction
In this work, we present a multimodal solution to the problem of 4D face
reconstruction from monocular videos. 3D face reconstruction from 2D images is
an under-constrained problem due to the ambiguity of depth. State-of-the-art
methods try to solve this problem by leveraging visual information from a
single image or video, whereas 3D mesh animation approaches rely more on audio.
However, in most cases (e.g. AR/VR applications), videos include both visual
and speech information. We propose AVFace that incorporates both modalities and
accurately reconstructs the 4D facial and lip motion of any speaker, without
requiring any 3D ground truth for training. A coarse stage estimates the
per-frame parameters of a 3D morphable model, a lip refinement stage improves
the lip motion, and a fine stage recovers facial geometric details. Because
transformer-based modules capture temporal audio and video information, our
method remains robust when either modality is insufficient (e.g. under face
occlusions).
Extensive qualitative and quantitative evaluation demonstrates the superiority
of our method over the current state-of-the-art.
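The coarse stage described above regresses per-frame parameters of a 3D morphable model. As a minimal sketch of what such a linear parameterization looks like (the dimensions, random bases, and function names below are illustrative placeholders, not AVFace's actual model):

```python
import numpy as np

# Toy dimensions for illustration; real morphable models (e.g. FLAME or
# BFM) use thousands of vertices and on the order of 100 components.
N_VERTS, N_SHAPE, N_EXPR = 8, 3, 2

rng = np.random.default_rng(0)
mean_shape = rng.normal(size=(N_VERTS, 3))            # mean face vertices
shape_basis = rng.normal(size=(N_VERTS, 3, N_SHAPE))  # identity components
expr_basis = rng.normal(size=(N_VERTS, 3, N_EXPR))    # expression components

def reconstruct(alpha, beta):
    """Linear 3DMM: mean face plus identity and expression offsets."""
    return mean_shape + shape_basis @ alpha + expr_basis @ beta

# Zero coefficients reproduce the mean face; nonzero ones deform it.
neutral = reconstruct(np.zeros(N_SHAPE), np.zeros(N_EXPR))
```

Estimating `alpha` once per video and `beta` (plus pose) per frame is what makes the coarse output a 4D, i.e. time-varying, reconstruction.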
S-VolSDF: Sparse Multi-View Stereo Regularization of Neural Implicit Surfaces
Neural rendering of implicit surfaces performs well in 3D vision
applications. However, it requires dense input views as supervision. When only
sparse input images are available, output quality drops significantly due to
the shape-radiance ambiguity problem. We note that this ambiguity can be
constrained when a 3D point is visible in multiple views, as is the case in
multi-view stereo (MVS). We thus propose to regularize neural rendering
optimization with an MVS solution. The use of an MVS probability volume and a
generalized cross entropy loss leads to a noise-tolerant optimization process.
In addition, neural rendering provides global consistency constraints that
guide the MVS depth hypothesis sampling and thus improve MVS performance.
Given only three sparse input views, experiments show that our method not only
outperforms generic neural rendering models by a large margin but also
significantly increases the reconstruction quality of MVS models. Project
webpage: https://hao-yu-wu.github.io/s-volsdf/
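The noise-tolerant supervision can be sketched as follows. The generalized cross entropy here follows the standard (1 - p^q)/q form of Zhang & Sabuncu, which interpolates between cross entropy (q → 0) and a bounded, MAE-like loss (q = 1); the probability volume, depth hypotheses, and q value are illustrative placeholders, not the paper's settings:

```python
import numpy as np

def gce_loss(p, q=0.7):
    """Generalized cross entropy: (1 - p**q) / q.
    p is the probability the MVS volume assigns to the depth hypothesis
    nearest to the rendered depth. Unlike -log(p), the loss stays bounded
    (at most 1/q) even when p -> 0, so noisy MVS probabilities cannot
    dominate the optimization."""
    return (1.0 - np.power(p, q)) / q

# Hypothetical per-pixel MVS output: a distribution over D depth hypotheses.
D = 4
hyps = np.linspace(0.5, 2.0, D)        # candidate depths (arbitrary units)
vol = np.array([0.1, 0.6, 0.2, 0.1])   # MVS probabilities for one pixel

rendered_depth = 1.1                          # depth from neural rendering
idx = np.abs(hyps - rendered_depth).argmin()  # nearest hypothesis bin
loss = gce_loss(vol[idx])
```

Summing this loss over pixels pulls the rendered depth toward hypotheses the MVS volume considers likely, while the bounded shape tolerates pixels where the volume is simply wrong.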
Learning Probabilistic Topological Representations Using Discrete Morse Theory
Accurate delineation of fine-scale structures is a very important yet
challenging problem. Existing methods use topological information as an
additional training loss, but are ultimately making pixel-wise predictions. In
this paper, we propose the first deep learning based method to learn
topological/structural representations. We use discrete Morse theory and
persistent homology to construct a one-parameter family of structures as the
topological/structural representation space. Furthermore, we learn a
probabilistic model that can perform inference tasks in such a
topological/structural representation space. Our method generates true
structures rather than pixel-maps, leading to better topological integrity in
automatic segmentation tasks. It also facilitates semi-automatic interactive
annotation/proofreading via the sampling of structures and structure-aware
uncertainty. Comment: 16 pages, 11 figures
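The one-parameter family can be pictured with 0-dimensional persistence on a 1D toy signal: every candidate structure (peak) carries a persistence value, and sweeping a threshold ε selects progressively simpler sets of structures. This is a generic persistence sketch using the elder rule, not the paper's discrete-Morse construction:

```python
def peak_persistence(vals):
    """0-dim persistence of the superlevel-set filtration of a 1D signal.
    Each peak is born at its height and dies when its component merges
    into one with a taller peak (the elder rule). Returns a dict mapping
    index -> persistence; only local maxima get positive persistence."""
    order = sorted(range(len(vals)), key=lambda i: vals[i], reverse=True)
    parent, birth, pers = {}, {}, {}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in order:                 # sweep heights from high to low
        parent[i] = i
        birth[i] = vals[i]
        for j in (i - 1, i + 1):    # already-activated neighbors
            if j in parent:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                if birth[ri] < birth[rj]:
                    ri, rj = rj, ri          # rj is the younger component
                pers[rj] = birth[rj] - vals[i]  # it dies at this height
                parent[rj] = ri
    root = find(order[0])
    pers[root] = float("inf")       # the global maximum never dies
    return pers

signal = [0, 2, 1, 3, 0]            # two peaks, heights 2 and 3
pers = peak_persistence(signal)
kept = sorted(i for i, p in pers.items() if p > 0.5)  # indices 1 and 3
```

Varying the threshold (here 0.5) traverses the family: large ε keeps only the most salient structures, small ε admits finer ones, and a probabilistic model over ε-indexed structures yields structure-aware samples rather than independent pixel decisions.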
Patch-level Gaze Distribution Prediction for Gaze Following
Gaze following aims to predict where a person is looking in a scene, by
predicting the target location, or indicating that the target is located
outside the image. Recent works detect the gaze target by training a heatmap
regression task with a pixel-wise mean-square error (MSE) loss, while
formulating the in/out prediction task as a binary classification task. This
training formulation imposes a strict pixel-level constraint at high resolution
on the single annotation available per image, and ignores both annotation
variance and the correlation between the two subtasks. To address
these issues, we introduce the patch distribution prediction (PDP) method. We
replace the in/out prediction branch in previous models with the PDP branch, by
predicting a patch-level gaze distribution that also considers the outside
cases. Experiments show that our model regularizes the MSE loss by predicting
better heatmap distributions on images with larger annotation variances,
while bridging the gap between the target prediction and in/out prediction
subtasks, showing a significant improvement in performance on both subtasks on
public gaze following datasets. Comment: Accepted to WACV 202
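The key move above is replacing a single hard pixel label with a distribution over patches plus one "outside" bin. A minimal sketch of how such a soft target could be built and compared against a prediction (the grid size, Gaussian softening, sigma, and function names are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def patch_target(gaze_xy, grid=7, sigma=0.1, inside=True):
    """Soft target over grid*grid patches plus one 'outside' bin.
    A Gaussian centered on the (normalized) gaze annotation is evaluated
    at each patch center and normalized, spreading the label over nearby
    patches to acknowledge annotation variance."""
    probs = np.zeros(grid * grid + 1)
    if not inside:
        probs[-1] = 1.0            # all mass in the outside bin
        return probs
    centers = (np.arange(grid) + 0.5) / grid
    gx, gy = np.meshgrid(centers, centers)
    d2 = (gx - gaze_xy[0]) ** 2 + (gy - gaze_xy[1]) ** 2
    w = np.exp(-d2 / (2 * sigma ** 2))
    probs[:-1] = (w / w.sum()).ravel()
    return probs

def kl_div(target, pred, eps=1e-9):
    """KL(target || pred): a distribution-matching loss on the patches."""
    return float(np.sum(target * (np.log(target + eps) - np.log(pred + eps))))

t = patch_target((0.5, 0.5))       # gaze at image center -> central patches
```

Because the outside case is just one more bin of the same distribution, a single loss couples target localization and in/out prediction instead of training them as disjoint heads.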