AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction
In this work, we present a multimodal solution to the problem of 4D face
reconstruction from monocular videos. 3D face reconstruction from 2D images is
an under-constrained problem due to the ambiguity of depth. State-of-the-art
methods try to solve this problem by leveraging visual information from a
single image or video, whereas 3D mesh animation approaches rely more on audio.
However, in most cases (e.g. AR/VR applications), videos include both visual
and speech information. We propose AVFace that incorporates both modalities and
accurately reconstructs the 4D facial and lip motion of any speaker, without
requiring any 3D ground truth for training. A coarse stage estimates the
per-frame parameters of a 3D morphable model, followed by a lip refinement, and
then a fine stage recovers facial geometric details. Due to the temporal audio
and video information captured by transformer-based modules, our method is
robust in cases when either modality is insufficient (e.g. face occlusions).
Extensive qualitative and quantitative evaluation demonstrates the superiority
of our method over the current state of the art.
S-VolSDF: Sparse Multi-View Stereo Regularization of Neural Implicit Surfaces
Neural rendering of implicit surfaces performs well in 3D vision
applications. However, it requires dense input views as supervision. When only
sparse input images are available, output quality drops significantly due to
the shape-radiance ambiguity problem. We note that this ambiguity can be
constrained when a 3D point is visible in multiple views, as is the case in
multi-view stereo (MVS). We thus propose to regularize neural rendering
optimization with an MVS solution. The use of an MVS probability volume and a
generalized cross entropy loss leads to a noise-tolerant optimization process.
In addition, neural rendering provides global consistency constraints that
guide the MVS depth hypothesis sampling and thus improves MVS performance.
Given only three sparse input views, experiments show that our method not only
outperforms generic neural rendering models by a large margin but also
significantly increases the reconstruction quality of MVS models. Project
webpage: https://hao-yu-wu.github.io/s-volsdf/
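The noise tolerance attributed above to the generalized cross entropy loss can be illustrated with its commonly cited form, L_q(p) = (1 - p^q) / q, where p is the probability assigned to the target class. The sketch below is a minimal illustration of that formula only. The abstract does not say how the loss is applied to the MVS probability volume, so the function name, the default q = 0.7, and the toy probabilities are all assumptions.

```python
import math


def generalized_cross_entropy(probs, target, q=0.7):
    """Generalized cross entropy for one sample: (1 - p_target**q) / q.

    probs  : sequence of class probabilities (assumed to sum to 1)
    target : index of the target class
    q      : in (0, 1]; q -> 0 approaches standard cross entropy,
             q = 1 gives 1 - p_target (an MAE-like, bounded loss)
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    p = probs[target]
    return (1.0 - p ** q) / q


# Hypothetical toy distribution over three depth hypotheses.
probs = [0.1, 0.7, 0.2]

loss_bounded = generalized_cross_entropy(probs, target=1, q=1.0)
loss_near_ce = generalized_cross_entropy(probs, target=1, q=1e-6)
cross_entropy = -math.log(probs[1])
```

For q = 1 the loss never exceeds 1, so a confidently wrong (noisy) target contributes a limited penalty, whereas standard cross entropy grows without bound as the target probability approaches zero. Intermediate values of q trade off between the two behaviors.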
Learning Probabilistic Topological Representations Using Discrete Morse Theory
Accurate delineation of fine-scale structures is a very important yet
challenging problem. Existing methods use topological information as an
additional training loss, but are ultimately making pixel-wise predictions. In
this paper, we propose the first deep-learning-based method to learn
topological/structural representations. We use discrete Morse theory and
persistent homology to construct a one-parameter family of structures as the
topological/structural representation space. Furthermore, we learn a
probabilistic model that can perform inference tasks in such a
topological/structural representation space. Our method generates true
structures rather than pixel-maps, leading to better topological integrity in
automatic segmentation tasks. It also facilitates semi-automatic interactive
annotation/proofreading via the sampling of structures and structure-aware
uncertainty. Comment: 16 pages, 11 figures
Conditional Generation from Unconditional Diffusion Models using Denoiser Representations
Denoising diffusion models have gained popularity as a generative modeling
technique for producing high-quality and diverse images. Applying these models
to downstream tasks requires conditioning, which can take the form of text,
class labels, or other forms of guidance. However, providing conditioning
information to these models can be challenging, particularly when annotations
are scarce or imprecise. In this paper, we propose adapting pre-trained
unconditional diffusion models to new conditions using the learned internal
representations of the denoiser network. We demonstrate the effectiveness of
our approach on various conditional generation tasks, including
attribute-conditioned generation and mask-conditioned generation. Additionally,
we show that augmenting the Tiny ImageNet training set with synthetic images
generated by our approach improves the classification accuracy of ResNet
baselines by up to 8%. Our approach provides a powerful and flexible way to
adapt diffusion models to new conditions and generate high-quality augmented
data for various conditional generation tasks.