High-Quality 3D Face Reconstruction with Affine Convolutional Networks
Recent works based on convolutional encoder-decoder architecture and 3DMM
parameterization have shown great potential for canonical view reconstruction
from a single input image. Conventional CNN architectures benefit from
exploiting the spatial correspondence between the input and output pixels.
However, in 3D face reconstruction, the spatial misalignment between the input
image (e.g. face) and the canonical/UV output makes the feature
encoding-decoding process quite challenging. In this paper, to tackle this
problem, we propose a new network architecture, namely the Affine Convolution
Networks, which enables CNN-based approaches to handle spatially
non-corresponding input and output images while maintaining high-fidelity
output. In our method, an affine transformation matrix is
learned from the affine convolution layer for each spatial location of the
feature maps. In addition, we represent 3D human heads in UV space with
multiple components, including diffuse maps for texture representation,
position maps for geometry representation, and light maps for recovering more
complex lighting conditions in the real world. All the components can be
trained without any manual annotations. Our method is parametric-free and can
generate high-quality UV maps at a resolution of 512 x 512 pixels, whereas
previous approaches typically produce 256 x 256 pixels or smaller. Our code
will be released once the paper is accepted.
Comment: 9 pages, 11 figures
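As a rough illustration of the per-location affine mechanism described above (an assumed reading of the abstract, not the authors' released code; the class and layer names below are hypothetical), the sketch predicts a 2x3 affine matrix at every spatial position and uses it to warp the sampling grid before a regular convolution:

```python
# Minimal sketch, assuming per-pixel affine matrices drive a warped feature sampling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineSamplingConv(nn.Module):
    """Illustrative stand-in for an 'affine convolution' layer (names are hypothetical)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.affine_head = nn.Conv2d(in_ch, 6, kernel_size=1)        # 2x3 matrix per pixel
        self.feature_conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        b, _, h, w = x.shape
        theta = self.affine_head(x).view(b, 2, 3, h, w)              # per-location affine

        # Identity grid in normalized [-1, 1] coordinates, homogeneous form (x, y, 1).
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device),
            indexing="ij",
        )
        base = torch.stack([xs, ys, torch.ones_like(xs)], dim=0)     # (3, H, W)

        # Each output location transforms its own coordinate with its own matrix.
        grid = torch.einsum("bijhw,jhw->bihw", theta, base)          # (B, 2, H, W)
        grid = grid.permute(0, 2, 3, 1)                              # (B, H, W, 2)

        warped = F.grid_sample(x, grid, align_corners=True)          # resample misaligned input
        return self.feature_conv(warped)

out = AffineSamplingConv(32, 64)(torch.randn(1, 32, 64, 64))         # toy forward pass
```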
Geometric Pseudospectral Method on SE(3) for Rigid-Body Dynamics with Application to Aircraft
The general pseudospectral method is extended to the special Euclidean group SE(3) by means of an equivariant map for the rigid-body dynamics of aircraft. On SE(3), a complete left-invariant rigid-body dynamics model of the aircraft in the body-fixed frame is established, comprising a configuration model and a velocity model. Exploiting the left invariance of the configuration model, an equivalent Lie algebra equation corresponding to the configuration equation is derived from the left-trivialized tangent of the local coordinate map, and the truncated Magnus series expansion of its solution, together with the coefficients up to eighth order, is given. A numerical method, the geometric pseudospectral method, is developed; it computes configurations and velocities at the collocation points and at the endpoint using two different collocation strategies. Numerical tests on free-floating rigid-body dynamics, compared with several classical methods of the same order in Euclidean space and on the Lie group, show that the proposed method achieves higher accuracy, satisfactory computational efficiency, and stable preservation of the Lie group structure. Finally, the application of this discretization scheme to rigid-body dynamics simulation and control of the aircraft is illustrated.
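For readers unfamiliar with structure-preserving integrators, the toy sketch below advances a rotation through the exponential map so the iterate stays exactly on the group; it is only a first-order (one-term Magnus) illustration on SO(3), not the eighth-order Magnus pseudospectral scheme of the paper:

```python
# Minimal sketch: Lie-Euler step on SO(3), assuming a constant body angular velocity.
import numpy as np
from scipy.linalg import expm

def hat(w):
    """Map w in R^3 to its skew-symmetric matrix in so(3)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def lie_euler_step(R, omega_body, h):
    """R_{k+1} = R_k expm(h * hat(omega)); the update never leaves the group."""
    return R @ expm(h * hat(omega_body))

R = np.eye(3)
omega = np.array([0.1, 0.2, -0.05])            # body angular velocity (rad/s)
for _ in range(1000):
    R = lie_euler_step(R, omega, h=0.01)

print(np.linalg.norm(R.T @ R - np.eye(3)))     # orthogonality drift stays near machine precision
```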
BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation Knowledge
Given an audio-visual pair, audio-visual segmentation (AVS) aims to locate
sounding sources by predicting pixel-wise maps. Previous methods assume that
each sound component in an audio signal always has a visual counterpart in the
image. However, this assumption overlooks that off-screen sounds and background
noise often contaminate the audio recordings in real-world scenarios. They
impose significant challenges on building a consistent semantic mapping between
audio and visual signals for AVS models and thus impede precise sound
localization. In this work, we propose a two-stage bootstrapping audio-visual
segmentation framework by incorporating multi-modal foundation knowledge. In a
nutshell, our BAVS is designed to eliminate the interference of background
noise or off-screen sounds in segmentation by establishing the audio-visual
correspondences in an explicit manner. In the first stage, we employ a
segmentation model to localize potential sounding objects from visual data
without being affected by contaminated audio signals. Meanwhile, we also
utilize a foundation audio classification model to discern audio semantics.
Considering the audio tags provided by the audio foundation model are noisy,
associating object masks with audio tags is not trivial. Thus, in the second
stage, we develop an audio-visual semantic integration strategy (AVIS) to
localize the authentic-sounding objects. Here, we construct an audio-visual
tree based on the hierarchical correspondence between sounds and object
categories. We then examine the label concurrency between the localized objects
and classified audio tags by tracing the audio-visual tree. With AVIS, we can
effectively segment real-sounding objects. Extensive experiments demonstrate
the superiority of our method on AVS datasets, particularly in scenarios
involving background noise. Our project website is
https://yenanliu.github.io/AVSS.github.io/
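A toy sketch of the second-stage label-concurrency idea described above (the tiny taxonomy and names below are invented for illustration and are not the paper's tree): an object mask survives only if its visual category and some classified audio tag trace to the same node of the audio-visual tree:

```python
# Hypothetical two-level audio-visual tree: leaf label -> parent sound category.
AUDIO_PARENT = {"dog_bark": "animal_sound", "meow": "animal_sound", "engine": "vehicle_sound"}
VISUAL_PARENT = {"dog": "animal_sound", "cat": "animal_sound", "car": "vehicle_sound"}

def concurrent(visual_label, audio_tags):
    """True if the object's category shares a tree node with any classified audio tag."""
    node = VISUAL_PARENT.get(visual_label)
    return node is not None and any(AUDIO_PARENT.get(tag) == node for tag in audio_tags)

def keep_sounding_objects(object_masks, audio_tags):
    """Filter localized object masks down to those judged to be actually sounding."""
    return {label: mask for label, mask in object_masks.items()
            if concurrent(label, audio_tags)}

masks = {"dog": "mask_A", "car": "mask_B"}                    # placeholders for pixel masks
print(keep_sounding_objects(masks, audio_tags=["dog_bark"]))  # -> {'dog': 'mask_A'}
```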
EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture Generation
Generating vivid and diverse 3D co-speech gestures is crucial for various
applications in animating virtual avatars. While most existing methods can
generate gestures from audio directly, they usually overlook that emotion is
one of the key factors of authentic co-speech gesture generation. In this work,
we propose EmotionGesture, a novel framework for synthesizing vivid and diverse
emotional co-speech 3D gestures from audio. Considering emotion is often
entangled with the rhythmic beat in speech audio, we first develop an
Emotion-Beat Mining module (EBM) to extract the emotion and audio beat features
as well as model their correlation via a transcript-based visual-rhythm
alignment. Then, we propose an initial-pose-based Spatial-Temporal Prompter
(STP) to generate future gestures from the given initial poses. STP effectively
models the spatial-temporal correlations between the initial poses and the
future gestures, producing a spatial-temporally coherent pose prompt. Once we
obtain the pose prompts, emotion features, and audio beat features, we generate
3D co-speech gestures through a transformer architecture. However, the poses in
existing datasets often contain jitter, which would lead to unstable generated
gestures. To address this issue, we propose an effective objective function,
dubbed the Motion-Smooth Loss: we model the motion offsets to compensate for
the jittering ground truth, forcing the generated gestures to be smooth.
Finally, we present an emotion-conditioned VAE to sample emotion features,
enabling us to generate diverse emotional results. Extensive experiments
demonstrate that our framework outperforms the state-of-the-art, achieving
vivid and diverse emotional co-speech 3D gestures.
Comment: Under review
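A minimal sketch of a smoothness objective in the spirit of the Motion-Smooth Loss described above (the exact formulation and weighting in the paper may differ): penalizing second-order temporal differences of the predicted pose sequence damps frame-to-frame jitter without constraining the overall trajectory:

```python
# Sketch only: jitter penalty on predicted co-speech poses, assuming (B, T, J, 3) input.
import torch

def motion_smooth_loss(poses):
    """poses: (batch, frames, joints, 3) predicted joint positions."""
    offset = poses[:, 1:] - poses[:, :-1]      # frame-to-frame motion offsets
    accel = offset[:, 1:] - offset[:, :-1]     # change of offsets, a proxy for jitter
    return accel.abs().mean()

pred = torch.randn(2, 60, 24, 3, requires_grad=True)   # toy batch: 60 frames, 24 joints
motion_smooth_loss(pred).backward()
```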
NeFII: Inverse Rendering for Reflectance Decomposition with Near-Field Indirect Illumination
Inverse rendering methods aim to estimate geometry, materials and
illumination from multi-view RGB images. In order to achieve better
decomposition, recent approaches attempt to model indirect illuminations
reflected from different materials via Spherical Gaussians (SG), which,
however, tends to blur the high-frequency reflection details. In this paper, we
propose an end-to-end inverse rendering pipeline that decomposes materials and
illumination from multi-view images, while considering near-field indirect
illumination. In a nutshell, we introduce Monte Carlo sampling-based path
tracing and cache the indirect illumination as neural radiance, enabling a
physics-faithful and easy-to-optimize inverse rendering method. To enhance
efficiency and practicality, we leverage SG to represent the smooth environment
illuminations and apply importance sampling techniques. To supervise indirect
illuminations from unobserved directions, we develop a novel radiance
consistency constraint between implicit neural radiance and path tracing
results of unobserved rays along with the joint optimization of materials and
illuminations, thus significantly improving the decomposition performance.
Extensive experiments demonstrate that our method outperforms the
state-of-the-art on multiple synthetic and real datasets, especially in terms
of inter-reflection decomposition.
Comment: Accepted at CVPR 2023
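A hedged sketch of the radiance-consistency idea described above (the function below is a placeholder, not the paper's API): for rays not observed by any camera, the cached neural radiance is pushed to agree with the radiance obtained by path tracing through the current material estimates:

```python
# Sketch only: consistency term between cached neural radiance and path-traced radiance.
import torch

def radiance_consistency_loss(neural_radiance, path_traced_radiance):
    """Both tensors: (N_rays, 3) RGB radiance for the same batch of unobserved rays."""
    return (neural_radiance - path_traced_radiance).abs().mean()

# Toy usage; in the actual pipeline both terms would come from the radiance cache and
# a differentiable Monte Carlo path tracer, respectively.
cached = torch.rand(1024, 3, requires_grad=True)
traced = torch.rand(1024, 3)
radiance_consistency_loss(cached, traced).backward()
```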
Uncertainty-aware Gait Recognition via Learning from Dirichlet Distribution-based Evidence
Existing gait recognition frameworks retrieve an identity in the gallery
based on the distance between a probe sample and the identities in the gallery.
However, existing methods often neglect that the gallery may not contain
identities corresponding to the probes, leading to recognition errors rather
than raising an alarm. In this paper, we introduce a novel uncertainty-aware
gait recognition method that models the uncertainty of identification based on
learned evidence. Specifically, we treat our recognition model as an evidence
collector to gather evidence from input samples and parameterize a Dirichlet
distribution over the evidence. The Dirichlet distribution essentially
represents the density of the probability assigned to the input samples. We
utilize the distribution to evaluate the resultant uncertainty of each probe
sample and then determine whether a probe has a counterpart in the gallery or
not. To the best of our knowledge, our method is the first attempt to tackle
gait recognition with uncertainty modeling. Moreover, our uncertainty modeling
significantly improves the robustness against out-of-distribution (OOD)
queries. Extensive experiments demonstrate that our method achieves
state-of-the-art performance on datasets with OOD queries, and can also
generalize well to other identity-retrieval tasks. Importantly, our method
outperforms the state-of-the-art by a large margin of 51.26% when the OOD query
rate is around 50% on OUMVLP.
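The Dirichlet-based uncertainty described above follows the standard evidential-learning recipe, sketched below (the evidence values and the rejection threshold are illustrative assumptions): non-negative evidence parameterizes a Dirichlet distribution, and the unassigned mass K / S serves as the uncertainty used to decide whether a probe has a gallery counterpart:

```python
# Sketch: subjective-logic style uncertainty from Dirichlet evidence (threshold is made up).
import numpy as np

def dirichlet_uncertainty(evidence):
    """evidence: (K,) non-negative evidence for K gallery identities."""
    alpha = evidence + 1.0                     # Dirichlet concentration parameters
    strength = alpha.sum()                     # total Dirichlet strength S
    belief = evidence / strength               # per-identity belief masses
    uncertainty = len(evidence) / strength     # u = K / S; belief.sum() + u == 1
    return belief, uncertainty

belief, u = dirichlet_uncertainty(np.array([9.0, 0.5, 0.2]))
raise_alarm = u > 0.5                          # probe likely has no counterpart in the gallery
```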
DyGait: Exploiting Dynamic Representations for High-performance Gait Recognition
Gait recognition is a biometric technology that recognizes the identity of
humans through their walking patterns. Compared with other biometric
technologies, gait recognition is more difficult to disguise and can be applied
at long distances without the cooperation of subjects. Thus, it has unique
potential and wide applicability in crime prevention and social
security. At present, most gait recognition methods directly extract features
from the video frames to establish representations. However, these
architectures treat different features equally and do not pay enough attention
to dynamic features, i.e. the representation of the dynamic parts of
silhouettes over time (e.g. the legs). Since dynamic parts of the
human body are more informative than other parts (e.g. bags) during walking, in
this paper, we propose a novel and high-performance framework named DyGait.
This is the first framework on gait recognition that is designed to focus on
the extraction of dynamic features. Specifically, to take full advantage of the
dynamic information, we propose a Dynamic Augmentation Module (DAM), which can
automatically establish spatial-temporal feature representations of the dynamic
parts of the human body. The experimental results show that our DyGait network
outperforms other state-of-the-art gait recognition methods. It achieves an
average Rank-1 accuracy of 71.4% on the GREW dataset, 66.3% on the Gait3D
dataset, 98.4% on the CASIA-B dataset, and 98.3% on the OU-MVLP dataset.
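One plausible reading of the dynamic-feature idea above, sketched as code (an assumption for illustration, not the paper's exact Dynamic Augmentation Module): subtracting the temporal mean of the silhouette features cancels static regions, so the residual highlights moving parts such as the legs:

```python
# Sketch only: emphasize dynamic regions of spatio-temporal gait features.
import torch

def dynamic_augmentation(feat):
    """feat: (B, C, T, H, W) feature maps of a silhouette sequence."""
    static = feat.mean(dim=2, keepdim=True)    # temporal mean approximates static parts
    dynamic = feat - static                    # residual highlights moving parts
    return feat + dynamic                      # augment original features with dynamics

y = dynamic_augmentation(torch.randn(4, 64, 30, 64, 44))   # toy 30-frame batch
```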
Text-Guided 3D Face Synthesis -- From Generation to Editing
Text-guided 3D face synthesis has achieved remarkable results by leveraging
text-to-image (T2I) diffusion models. However, most existing works focus solely
on direct generation and ignore editing, which prevents them from synthesizing
customized 3D faces through iterative adjustments. In this paper,
we propose a unified text-guided framework from face generation to editing. In
the generation stage, we propose a geometry-texture decoupled generation to
mitigate the loss of geometric details caused by coupling. Besides, decoupling
enables us to utilize the generated geometry as a condition for texture
generation, yielding highly geometry-texture aligned results. We further employ
a fine-tuned texture diffusion model to enhance texture quality in both RGB and
YUV space. In the editing stage, we first employ a pre-trained diffusion model
to update facial geometry or texture based on the texts. To enable sequential
editing, we introduce a UV domain consistency preservation regularization,
preventing unintentional changes to irrelevant facial attributes. Besides, we
propose a self-guided consistency weight strategy to improve editing efficacy
while preserving consistency. Through comprehensive experiments, we showcase
our method's superiority in face synthesis. Project page:
https://faceg2e.github.io/
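A hedged illustration of a UV-domain consistency term like the one described above (the mask source and uniform weighting are assumptions, not the paper's exact regularization): during sequential edits, the updated UV texture is penalized for drifting from the previous one outside the region the current edit targets:

```python
# Sketch only: keep un-edited UV regions consistent across sequential edits.
import torch

def uv_consistency_loss(uv_new, uv_prev, edit_mask):
    """uv_*: (B, 3, H, W) UV textures; edit_mask: (B, 1, H, W), 1 where the edit applies."""
    keep = 1.0 - edit_mask                     # regions that should stay untouched
    return ((uv_new - uv_prev).abs() * keep).mean()

prev = torch.rand(1, 3, 512, 512)
new = prev.clone().requires_grad_(True)
mask = torch.zeros(1, 1, 512, 512)
mask[..., 200:300, 200:300] = 1.0              # hypothetical region affected by the edit
uv_consistency_loss(new, prev, mask).backward()
```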