DT-NeRF: Decomposed Triplane-Hash Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis
In this paper, we present the decomposed triplane-hash neural radiance fields
(DT-NeRF), a framework that significantly improves the photorealistic rendering
of talking faces and achieves state-of-the-art results on key evaluation
datasets. Our architecture decomposes the facial region into two specialized
triplanes: one for the mouth and the other for the
broader facial features. We introduce audio features as residual terms and
integrate them as query vectors into our model through an audio-mouth-face
transformer. Additionally, our method leverages the capabilities of Neural
Radiance Fields (NeRF) to enrich the volumetric representation of the entire
face through additive volumetric rendering techniques. Comprehensive
experimental evaluations corroborate the effectiveness and superiority of our
proposed approach.
Comment: 5 pages, 5 figures. Submitted to ICASSP 202
Deep Planar Parallax for Monocular Depth Estimation
Recent research has highlighted the utility of Planar Parallax Geometry in
monocular depth estimation. However, its potential has yet to be fully realized
because networks rely heavily on appearance for depth prediction. Our in-depth
analysis reveals that utilizing flow-pretrain can optimize the network's usage
of consecutive frame modeling, leading to substantial performance enhancement.
Additionally, we propose Planar Position Embedding (PPE) to handle dynamic
objects that defy static scene assumptions and to tackle slope variations that
are challenging to differentiate. Comprehensive experiments on autonomous
driving datasets, namely KITTI and the Waymo Open Dataset (WOD), prove that our
Planar Parallax Network (PPNet) significantly surpasses existing learning-based
methods in performance.
Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition
The Transformer architecture model, based on self-attention and multi-head
attention, has achieved remarkable success in offline end-to-end Automatic
Speech Recognition (ASR). However, self-attention and multi-head attention
cannot be easily applied for streaming or online ASR. For self-attention in
Transformer ASR, the softmax normalization function-based attention mechanism
makes it impossible to highlight important speech information. For multi-head
attention in Transformer ASR, it is not easy to model monotonic alignments in
different heads. To overcome these two limits, we integrate sparse attention
and monotonic attention into Transformer-based ASR. The sparse mechanism
introduces a learned sparsity scheme to enable each self-attention structure to
fit the corresponding head better. The monotonic attention deploys
regularization to prune redundant heads for the multi-head attention structure.
The experiments show that our method can effectively improve the attention
mechanism on widely used benchmarks of speech recognition.
Comment: Accepted to DSAA 202
One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer
Whole-body mesh recovery aims to estimate the 3D human body, face, and hands
parameters from a single image. It is challenging to perform this task with a
single network due to resolution issues, i.e., the face and hands are usually
located in extremely small regions. Existing works usually detect hands and
faces, enlarge their resolution to feed in a specific network to predict the
parameter, and finally fuse the results. While this copy-paste pipeline can
capture the fine-grained details of the face and hands, the connections between
different parts cannot be easily recovered in late fusion, leading to
implausible 3D rotation and unnatural pose. In this work, we propose a
one-stage pipeline for expressive whole-body mesh recovery, named OSX, without
separate networks for each part. Specifically, we design a Component Aware
Transformer (CAT) composed of a global body encoder and a local face/hand
decoder. The encoder predicts the body parameters and provides a high-quality
feature map for the decoder, which performs a feature-level upsample-crop
scheme to extract high-resolution part-specific features and adopts
keypoint-guided deformable attention to estimate the hands and face precisely. The
whole pipeline is simple yet effective without any manual post-processing and
naturally avoids implausible prediction. Comprehensive experiments demonstrate
the effectiveness of OSX. Lastly, we build a large-scale Upper-Body dataset
(UBody) with high-quality 2D and 3D whole-body annotations. It contains persons
with partially visible bodies in diverse real-life scenarios to bridge the gap
between the basic task and downstream applications.
Comment: Accepted to CVPR 2023; Top-1 on AGORA SMPLX benchmark; Project Page:
https://osx-ubody.github.io