Knowledge NeRF: Few-shot Novel View Synthesis for Dynamic Articulated Objects
We present Knowledge NeRF to synthesize novel views for dynamic scenes.
Reconstructing dynamic 3D scenes from a few sparse views and rendering them
from arbitrary perspectives is a challenging problem with applications in
various domains. Previous dynamic NeRF methods learn the deformation of
articulated objects from monocular videos, but the quality of their
reconstructed scenes is limited. To reconstruct dynamic scenes faithfully, we
propose a new framework that considers two frames at a time. We first pretrain
a NeRF model for an articulated object. When the articulated object moves,
Knowledge NeRF learns to generate novel views at the new state by combining
the past knowledge captured in the pretrained NeRF model with minimal
observations of the present state. We propose a projection module that adapts
NeRF to dynamic scenes by learning the correspondence between the pretrained
knowledge base and the current state. Experimental results demonstrate the
effectiveness of our method in reconstructing dynamic 3D scenes from only 5
input images of one state. Knowledge NeRF is a new pipeline and a promising
solution for novel view synthesis of dynamic articulated objects. The data and
implementation are publicly available at
https://github.com/RussRobin/Knowledge_NeRF
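The abstract above outlines the core mechanism: a NeRF pretrained on the
object's initial state is kept frozen, and a light projection module learns to
map query points from the new articulation state back into the pretrained
model's coordinate frame. The PyTorch sketch below illustrates that idea only;
the network sizes, the residual-warp formulation, and the
`pretrained_nerf(points, dirs) -> (sigma, rgb)` interface are assumptions for
illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ProjectionModule(nn.Module):
    """Illustrative sketch: warps 3D query points sampled at the new
    articulation state back into the frame of a frozen, pretrained NeRF."""

    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 3),  # per-point offset into the pretrained frame
        )

    def forward(self, x_new: torch.Tensor) -> torch.Tensor:
        # Residual warp: a point in the new state plus a learned correction.
        return x_new + self.mlp(x_new)

def render_at_new_state(pretrained_nerf, projection, points_new, view_dirs):
    """Query the frozen pretrained NeRF through the learned projection.
    `pretrained_nerf(points, dirs) -> (sigma, rgb)` is an assumed interface."""
    points_old = projection(points_new)  # map queries back to the pretrained state
    sigma, rgb = pretrained_nerf(points_old, view_dirs)
    return sigma, rgb
```

In such a setup only the projection module's parameters would be optimized
against the handful of present-state images, with the pretrained NeRF kept
frozen, which is what would let the method get by with as few as 5 new
observations.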
CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention
While features at different scales are perceptually important for visual
inputs, existing vision transformers do not yet exploit them explicitly. To
this end, we first propose a cross-scale vision transformer,
CrossFormer. It introduces a cross-scale embedding layer (CEL) and a long-short
distance attention (LSDA). On the one hand, CEL blends each token with multiple
patches of different scales, providing the self-attention module itself with
cross-scale features. On the other hand, LSDA splits the self-attention module
into a short-distance one and a long-distance counterpart, which not only
reduces the computational burden but also keeps both small-scale and
large-scale features in the tokens. Moreover, through experiments on
CrossFormer, we observe two further issues that affect vision transformers'
performance, i.e., enlarging self-attention maps and amplitude explosion.
Thus, we further propose a progressive group size (PGS) paradigm and an
amplitude cooling layer (ACL) to alleviate the two issues, respectively. The
CrossFormer incorporating PGS and ACL is called CrossFormer++. Extensive
experiments show that CrossFormer++ outperforms other vision transformers
on image classification, object detection, instance segmentation, and semantic
segmentation tasks. The code will be available at:
https://github.com/cheerss/CrossFormer
Comment: 16 pages, 7 figures
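As described, CEL blends every embedding with patches of several scales before
self-attention. One plausible realization, sketched below in PyTorch, uses
parallel strided convolutions with different kernel sizes whose outputs are
concatenated channel-wise; the particular kernel sizes and channel split here
are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CrossScaleEmbedding(nn.Module):
    """Sketch of a cross-scale embedding layer (CEL): each output token mixes
    patches of several sizes sampled around the same spatial location."""

    def __init__(self, in_ch=3, dims=(48, 24, 12, 12), kernels=(4, 8, 16, 32), stride=4):
        super().__init__()
        # Larger kernels cover larger patches; a shared stride keeps the branch
        # outputs spatially aligned so they can be concatenated per token.
        self.projs = nn.ModuleList([
            nn.Conv2d(in_ch, d, kernel_size=k, stride=stride, padding=(k - stride) // 2)
            for d, k in zip(dims, kernels)
        ])

    def forward(self, x):                        # x: (B, C, H, W)
        feats = [proj(x) for proj in self.projs]
        return torch.cat(feats, dim=1)           # (B, sum(dims), H/stride, W/stride)
```

For a 224x224 input, `CrossScaleEmbedding()(torch.randn(1, 3, 224, 224))` would
produce a (1, 96, 56, 56) feature map, i.e., one 96-dimensional cross-scale
token per 4x4 location, ready to be fed to the transformer stages.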
SparseDC: Depth Completion from sparse and non-uniform inputs
We propose SparseDC, a model for depth completion from sparse and non-uniform
depth inputs. Unlike previous methods that focus on completing depth maps with
fixed distributions on benchmark datasets (e.g., NYU with 500 points, KITTI
with 64 lines), SparseDC is specifically designed to handle the poor-quality
depth maps encountered in real-world usage. The key contributions of SparseDC
are two-fold. First,
we design a simple strategy, called SFFM, to improve the robustness under
sparse input by explicitly filling the unstable depth features with stable
image features. Second, we propose a two-branch feature embedder to predict
both the precise local geometry of regions with available depth values and
accurate structures in regions with no depth. At the core of the embedder is
an uncertainty-based fusion module, called UFFM, which balances the local and
long-range information extracted by CNNs and ViTs. Extensive indoor and outdoor
experiments demonstrate the robustness of our framework when facing sparse and
non-uniform input depths. The pre-trained model and code are available at
https://github.com/WHU-USI3DV/SparseDC
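The uncertainty-based fusion module (UFFM) is described as balancing the local
features from the CNN branch against the long-range features from the ViT
branch. A minimal sketch of one such fusion is given below, assuming the two
branches produce spatially aligned feature maps of equal width; the gating
network and its exact shape are illustrative assumptions, not the released
implementation.

```python
import torch
import torch.nn as nn

class UncertaintyFusion(nn.Module):
    """Sketch of uncertainty-guided fusion: a per-pixel gate decides how much
    to trust the local CNN branch versus the long-range ViT branch."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),                        # 1 -> trust CNN, 0 -> trust ViT
        )

    def forward(self, cnn_feat: torch.Tensor, vit_feat: torch.Tensor) -> torch.Tensor:
        # Both feature maps are assumed to share shape (B, C, H, W).
        w = self.gate(torch.cat([cnn_feat, vit_feat], dim=1))
        return w * cnn_feat + (1.0 - w) * vit_feat
```

A per-pixel sigmoid gate of this kind lets the network lean on the CNN branch
where nearby depth measurements make local geometry reliable, and fall back to
the ViT branch where structure must be inferred from long-range context.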
- …