Representing Volumetric Videos as Dynamic MLP Maps
This paper introduces a novel representation of volumetric videos for
real-time view synthesis of dynamic scenes. Recent advances in neural scene
representations demonstrate their remarkable capability to model and render
complex static scenes, but extending them to represent dynamic scenes is not
straightforward due to their slow rendering speed or high storage cost. To
solve this problem, our key idea is to represent the radiance field of each
frame as a set of shallow MLP networks whose parameters are stored in 2D grids,
called MLP maps, and dynamically predicted by a 2D CNN decoder shared by all
frames. Representing 3D scenes with shallow MLPs significantly improves the
rendering speed, while dynamically predicting MLP parameters with a shared 2D
CNN instead of explicitly storing them leads to low storage cost. Experiments
show that the proposed approach achieves state-of-the-art rendering quality on
the NHR and ZJU-MoCap datasets, while being efficient for real-time rendering
with a speed of 41.7 fps for images on an RTX 3090 GPU. The
code is available at https://zju3dv.github.io/mlp_maps/.
Comment: Accepted to CVPR 2023. The first two authors contributed equally to
this paper. Project page: https://zju3dv.github.io/mlp_maps
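To make the representation concrete, below is a minimal PyTorch sketch of the idea: a 2D CNN decodes a per-frame latent image into a grid of flattened MLP parameters, and each 3D point samples its own shallow MLP from that grid and evaluates it. The latent input, map resolution, projection onto the map plane, and the two-layer MLP layout are illustrative assumptions, not the paper's implementation.

# Minimal sketch of the MLP-maps idea, under assumed shapes and a toy decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN = 8               # width of each shallow per-pixel MLP (assumed)
IN_DIM, OUT_DIM = 3, 4   # xyz in, rgb + density out (assumed)
# parameters of a 2-layer MLP flattened into one vector per map pixel
PARAM_DIM = (IN_DIM * HIDDEN + HIDDEN) + (HIDDEN * OUT_DIM + OUT_DIM)

class MLPMapDecoder(nn.Module):
    """2D CNN that decodes a per-frame latent image into an MLP-parameter map."""
    def __init__(self, latent_ch=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(latent_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, PARAM_DIM, 3, padding=1),
        )

    def forward(self, frame_latent):            # (B, latent_ch, H, W)
        return self.cnn(frame_latent)           # (B, PARAM_DIM, H, W)

def eval_shallow_mlp(param_map, xyz):
    """Sample per-point MLP parameters from the 2D map and run the tiny MLP."""
    # project points onto the map plane (here: simply drop z; assume xy in [-1, 1])
    uv = xyz[..., :2].reshape(1, -1, 1, 2)                    # (1, N, 1, 2)
    p = F.grid_sample(param_map, uv, align_corners=True)      # (1, PARAM_DIM, N, 1)
    p = p[0, :, :, 0].t()                                     # (N, PARAM_DIM)

    # unpack the flattened parameters into the weights/biases of a 2-layer MLP
    i = 0
    w1 = p[:, i:i + IN_DIM * HIDDEN].reshape(-1, HIDDEN, IN_DIM); i += IN_DIM * HIDDEN
    b1 = p[:, i:i + HIDDEN]; i += HIDDEN
    w2 = p[:, i:i + HIDDEN * OUT_DIM].reshape(-1, OUT_DIM, HIDDEN); i += HIDDEN * OUT_DIM
    b2 = p[:, i:i + OUT_DIM]

    h = torch.relu(torch.einsum('nhi,ni->nh', w1, xyz) + b1)
    return torch.einsum('noh,nh->no', w2, h) + b2             # (N, OUT_DIM): rgb + density

decoder = MLPMapDecoder()
param_map = decoder(torch.randn(1, 32, 64, 64))
out = eval_shallow_mlp(param_map, torch.rand(1024, 3) * 2 - 1)
print(out.shape)   # torch.Size([1024, 4])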
Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed
We present a novel method for efficiently producing semi-dense matches across
images. Previous detector-free matcher LoFTR has shown remarkable matching
capability in handling large-viewpoint change and texture-poor scenarios but
suffers from low efficiency. We revisit its design choices and derive multiple
improvements for both efficiency and accuracy. One key observation is that
performing the transformer over the entire feature map is redundant due to
shared local information; therefore, we propose an aggregated attention
mechanism with adaptive token selection for efficiency. Furthermore, we find
spatial variance exists in LoFTR's fine correlation module, which is adverse to
matching accuracy. A novel two-stage correlation layer is proposed to achieve
accurate subpixel correspondences for accuracy improvement. Our
efficiency-optimized model is faster than LoFTR and can even surpass the
state-of-the-art efficient sparse matching pipeline SuperPoint + LightGlue.
Moreover, extensive experiments show that our method can achieve higher
accuracy compared with competitive semi-dense matchers, with considerable
efficiency benefits. This opens up exciting prospects for large-scale or
latency-sensitive applications such as image retrieval and 3D reconstruction.
Project page: https://zju3dv.github.io/efficientloftr
Comment: CVPR 2024; Project page: https://zju3dv.github.io/efficientloftr
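The two-stage fine correlation idea can be illustrated with a short PyTorch sketch: a first stage picks the best integer cell of a local correlation patch, and a second stage takes a softmax expectation over its 3x3 neighbourhood to obtain a subpixel offset. The window size, feature dimension, and normalisation are assumptions for illustration, not the authors' exact module.

# Hedged sketch of correlation-based subpixel refinement (not the paper's code).
import torch
import torch.nn.functional as F

def two_stage_refine(query_feat, fine_feat, win=5):
    """query_feat: (N, C) coarse-match descriptors in image A.
       fine_feat:  (N, C, win, win) fine feature patches around the coarse
                   match locations in image B.
       Returns subpixel offsets relative to the patch centre."""
    N, C, H, W = fine_feat.shape
    corr = torch.einsum('nc,nchw->nhw', query_feat, fine_feat) / C ** 0.5

    # stage 1: pick the best integer cell inside the patch
    best = corr.view(N, -1).argmax(dim=1)
    by, bx = best // W, best % W

    # stage 2: softmax expectation over a 3x3 neighbourhood of the best cell
    pad = F.pad(corr, (1, 1, 1, 1), value=-1e9)
    idx_y = by.view(N, 1, 1) + torch.arange(3).view(1, 3, 1)
    idx_x = bx.view(N, 1, 1) + torch.arange(3).view(1, 1, 3)
    local = pad[torch.arange(N).view(N, 1, 1), idx_y, idx_x]    # (N, 3, 3)
    prob = local.view(N, -1).softmax(dim=1).view(N, 3, 3)
    dy = (prob.sum(dim=2) * torch.arange(-1., 2.)).sum(dim=1)
    dx = (prob.sum(dim=1) * torch.arange(-1., 2.)).sum(dim=1)

    centre = (win - 1) / 2
    return torch.stack([bx + dx - centre, by + dy - centre], dim=1)  # (N, 2)

offsets = two_stage_refine(torch.randn(100, 256), torch.randn(100, 256, 5, 5))
print(offsets.shape)   # torch.Size([100, 2])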
Learning Human Mesh Recovery in 3D Scenes
We present a novel method for recovering the absolute pose and shape of a
human in a pre-scanned scene given a single image. Unlike previous methods that
perform scene-aware mesh optimization, we propose to first estimate absolute
position and dense scene contacts with a sparse 3D CNN, and later enhance a
pretrained human mesh recovery network by cross-attention with the derived 3D
scene cues. Joint learning on images and scene geometry enables our method to
reduce the ambiguity caused by depth and occlusion, resulting in more
reasonable global postures and contacts. Encoding scene-aware cues in the
network also allows the proposed method to be optimization-free, and opens up
the opportunity for real-time applications. The experiments show that the
proposed network is capable of recovering accurate and physically-plausible
meshes by a single forward pass and outperforms state-of-the-art methods in
terms of both accuracy and speed.
Comment: Accepted to CVPR 2023. Project page: https://zju3dv.github.io/sahmr
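A minimal PyTorch sketch of how scene cues could be injected by cross-attention follows. Token counts, feature dimension, and the transformer block layout are hypothetical; this is illustrative, not the paper's network: image tokens from the pretrained recovery backbone attend to scene-cue tokens from the sparse 3D CNN, with a residual connection that preserves the pretrained features.

import torch
import torch.nn as nn

class SceneCueCrossAttention(nn.Module):
    """Image tokens attend to sparse scene-cue tokens (e.g. voxel features
    around predicted contacts), then a small MLP mixes the result back in."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, img_tokens, scene_tokens):
        # img_tokens:   (B, N_img, dim)   from the pretrained recovery backbone
        # scene_tokens: (B, N_scene, dim) from the sparse 3D CNN
        q = self.norm1(img_tokens)
        fused, _ = self.attn(q, scene_tokens, scene_tokens)
        x = img_tokens + fused                  # residual: keep pretrained features
        return x + self.mlp(self.norm2(x))

block = SceneCueCrossAttention()
out = block(torch.randn(2, 196, 256), torch.randn(2, 64, 256))
print(out.shape)   # torch.Size([2, 196, 256])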
Ponder: Point Cloud Pre-training via Neural Rendering
We propose a novel approach to self-supervised learning of point cloud
representations by differentiable neural rendering. Motivated by the fact that
informative point cloud features should be able to encode rich geometry and
appearance cues and render realistic images, we train a point-cloud encoder
within a devised point-based neural renderer by comparing the rendered images
with real images on massive RGB-D data. The learned point-cloud encoder can be
easily integrated into various downstream tasks, including not only high-level
tasks like 3D detection and segmentation, but also low-level tasks like 3D
reconstruction and image synthesis. Extensive experiments on various tasks
demonstrate the superiority of our approach compared to existing pre-training
methods.
Comment: Project page: https://dihuang.me/ponder
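The pre-training signal can be sketched with a toy PyTorch example: a point encoder produces per-point features, a simplistic splatting "renderer" projects them into an image, and a photometric loss against the real RGB frame trains the encoder end to end. The encoder, splatting scheme, and decoder are deliberately simplified assumptions that stand in for the paper's point-based neural renderer.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PointEncoder(nn.Module):
    def __init__(self, feat_dim=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, feat_dim))
    def forward(self, xyz):                      # (B, N, 3)
        return self.mlp(xyz)                     # (B, N, feat_dim)

def splat(xyz, feats, K, res=64):
    """Project points with intrinsics K and average-splat features onto a grid."""
    B, N, C = feats.shape
    uvw = torch.einsum('ij,bnj->bni', K, xyz)    # pinhole projection
    uv = (uvw[..., :2] / uvw[..., 2:3]).round().long().clamp(0, res - 1)
    imgs = []
    for b in range(B):                           # simple scatter-average splat
        idx = uv[b, :, 1] * res + uv[b, :, 0]    # flat pixel index
        acc = torch.zeros(C, res * res).index_add(1, idx, feats[b].t())
        cnt = torch.zeros(1, res * res).index_add(1, idx, torch.ones(1, N))
        imgs.append((acc / cnt.clamp(min=1)).view(C, res, res))
    return torch.stack(imgs)                     # (B, C, res, res)

encoder = PointEncoder()
decoder = nn.Conv2d(32, 3, 3, padding=1)         # feature image -> RGB
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

K = torch.tensor([[60., 0., 32.], [0., 60., 32.], [0., 0., 1.]])
points = torch.rand(2, 2048, 3) + torch.tensor([0., 0., 2.])  # points in front of camera
gt_rgb = torch.rand(2, 3, 64, 64)                # stand-in for the captured RGB frame

pred_rgb = decoder(splat(points, encoder(points), K))
loss = F.l1_loss(pred_rgb, gt_rgb)               # photometric pre-training loss
loss.backward()
opt.step()
print(float(loss))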
Efficient Neural Radiance Fields for Interactive Free-viewpoint Video
This paper aims to tackle the challenge of efficiently producing interactive
free-viewpoint videos. Some recent works equip neural radiance fields with
image encoders, enabling them to generalize across scenes. When processing
dynamic scenes, they can simply treat each video frame as an individual scene
and perform novel view synthesis to generate free-viewpoint videos. However,
their rendering process is slow and cannot support interactive applications. A
major factor is that they sample lots of points in empty space when inferring
radiance fields. We propose a novel scene representation, called ENeRF, for the
fast creation of interactive free-viewpoint videos. Specifically, given
multi-view images at one frame, we first build the cascade cost volume to
predict the coarse geometry of the scene. The coarse geometry allows us to
sample only a few points near the scene surface, thereby significantly improving the
rendering speed. This process is fully differentiable, enabling us to jointly
learn the depth prediction and radiance field networks from RGB images.
Experiments on multiple benchmarks show that our approach exhibits competitive
performance while being at least 60 times faster than previous generalizable
radiance field methods.
Comment: SIGGRAPH Asia 2022; Project page: https://zju3dv.github.io/enerf
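The depth-guided sampling step admits a very small sketch: given a coarse per-ray depth (e.g. from the cascade cost volume), a handful of samples are placed in a narrow interval around it instead of hundreds of uniform samples along the ray. The interval parametrisation and variable names below are assumptions for illustration, not the paper's exact scheme.

import torch
import torch.nn.functional as F

def sample_near_surface(rays_o, rays_d, depth, half_width, n_samples=8):
    """rays_o, rays_d: (R, 3) ray origins/directions.
       depth:          (R,)   coarse per-ray depth from the cost volume.
       half_width:     (R,)   half-size of the sampling interval around depth."""
    steps = torch.linspace(0., 1., n_samples)                      # (S,)
    near = (depth - half_width).unsqueeze(-1)                      # (R, 1)
    far = (depth + half_width).unsqueeze(-1)
    z = near + (far - near) * steps                                # (R, S)
    pts = rays_o[:, None, :] + rays_d[:, None, :] * z[..., None]   # (R, S, 3)
    return pts, z

R = 4096
rays_o = torch.zeros(R, 3)
rays_d = F.normalize(torch.randn(R, 3), dim=-1)
depth = torch.full((R,), 2.0)            # coarse geometry: surface roughly 2 m away
half_width = torch.full((R,), 0.05)      # tight interval, so few samples suffice
pts, z = sample_near_surface(rays_o, rays_d, depth, half_width)
print(pts.shape)   # torch.Size([4096, 8, 3]) instead of hundreds of uniform samples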
Detector-Free Structure from Motion
We propose a new structure-from-motion framework to recover accurate camera
poses and point clouds from unordered images. Traditional SfM systems typically
rely on the successful detection of repeatable keypoints across multiple views
as the first step, which is difficult for texture-poor scenes, and poor
keypoint detection may break down the whole SfM system. We propose a new
detector-free SfM framework to draw benefits from the recent success of
detector-free matchers to avoid the early determination of keypoints, while
solving the multi-view inconsistency issue of detector-free matchers.
Specifically, our framework first reconstructs a coarse SfM model from
quantized detector-free matches. Then, it refines the model by a novel
iterative refinement pipeline, which iterates between an attention-based
multi-view matching module to refine feature tracks and a geometry refinement
module to improve the reconstruction accuracy. Experiments demonstrate that the
proposed framework outperforms existing detector-based SfM systems on common
benchmark datasets. We also collect a texture-poor SfM dataset to demonstrate
the capability of our framework to reconstruct texture-poor scenes. Based on
this framework, we take first place in the Image Matching Challenge 2023.
Comment: Project page: https://zju3dv.github.io/DetectorFreeSfM
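The "quantized detector-free matches" step can be illustrated with a small Python sketch: semi-dense matches land at slightly different sub-pixel positions in each image pair, so snapping them to a coarse grid lets matches from different pairs be merged into multi-view tracks for the coarse SfM model. The grid size, data layout, and the simplified track merging (no handling of conflicting track ids) are assumptions for illustration, not the paper's implementation.

import torch
from collections import defaultdict

def quantize_matches(pair_matches, grid=8):
    """pair_matches: dict {(img_i, img_j): (N, 4) tensor of (xi, yi, xj, yj)}.
       Returns tracks: {track_id: set of quantised keypoints (img, qx, qy)}."""
    keypoint_to_track, tracks = {}, defaultdict(set)
    next_id = 0
    for (i, j), m in pair_matches.items():
        q = (m / grid).round().long() * grid          # snap to the coarse grid
        for xi, yi, xj, yj in q.tolist():
            ka, kb = (i, xi, yi), (j, xj, yj)
            tid = keypoint_to_track.get(ka, keypoint_to_track.get(kb))
            if tid is None:
                tid, next_id = next_id, next_id + 1
            keypoint_to_track[ka] = keypoint_to_track[kb] = tid
            tracks[tid].update([ka, kb])
    return tracks

matches = {
    (0, 1): torch.tensor([[101.3, 52.8, 210.1, 60.2]]),
    (1, 2): torch.tensor([[211.7, 60.9, 305.4, 71.0]]),   # same physical point
}
tracks = quantize_matches(matches)
print(tracks)   # the two pairwise matches merge into one 3-view track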
Dyn-E: Local Appearance Editing of Dynamic Neural Radiance Fields
Recently, the editing of neural radiance fields (NeRFs) has gained
considerable attention, but most prior works focus on static scenes while
research on the appearance editing of dynamic scenes is relatively lacking. In
this paper, we propose a novel framework to edit the local appearance of
dynamic NeRFs by manipulating pixels in a single frame of training video.
Specifically, to locally edit the appearance of dynamic NeRFs while preserving
unedited regions, we introduce a local surface representation of the edited
region, which can be inserted into and rendered along with the original NeRF
and warped to arbitrary other frames through a learned invertible motion
representation network. By employing our method, users without professional
expertise can easily add desired content to the appearance of a dynamic scene.
We extensively evaluate our approach on various scenes and show that our
approach achieves spatially and temporally consistent editing results. Notably,
our approach is versatile and applicable to different variants of dynamic NeRF
representations.
Comment: Project page: https://dyn-e.github.io
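One way to realise a learned invertible motion representation is an additive coupling layer, sketched below in PyTorch: it warps a 3D point conditioned on time and can be inverted exactly, so content edited at one frame can be mapped into a shared space and re-posed at any other frame. The layer sizes and the time conditioning are assumptions for illustration, not the paper's architecture.

import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """Split xyz into (x) and (y, z); shift the second part by an MLP of the
    first part and the frame time t. Exactly invertible by subtracting."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1 + 1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))

    def forward(self, p, t):                       # p: (N, 3), t: (N, 1)
        a, b = p[:, :1], p[:, 1:]
        return torch.cat([a, b + self.net(torch.cat([a, t], dim=1))], dim=1)

    def inverse(self, p, t):
        a, b = p[:, :1], p[:, 1:]
        return torch.cat([a, b - self.net(torch.cat([a, t], dim=1))], dim=1)

warp = AdditiveCoupling()
pts = torch.randn(5, 3)                            # surface points edited at frame t0
t0, t1 = torch.zeros(5, 1), torch.full((5, 1), 0.4)
canonical = warp.inverse(pts, t0)                  # map the edit into a shared space
warped = warp(canonical, t1)                       # re-pose it at another frame
print(torch.allclose(warp.inverse(warped, t1), canonical, atol=1e-5))  # True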