Representing Volumetric Videos as Dynamic MLP Maps
This paper introduces a novel representation of volumetric videos for
real-time view synthesis of dynamic scenes. Recent advances in neural scene
representations demonstrate their remarkable capability to model and render
complex static scenes, but extending them to represent dynamic scenes is not
straightforward due to their slow rendering speed or high storage cost. To
solve this problem, our key idea is to represent the radiance field of each
frame as a set of shallow MLP networks whose parameters are stored in 2D grids,
called MLP maps, and dynamically predicted by a 2D CNN decoder shared by all
frames. Representing 3D scenes with shallow MLPs significantly improves the
rendering speed, while dynamically predicting MLP parameters with a shared 2D
CNN instead of explicitly storing them leads to low storage cost. Experiments
show that the proposed approach achieves state-of-the-art rendering quality on
the NHR and ZJU-MoCap datasets, while being efficient for real-time rendering
with a speed of 41.7 fps for images on an RTX 3090 GPU. The
code is available at https://zju3dv.github.io/mlp_maps/.
Comment: Accepted to CVPR 2023. The first two authors contributed equally to
this paper. Project page: https://zju3dv.github.io/mlp_maps
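To make the representation concrete, below is a minimal PyTorch sketch of the idea: a 2D CNN decodes a per-frame latent image into a grid of flattened MLP parameters, and each 3D point samples its own shallow MLP from that grid and evaluates it. The latent input, map resolution, projection onto the map plane, and the two-layer MLP layout are illustrative assumptions, not the paper's implementation.

# Minimal sketch of the MLP-maps idea, under assumed shapes and a toy decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN = 8               # width of each shallow per-pixel MLP (assumed)
IN_DIM, OUT_DIM = 3, 4   # xyz in, rgb + density out (assumed)
# parameters of a 2-layer MLP flattened into one vector per map pixel
PARAM_DIM = (IN_DIM * HIDDEN + HIDDEN) + (HIDDEN * OUT_DIM + OUT_DIM)

class MLPMapDecoder(nn.Module):
    """2D CNN that decodes a per-frame latent image into an MLP-parameter map."""
    def __init__(self, latent_ch=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(latent_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, PARAM_DIM, 3, padding=1),
        )

    def forward(self, frame_latent):            # (B, latent_ch, H, W)
        return self.cnn(frame_latent)           # (B, PARAM_DIM, H, W)

def eval_shallow_mlp(param_map, xyz):
    """Sample per-point MLP parameters from the 2D map and run the tiny MLP."""
    # project points onto the map plane (here: simply drop z; assume xy in [-1, 1])
    uv = xyz[..., :2].reshape(1, -1, 1, 2)                    # (1, N, 1, 2)
    p = F.grid_sample(param_map, uv, align_corners=True)      # (1, PARAM_DIM, N, 1)
    p = p[0, :, :, 0].t()                                     # (N, PARAM_DIM)

    # unpack the flattened parameters into the weights/biases of a 2-layer MLP
    i = 0
    w1 = p[:, i:i + IN_DIM * HIDDEN].reshape(-1, HIDDEN, IN_DIM); i += IN_DIM * HIDDEN
    b1 = p[:, i:i + HIDDEN]; i += HIDDEN
    w2 = p[:, i:i + HIDDEN * OUT_DIM].reshape(-1, OUT_DIM, HIDDEN); i += HIDDEN * OUT_DIM
    b2 = p[:, i:i + OUT_DIM]

    h = torch.relu(torch.einsum('nhi,ni->nh', w1, xyz) + b1)
    return torch.einsum('noh,nh->no', w2, h) + b2             # (N, OUT_DIM): rgb + density

decoder = MLPMapDecoder()
param_map = decoder(torch.randn(1, 32, 64, 64))
out = eval_shallow_mlp(param_map, torch.rand(1024, 3) * 2 - 1)
print(out.shape)   # torch.Size([1024, 4])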
Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed
We present a novel method for efficiently producing semi-dense matches across
images. Previous detector-free matcher LoFTR has shown remarkable matching
capability in handling large-viewpoint change and texture-poor scenarios but
suffers from low efficiency. We revisit its design choices and derive multiple
improvements for both efficiency and accuracy. One key observation is that
performing the transformer over the entire feature map is redundant due to
shared local information; therefore, we propose an aggregated attention
mechanism with adaptive token selection for efficiency. Furthermore, we find
spatial variance exists in LoFTR's fine correlation module, which is adverse to
matching accuracy. A novel two-stage correlation layer is proposed to achieve
accurate subpixel correspondences for accuracy improvement. Our
efficiency-optimized model is faster than LoFTR and can even surpass the
state-of-the-art efficient sparse matching pipeline SuperPoint + LightGlue.
Moreover, extensive experiments show that our method can achieve higher
accuracy compared with competitive semi-dense matchers, with considerable
efficiency benefits. This opens up exciting prospects for large-scale or
latency-sensitive applications such as image retrieval and 3D reconstruction.
Project page: https://zju3dv.github.io/efficientloftr
Comment: CVPR 2024; Project page: https://zju3dv.github.io/efficientloftr
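The two-stage fine correlation idea can be illustrated with a short PyTorch sketch: a first stage picks the best integer cell of a local correlation patch, and a second stage takes a softmax expectation over its 3x3 neighbourhood to obtain a subpixel offset. The window size, feature dimension, and normalisation are assumptions for illustration, not the authors' exact module.

# Hedged sketch of correlation-based subpixel refinement (not the paper's code).
import torch
import torch.nn.functional as F

def two_stage_refine(query_feat, fine_feat, win=5):
    """query_feat: (N, C) coarse-match descriptors in image A.
       fine_feat:  (N, C, win, win) fine feature patches around the coarse
                   match locations in image B.
       Returns subpixel offsets relative to the patch centre."""
    N, C, H, W = fine_feat.shape
    corr = torch.einsum('nc,nchw->nhw', query_feat, fine_feat) / C ** 0.5

    # stage 1: pick the best integer cell inside the patch
    best = corr.view(N, -1).argmax(dim=1)
    by, bx = best // W, best % W

    # stage 2: softmax expectation over a 3x3 neighbourhood of the best cell
    pad = F.pad(corr, (1, 1, 1, 1), value=-1e9)
    idx_y = by.view(N, 1, 1) + torch.arange(3).view(1, 3, 1)
    idx_x = bx.view(N, 1, 1) + torch.arange(3).view(1, 1, 3)
    local = pad[torch.arange(N).view(N, 1, 1), idx_y, idx_x]    # (N, 3, 3)
    prob = local.view(N, -1).softmax(dim=1).view(N, 3, 3)
    dy = (prob.sum(dim=2) * torch.arange(-1., 2.)).sum(dim=1)
    dx = (prob.sum(dim=1) * torch.arange(-1., 2.)).sum(dim=1)

    centre = (win - 1) / 2
    return torch.stack([bx + dx - centre, by + dy - centre], dim=1)  # (N, 2)

offsets = two_stage_refine(torch.randn(100, 256), torch.randn(100, 256, 5, 5))
print(offsets.shape)   # torch.Size([100, 2])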
Learning Human Mesh Recovery in 3D Scenes
We present a novel method for recovering the absolute pose and shape of a
human in a pre-scanned scene given a single image. Unlike previous methods that
perform scene-aware mesh optimization, we propose to first estimate absolute
position and dense scene contacts with a sparse 3D CNN, and later enhance a
pretrained human mesh recovery network by cross-attention with the derived 3D
scene cues. Joint learning on images and scene geometry enables our method to
reduce the ambiguity caused by depth and occlusion, resulting in more
reasonable global postures and contacts. Encoding scene-aware cues in the
network also allows the proposed method to be optimization-free, and opens up
the opportunity for real-time applications. The experiments show that the
proposed network is capable of recovering accurate and physically-plausible
meshes by a single forward pass and outperforms state-of-the-art methods in
terms of both accuracy and speed.
Comment: Accepted to CVPR 2023. Project page: https://zju3dv.github.io/sahmr
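A minimal PyTorch sketch of how scene cues could be injected by cross-attention follows. Token counts, feature dimension, and the transformer block layout are hypothetical; this is illustrative, not the paper's network: image tokens from the pretrained recovery backbone attend to scene-cue tokens from the sparse 3D CNN, with a residual connection that preserves the pretrained features.

import torch
import torch.nn as nn

class SceneCueCrossAttention(nn.Module):
    """Image tokens attend to sparse scene-cue tokens (e.g. voxel features
    around predicted contacts), then a small MLP mixes the result back in."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, img_tokens, scene_tokens):
        # img_tokens:   (B, N_img, dim)   from the pretrained recovery backbone
        # scene_tokens: (B, N_scene, dim) from the sparse 3D CNN
        q = self.norm1(img_tokens)
        fused, _ = self.attn(q, scene_tokens, scene_tokens)
        x = img_tokens + fused                  # residual: keep pretrained features
        return x + self.mlp(self.norm2(x))

block = SceneCueCrossAttention()
out = block(torch.randn(2, 196, 256), torch.randn(2, 64, 256))
print(out.shape)   # torch.Size([2, 196, 256])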
Ponder: Point Cloud Pre-training via Neural Rendering
We propose a novel approach to self-supervised learning of point cloud
representations by differentiable neural rendering. Motivated by the fact that
informative point cloud features should be able to encode rich geometry and
appearance cues and render realistic images, we train a point-cloud encoder
within a devised point-based neural renderer by comparing the rendered images
with real images on massive RGB-D data. The learned point-cloud encoder can be
easily integrated into various downstream tasks, including not only high-level
tasks like 3D detection and segmentation, but also low-level tasks like 3D
reconstruction and image synthesis. Extensive experiments on various tasks
demonstrate the superiority of our approach compared to existing pre-training
methods.
Comment: Project page: https://dihuang.me/ponder
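The pre-training signal can be sketched with a toy PyTorch example: a point encoder produces per-point features, a simplistic splatting "renderer" projects them into an image, and a photometric loss against the real RGB frame trains the encoder end to end. The encoder, splatting scheme, and decoder are deliberately simplified assumptions that stand in for the paper's point-based neural renderer.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PointEncoder(nn.Module):
    def __init__(self, feat_dim=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, feat_dim))
    def forward(self, xyz):                      # (B, N, 3)
        return self.mlp(xyz)                     # (B, N, feat_dim)

def splat(xyz, feats, K, res=64):
    """Project points with intrinsics K and average-splat features onto a grid."""
    B, N, C = feats.shape
    uvw = torch.einsum('ij,bnj->bni', K, xyz)    # pinhole projection
    uv = (uvw[..., :2] / uvw[..., 2:3]).round().long().clamp(0, res - 1)
    imgs = []
    for b in range(B):                           # simple scatter-average splat
        idx = uv[b, :, 1] * res + uv[b, :, 0]    # flat pixel index
        acc = torch.zeros(C, res * res).index_add(1, idx, feats[b].t())
        cnt = torch.zeros(1, res * res).index_add(1, idx, torch.ones(1, N))
        imgs.append((acc / cnt.clamp(min=1)).view(C, res, res))
    return torch.stack(imgs)                     # (B, C, res, res)

encoder = PointEncoder()
decoder = nn.Conv2d(32, 3, 3, padding=1)         # feature image -> RGB
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

K = torch.tensor([[60., 0., 32.], [0., 60., 32.], [0., 0., 1.]])
points = torch.rand(2, 2048, 3) + torch.tensor([0., 0., 2.])  # points in front of camera
gt_rgb = torch.rand(2, 3, 64, 64)                # stand-in for the captured RGB frame

pred_rgb = decoder(splat(points, encoder(points), K))
loss = F.l1_loss(pred_rgb, gt_rgb)               # photometric pre-training loss
loss.backward()
opt.step()
print(float(loss))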
Efficient Neural Radiance Fields for Interactive Free-viewpoint Video
This paper aims to tackle the challenge of efficiently producing interactive
free-viewpoint videos. Some recent works equip neural radiance fields with
image encoders, enabling them to generalize across scenes. When processing
dynamic scenes, they can simply treat each video frame as an individual scene
and perform novel view synthesis to generate free-viewpoint videos. However,
their rendering process is slow and cannot support interactive applications. A
major factor is that they sample lots of points in empty space when inferring
radiance fields. We propose a novel scene representation, called ENeRF, for the
fast creation of interactive free-viewpoint videos. Specifically, given
multi-view images at one frame, we first build the cascade cost volume to
predict the coarse geometry of the scene. The coarse geometry allows us to
sample only a few points near the scene surface, thereby significantly improving the
rendering speed. This process is fully differentiable, enabling us to jointly
learn the depth prediction and radiance field networks from RGB images.
Experiments on multiple benchmarks show that our approach exhibits competitive
performance while being at least 60 times faster than previous generalizable
radiance field methods.
Comment: SIGGRAPH Asia 2022; Project page: https://zju3dv.github.io/enerf
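The depth-guided sampling step admits a very small sketch: given a coarse per-ray depth (e.g. from the cascade cost volume), a handful of samples are placed in a narrow interval around it instead of hundreds of uniform samples along the ray. The interval parametrisation and variable names below are assumptions for illustration, not the paper's exact scheme.

import torch
import torch.nn.functional as F

def sample_near_surface(rays_o, rays_d, depth, half_width, n_samples=8):
    """rays_o, rays_d: (R, 3) ray origins/directions.
       depth:          (R,)   coarse per-ray depth from the cost volume.
       half_width:     (R,)   half-size of the sampling interval around depth."""
    steps = torch.linspace(0., 1., n_samples)                      # (S,)
    near = (depth - half_width).unsqueeze(-1)                      # (R, 1)
    far = (depth + half_width).unsqueeze(-1)
    z = near + (far - near) * steps                                # (R, S)
    pts = rays_o[:, None, :] + rays_d[:, None, :] * z[..., None]   # (R, S, 3)
    return pts, z

R = 4096
rays_o = torch.zeros(R, 3)
rays_d = F.normalize(torch.randn(R, 3), dim=-1)
depth = torch.full((R,), 2.0)            # coarse geometry: surface roughly 2 m away
half_width = torch.full((R,), 0.05)      # tight interval, so few samples suffice
pts, z = sample_near_surface(rays_o, rays_d, depth, half_width)
print(pts.shape)   # torch.Size([4096, 8, 3]) instead of hundreds of uniform samples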
Detector-Free Structure from Motion
We propose a new structure-from-motion framework to recover accurate camera
poses and point clouds from unordered images. Traditional SfM systems typically
rely on the successful detection of repeatable keypoints across multiple views
as the first step, which is difficult for texture-poor scenes, and poor
keypoint detection may break down the whole SfM system. We propose a new
detector-free SfM framework to draw benefits from the recent success of
detector-free matchers to avoid the early determination of keypoints, while
solving the multi-view inconsistency issue of detector-free matchers.
Specifically, our framework first reconstructs a coarse SfM model from
quantized detector-free matches. Then, it refines the model by a novel
iterative refinement pipeline, which iterates between an attention-based
multi-view matching module to refine feature tracks and a geometry refinement
module to improve the reconstruction accuracy. Experiments demonstrate that the
proposed framework outperforms existing detector-based SfM systems on common
benchmark datasets. We also collect a texture-poor SfM dataset to demonstrate
the capability of our framework to reconstruct texture-poor scenes. Based on
this framework, we take first place in the Image Matching Challenge 2023.
Comment: Project page: https://zju3dv.github.io/DetectorFreeSfM
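The "quantized detector-free matches" step can be illustrated with a small Python sketch: semi-dense matches land at slightly different sub-pixel positions in each image pair, so snapping them to a coarse grid lets matches from different pairs be merged into multi-view tracks for the coarse SfM model. The grid size, data layout, and the simplified track merging (no handling of conflicting track ids) are assumptions for illustration, not the paper's implementation.

import torch
from collections import defaultdict

def quantize_matches(pair_matches, grid=8):
    """pair_matches: dict {(img_i, img_j): (N, 4) tensor of (xi, yi, xj, yj)}.
       Returns tracks: {track_id: set of quantised keypoints (img, qx, qy)}."""
    keypoint_to_track, tracks = {}, defaultdict(set)
    next_id = 0
    for (i, j), m in pair_matches.items():
        q = (m / grid).round().long() * grid          # snap to the coarse grid
        for xi, yi, xj, yj in q.tolist():
            ka, kb = (i, xi, yi), (j, xj, yj)
            tid = keypoint_to_track.get(ka, keypoint_to_track.get(kb))
            if tid is None:
                tid, next_id = next_id, next_id + 1
            keypoint_to_track[ka] = keypoint_to_track[kb] = tid
            tracks[tid].update([ka, kb])
    return tracks

matches = {
    (0, 1): torch.tensor([[101.3, 52.8, 210.1, 60.2]]),
    (1, 2): torch.tensor([[211.7, 60.9, 305.4, 71.0]]),   # same physical point
}
tracks = quantize_matches(matches)
print(tracks)   # the two pairwise matches merge into one 3-view track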
Dyn-E: Local Appearance Editing of Dynamic Neural Radiance Fields
Recently, the editing of neural radiance fields (NeRFs) has gained
considerable attention, but most prior works focus on static scenes while
research on the appearance editing of dynamic scenes is relatively lacking. In
this paper, we propose a novel framework to edit the local appearance of
dynamic NeRFs by manipulating pixels in a single frame of training video.
Specifically, to locally edit the appearance of dynamic NeRFs while preserving
unedited regions, we introduce a local surface representation of the edited
region, which can be inserted into and rendered along with the original NeRF
and warped to arbitrary other frames through a learned invertible motion
representation network. By employing our method, users without professional
expertise can easily add desired content to the appearance of a dynamic scene.
We extensively evaluate our approach on various scenes and show that our
approach achieves spatially and temporally consistent editing results. Notably,
our approach is versatile and applicable to different variants of dynamic NeRF
representations.
Comment: Project page: https://dyn-e.github.io
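One way to realise a learned invertible motion representation is an additive coupling layer, sketched below in PyTorch: it warps a 3D point conditioned on time and can be inverted exactly, so content edited at one frame can be mapped into a shared space and re-posed at any other frame. The layer sizes and the time conditioning are assumptions for illustration, not the paper's architecture.

import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """Split xyz into (x) and (y, z); shift the second part by an MLP of the
    first part and the frame time t. Exactly invertible by subtracting."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1 + 1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))

    def forward(self, p, t):                       # p: (N, 3), t: (N, 1)
        a, b = p[:, :1], p[:, 1:]
        return torch.cat([a, b + self.net(torch.cat([a, t], dim=1))], dim=1)

    def inverse(self, p, t):
        a, b = p[:, :1], p[:, 1:]
        return torch.cat([a, b - self.net(torch.cat([a, t], dim=1))], dim=1)

warp = AdditiveCoupling()
pts = torch.randn(5, 3)                            # surface points edited at frame t0
t0, t1 = torch.zeros(5, 1), torch.full((5, 1), 0.4)
canonical = warp.inverse(pts, t0)                  # map the edit into a shared space
warped = warp(canonical, t1)                       # re-pose it at another frame
print(torch.allclose(warp.inverse(warped, t1), canonical, atol=1e-5))  # True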