We propose VisFusion, a visibility-aware online 3D scene reconstruction
approach from posed monocular videos. In particular, we aim to reconstruct the
scene from volumetric features. Unlike previous reconstruction methods, which
aggregate features for each voxel from the input views without considering
visibility, we improve feature fusion by explicitly inferring each voxel's
visibility from a similarity matrix computed from its projected features in
each image pair.
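
To make this fusion step concrete, below is a minimal PyTorch sketch of
visibility-weighted feature fusion. The paper learns visibility from the
pairwise similarity matrix; here the weights are a hand-crafted softmax over
mean row similarities used as an illustrative stand-in, and the function name
fuse_with_visibility and all shapes are assumptions, not the released
implementation.

```python
import torch
import torch.nn.functional as F

def fuse_with_visibility(view_feats: torch.Tensor) -> torch.Tensor:
    """Fuse per-view voxel features using similarity-based visibility weights.

    view_feats: (N, V, C) features of N voxels projected into V views.
    Returns fused features of shape (N, C).
    """
    # Pairwise cosine similarity between the V projected features of each voxel.
    normed = F.normalize(view_feats, dim=-1)            # (N, V, C)
    sim = torch.bmm(normed, normed.transpose(1, 2))     # (N, V, V)

    # A view that agrees with many others is more likely to actually see the
    # voxel. The paper predicts visibility from this matrix with a learned
    # module; the softmax over mean row similarity here is only a stand-in.
    vis = torch.softmax(sim.mean(dim=-1), dim=-1)       # (N, V)

    # Visibility-weighted average instead of a plain mean over views.
    return (vis.unsqueeze(-1) * view_feats).sum(dim=1)  # (N, C)

# Example: 1000 voxels, 9 source views, 32-dim features.
fused = fuse_with_visibility(torch.randn(1000, 9, 32))
```
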
Following previous works, our model is a coarse-to-fine pipeline that includes
a volume sparsification process. Unlike those works, which sparsify voxels
globally with a fixed occupancy threshold, we perform the sparsification on a
local feature volume along each visual ray, preserving at least one voxel per
ray to retain finer details. The sparse local volume is then fused with a
global one for online reconstruction.
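
The following sketch illustrates the ray-wise sparsification idea under stated
assumptions: occupancy scores are arranged as (rays x depth samples), a fixed
threshold mimics the global scheme, and the argmax voxel on each ray is
force-kept so no ray is emptied. The helper sparsify_per_ray is hypothetical
and simplifies the paper's ranking-based selection along each ray.

```python
import torch

def sparsify_per_ray(occ: torch.Tensor, thresh: float = 0.5) -> torch.Tensor:
    """Ray-wise sparsification: keep confident voxels, but never drop a whole ray.

    occ: (R, D) predicted occupancy scores for R visual rays, D voxels per ray.
    Returns a boolean keep-mask of shape (R, D).
    """
    keep = occ > thresh                       # global-style thresholding
    best = occ.argmax(dim=1)                  # most occupied voxel on each ray
    # Force-keep the best voxel per ray so thin structures survive even when
    # every score on the ray falls below the fixed threshold.
    keep[torch.arange(occ.shape[0]), best] = True
    return keep

# Example: 4096 rays with 48 depth hypotheses each.
mask = sparsify_per_ray(torch.rand(4096, 48))
assert bool(mask.any(dim=1).all())            # every ray keeps >= 1 voxel
```
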
We further propose to predict the TSDF in a coarse-to-fine manner by learning
its residuals across scales, which leads to better TSDF predictions.
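
The residual scheme amounts to upsampling the coarse TSDF and adding a learned
correction at the finer scale, rather than regressing the fine TSDF from
scratch. The sketch below assumes a factor-2 resolution step and trilinear
upsampling; refine_tsdf and the random residual are placeholders for the
model's 3D CNN prediction head.

```python
import torch
import torch.nn.functional as F

def refine_tsdf(coarse_tsdf: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
    """Coarse-to-fine TSDF prediction via learned residuals.

    coarse_tsdf: (1, 1, X, Y, Z) TSDF volume predicted at the coarser scale.
    residual:    (1, 1, 2X, 2Y, 2Z) residual predicted at the finer scale
                 (in practice the output of a 3D CNN head; random here).
    """
    # Upsample the coarse prediction to the finer resolution, then correct it
    # with the predicted residual.
    up = F.interpolate(coarse_tsdf, scale_factor=2, mode="trilinear",
                       align_corners=False)
    return torch.clamp(up + residual, -1.0, 1.0)   # TSDF stays in [-1, 1]

coarse = torch.rand(1, 1, 24, 24, 24) * 2 - 1
fine = refine_tsdf(coarse, torch.randn(1, 1, 48, 48, 48) * 0.1)
```
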
Experimental results on benchmarks show that our method achieves superior
performance and recovers more scene details. Code is available at:
https://github.com/huiyu-gao/VisFusion