RayMVSNet++: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo
Learning-based multi-view stereo (MVS) has so far centered on 3D
convolution over cost volumes. Due to the high computation and memory
consumption of 3D CNNs, the resolution of the output depth is often
considerably limited.
Unlike most existing works dedicated to adaptive refinement of cost
volumes, we opt to directly optimize the depth value along each camera ray,
mimicking the range finding of a laser scanner. This reduces the MVS problem
to ray-based depth optimization, which is far more lightweight than full cost
volume optimization. In particular, we propose RayMVSNet, which learns
sequential prediction of a 1D implicit field along each camera ray with the
zero-crossing point indicating scene depth. This sequential modeling,
conducted on transformer features, essentially learns the epipolar line
search of traditional multi-view stereo. We devise a multi-task learning
scheme for better optimization convergence and depth accuracy. We find that
the monotonicity of the SDF along each ray greatly benefits depth
estimation. Our method
ranks first on both the DTU and Tanks & Temples datasets among all previous
learning-based methods, achieving an overall reconstruction score of 0.33mm on
DTU and an F-score of 59.48% on Tanks & Temples. It produces
high-quality depth estimation and point cloud reconstruction in challenging
scenarios such as objects/scenes with non-textured surfaces, severe
occlusion, and highly varying depth ranges. Further, we propose RayMVSNet++,
which enhances contextual feature aggregation for each ray by designing an
attentional gating unit that selects semantically relevant neighboring rays
within the local frustum around that ray. RayMVSNet++ achieves
state-of-the-art performance on
the ScanNet dataset. In particular, it attains an AbsRel of 0.058m and produces
accurate results on the two subsets of textureless regions and large depth
variation.

Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence. arXiv
admin note: substantial text overlap with arXiv:2204.0132
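The core idea of the first abstract, locating depth as the zero-crossing of a monotone 1D implicit field (SDF) sampled along each camera ray, can be illustrated with a minimal sketch. The sample depths, SDF values, and interpolation routine below are hypothetical toy data, not the paper's network output:

```python
import numpy as np

def ray_depth_from_sdf(t_vals, sdf_vals):
    """Locate the zero-crossing of a monotonically decreasing 1D SDF
    sampled along a camera ray, via linear interpolation.
    t_vals: sorted sample depths along the ray; sdf_vals: SDF at each sample."""
    # Find the first interval where the sign drops (positive -> non-positive).
    signs = np.sign(sdf_vals)
    idx = np.where(np.diff(signs) < 0)[0]
    if idx.size == 0:
        return None  # no surface crossing within the sampled depth range
    i = idx[0]
    s0, s1 = sdf_vals[i], sdf_vals[i + 1]
    t0, t1 = t_vals[i], t_vals[i + 1]
    # Linearly interpolate the depth at which the SDF equals zero.
    return t0 + (t1 - t0) * s0 / (s0 - s1)

# Toy ray: 5 samples over depths 1.0..2.0 with an SDF crossing zero
# between the 3rd and 4th samples (monotone, as the abstract notes).
t = np.linspace(1.0, 2.0, 5)
s = np.array([0.3, 0.15, 0.05, -0.1, -0.3])
depth = ray_depth_from_sdf(t, s)
```

The monotonicity assumption mentioned in the abstract is what makes this search well-posed: a monotone SDF along the ray has at most one zero-crossing, so the first sign flip is the surface.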
Learning Feature Matching via Matchable Keypoint-Assisted Graph Neural Network
Accurately matching local features between a pair of images is a challenging
computer vision task. Previous studies typically use attention-based graph
neural networks (GNNs) with fully-connected graphs over keypoints
within/across images for visual and geometric reasoning. However, in the
context of feature matching, many keypoints are non-repeatable due to
occlusion or detector failure, and are thus irrelevant for message passing. The
connectivity with non-repeatable keypoints not only introduces redundancy,
resulting in limited efficiency, but also interferes with the representation
aggregation process, leading to limited accuracy. Targeting both high
accuracy and efficiency, we propose MaKeGNN, a sparse attention-based GNN
architecture which bypasses non-repeatable keypoints and leverages matchable
ones to guide compact and meaningful message passing. More specifically, our
Bilateral Context-Aware Sampling Module first dynamically samples two small
sets of well-distributed keypoints with high matchability scores from the image
pair. Then, our Matchable Keypoint-Assisted Context Aggregation (MKACA)
Module regards the sampled informative keypoints as message bottlenecks,
constraining each keypoint to retrieve favorable contextual information only
from intra- and inter-image matchable keypoints and thereby evading
interference from irrelevant and redundant connectivity with non-repeatable
ones. Furthermore, considering the
potential noise in initial keypoints and sampled matchable ones, the MKACA
module adopts a matchability-guided attentional aggregation operation for purer
data-dependent context propagation. By these means, we achieve the
state-of-the-art performance on relative camera pose estimation, fundamental
matrix estimation, and visual localization, while significantly reducing
computational and memory complexity compared to typical attentional GNNs.
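The bottleneck mechanism described above can be sketched as follows: every keypoint attends only to a small sampled set of high-matchability keypoints, and the attention logits are biased by matchability scores. This is a simplified, hypothetical NumPy stand-in, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bottleneck_attention(feats, bottleneck_feats, matchability):
    """Sparse attentional aggregation through message bottlenecks.
    feats: (N, D) features of all keypoints in one image;
    bottleneck_feats: (M, D) features of M << N sampled matchable keypoints;
    matchability: (M,) scores in (0, 1] biasing the attention logits
    (a simplified stand-in for matchability-guided aggregation)."""
    d = feats.shape[-1]
    logits = feats @ bottleneck_feats.T / np.sqrt(d)   # (N, M) scaled dot-product
    logits = logits + np.log(matchability + 1e-8)      # down-weight unreliable keypoints
    attn = softmax(logits, axis=-1)                    # each row sums to 1
    return feats + attn @ bottleneck_feats             # residual message passing

# Toy usage: 100 keypoints retrieve context from 8 sampled matchable ones,
# so the attention map is (100, 8) rather than the (100, 100) of a
# fully-connected graph.
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 32))
bneck = rng.standard_normal((8, 32))
scores = rng.uniform(0.5, 1.0, size=8)
out = bottleneck_attention(feats, bneck, scores)
```

The efficiency claim follows from the shapes: with M sampled bottleneck keypoints, attention costs O(N*M) instead of the O(N^2) of fully-connected graphs over all keypoints.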