Recurrent MVSNet for High-resolution Multi-view Stereo Depth Inference
Deep learning has recently demonstrated excellent performance for multi-view stereo (MVS). However, one major limitation of current learned MVS approaches is scalability: the memory-consuming cost volume regularization makes learned MVS hard to apply to high-resolution scenes. In this paper, we introduce a scalable multi-view stereo framework based on a recurrent neural network. Instead of regularizing the entire 3D cost volume in one go, the proposed Recurrent Multi-view Stereo Network (R-MVSNet) sequentially regularizes the 2D cost maps along the depth direction via a gated recurrent unit (GRU). This dramatically reduces memory consumption and makes high-resolution reconstruction feasible. We first show the state-of-the-art performance achieved by the proposed R-MVSNet on recent MVS benchmarks. Then, we further demonstrate the scalability of the proposed method on several large-scale scenarios, where previous learned approaches often fail due to memory constraints. Code is available at https://github.com/YoYo000/MVSNet.
Comment: Accepted by CVPR 2019
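As a rough illustration of the sequential regularization idea, the sketch below (PyTorch; the convolutional GRU cell and tensor shapes are illustrative, not the paper's exact architecture) sweeps a stack of 2D cost maps along the depth direction, holding only one map and the recurrent state in memory at a time:

    import torch
    import torch.nn as nn

    class ConvGRUCell(nn.Module):
        """Convolutional GRU that regularizes one 2D cost map per step."""
        def __init__(self, channels):
            super().__init__()
            self.gates = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)
            self.cand = nn.Conv2d(2 * channels, channels, 3, padding=1)

        def forward(self, x, h):
            z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, 1)
            h_new = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
            return (1 - z) * h + z * h_new

    def regularize_cost_maps(cost_maps, cell):
        """cost_maps: list of (B, C, H, W) tensors, one per depth plane."""
        h = torch.zeros_like(cost_maps[0])
        out = []
        for c in cost_maps:   # sweep along the depth direction
            h = cell(c, h)    # only 2D state is kept, never a full 3D volume
            out.append(h)
        return out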
ContextDesc: Local Descriptor Augmentation with Cross-Modality Context
Most existing studies on learning local features focus on patch-based descriptions of individual keypoints, while neglecting the spatial relations established by their keypoint locations. In this paper, we go beyond the local detail representation by introducing context awareness to augment off-the-shelf local feature descriptors. Specifically, we propose a unified learning framework that leverages and aggregates cross-modality contextual information, including (i) visual context from high-level image representations, and (ii) geometric context from the 2D keypoint distribution. Moreover, we propose an effective N-pair loss that eschews the empirical hyper-parameter search and improves convergence. The proposed augmentation scheme is lightweight compared with the raw local feature description, yet yields remarkable improvements on several large-scale benchmarks with diversified scenes, demonstrating both strong practicality and generalization ability in geometric matching applications.
Comment: Accepted to CVPR 2019 (oral), supplementary materials included. (https://github.com/lzx551402/contextdesc)
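A hedged sketch of an N-pair-style descriptor loss like the one mentioned above (PyTorch; the batch construction and function name are assumptions, not the paper's exact formulation): each anchor is scored against every positive in the batch, so no margin hyper-parameter is needed:

    import torch
    import torch.nn.functional as F

    def n_pair_loss(anchors, positives):
        """anchors, positives: (N, D) descriptors; row i of each is a match."""
        sim = anchors @ positives.t()             # (N, N) similarity matrix
        labels = torch.arange(sim.shape[0], device=sim.device)
        return F.cross_entropy(sim, labels)       # softmax over all in-batch pairs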
Loss Functions for Multiset Prediction
We study the problem of multiset prediction. The goal of multiset prediction
is to train a predictor that maps an input to a multiset consisting of multiple
items. Unlike existing problems in supervised learning, such as classification,
ranking and sequence generation, there is no known order among items in a
target multiset, and each item in the multiset may appear more than once,
making this problem extremely challenging. In this paper, we propose a novel
multiset loss function by viewing this problem from the perspective of
sequential decision making. The proposed multiset loss function is empirically
evaluated on two families of datasets, one synthetic and the other real, with
varying levels of difficulty, against various baseline loss functions including
reinforcement learning, sequence, and aggregated distribution matching loss
functions. The experiments reveal the effectiveness of the proposed loss
function over the others.
Comment: NIPS 2018
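A minimal sketch of the sequential-decision view (PyTorch; the oracle construction follows the abstract's description, with illustrative names): at each prediction step, the target distribution is uniform over the items still missing from the multiset, and the per-step loss is the KL divergence to it:

    import torch
    import torch.nn.functional as F

    def multiset_step_loss(logits, remaining_counts):
        """logits: (V,) class scores; remaining_counts: (V,) items left to predict."""
        oracle = remaining_counts.float()
        oracle = oracle / oracle.sum()           # uniform over remaining items
        log_pred = F.log_softmax(logits, dim=0)
        return F.kl_div(log_pred, oracle, reduction='sum')

    # After each step, the predicted item's count is decremented and the
    # per-step losses are accumulated over the episode.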
Visibility-aware Multi-view Stereo Network
Learning-based multi-view stereo (MVS) methods have demonstrated promising
results. However, very few existing networks explicitly take pixel-wise visibility into consideration, resulting in erroneous cost aggregation from occluded pixels. In this paper, we explicitly infer and integrate pixel-wise occlusion information in the MVS network via matching uncertainty estimation. The pair-wise uncertainty map is jointly inferred with the pair-wise depth map and is further used as weighting guidance during the multi-view cost volume fusion. As such, the adverse influence of occluded pixels is suppressed in the cost fusion. The proposed framework, Vis-MVSNet, significantly improves depth accuracy in scenes with severe occlusion. Extensive experiments are performed on the DTU, BlendedMVS, and Tanks and Temples datasets to justify the effectiveness of the proposed framework.
Comment: Accepted to BMVC 2020
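A hedged sketch of uncertainty-weighted cost fusion of the kind described above (PyTorch; treating the pair-wise uncertainty as a log-variance and the exponential weighting as assumptions):

    import torch

    def fuse_pairwise_costs(pair_costs, pair_uncerts):
        """pair_costs: list of (B, D, H, W); pair_uncerts: list of (B, 1, H, W)."""
        weights = [torch.exp(-u) for u in pair_uncerts]   # confident pairs weigh more
        fused = sum(w * c for w, c in zip(weights, pair_costs))
        return fused / (sum(weights) + 1e-6)              # occluded pixels are suppressed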
MVSNet: Depth Inference for Unstructured Multi-view Stereo
We present an end-to-end deep learning architecture for depth map inference
from multi-view images. In the network, we first extract deep visual image
features, and then build the 3D cost volume upon the reference camera frustum
via the differentiable homography warping. Next, we apply 3D convolutions to
regularize and regress the initial depth map, which is then refined with the
reference image to generate the final output. Our framework flexibly adapts to arbitrary N-view inputs using a variance-based cost metric that maps multiple features into one cost feature. The proposed MVSNet is demonstrated on the large-scale indoor DTU dataset. With simple post-processing, our method not only significantly outperforms previous state-of-the-art methods, but is also several times faster in runtime. We also evaluate MVSNet on the complex outdoor Tanks and Temples dataset, where our method ranked first as of April 18, 2018 without any fine-tuning, showing the strong generalization ability of MVSNet.
Comment: Accepted to the European Conference on Computer Vision (ECCV 2018)
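The variance-based cost metric is simple enough to sketch directly (PyTorch; shapes are illustrative): the N warped feature volumes are reduced to a single cost volume via their per-voxel variance, which is what makes the network agnostic to the number of input views:

    import torch

    def variance_cost_volume(warped_features):
        """warped_features: (N, B, C, D, H, W) volumes in the reference frustum."""
        mean = warped_features.mean(dim=0, keepdim=True)
        return ((warped_features - mean) ** 2).mean(dim=0)   # (B, C, D, H, W)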
Learning Stereo Matchability in Disparity Regression Networks
Learning-based stereo matching has recently achieved promising results, yet it still has difficulty establishing reliable matches in weakly matchable regions that are textureless, non-Lambertian, or occluded. In this paper, we address this challenge by proposing a stereo matching network that considers pixel-wise matchability. Specifically, the network jointly regresses disparity and matchability maps from the 3D probability volume through expectation and entropy operations. Next, a learned attenuation is applied as the robust loss function to alleviate the influence of weakly matchable pixels during training. Finally, a matchability-aware disparity refinement is introduced to improve the depth inference in weakly matchable regions. The proposed deep stereo matchability (DSM) framework can either improve the matching result or accelerate the computation without sacrificing quality. Moreover, the DSM framework is portable to many recent stereo networks. Extensive experiments are conducted on the Scene Flow and KITTI stereo datasets to demonstrate the effectiveness of the proposed framework over state-of-the-art learning-based stereo methods.
Comment: Accepted to ICPR 2020
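A minimal sketch of the expectation and entropy operations described above (PyTorch; assuming a softmax-normalized probability volume over D disparity levels): disparity is the expectation over levels, and matchability is read off the per-pixel entropy:

    import torch

    def disparity_and_matchability(prob_volume):
        """prob_volume: (B, D, H, W), normalized over the D disparity levels."""
        B, D, H, W = prob_volume.shape
        levels = torch.arange(D, device=prob_volume.device,
                              dtype=prob_volume.dtype).view(1, D, 1, 1)
        disparity = (prob_volume * levels).sum(dim=1)   # soft-argmax expectation
        entropy = -(prob_volume * (prob_volume + 1e-12).log()).sum(dim=1)
        return disparity, entropy   # high entropy marks weakly matchable pixels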
BlendedMVS: A Large-scale Dataset for Generalized Multi-view Stereo Networks
While deep learning has recently achieved great success on multi-view stereo (MVS), limited training data makes it hard for trained models to generalize to unseen scenarios. Compared with other computer vision tasks, it is rather difficult to collect a large-scale MVS dataset, as it requires expensive active scanners and a labor-intensive process to obtain ground-truth 3D structures. In this paper, we introduce BlendedMVS, a novel large-scale dataset, to provide sufficient training ground truth for learning-based MVS. To create the dataset, we apply a 3D reconstruction pipeline to recover high-quality textured meshes from images of well-selected scenes. Then, we render these mesh models to color images and depth maps. To introduce ambient lighting information during training, the rendered color images are further blended with the input images to generate the training input. Our dataset contains over 17k high-resolution images covering a variety of scenes, including cities, architecture, sculptures, and small objects. Extensive experiments demonstrate that BlendedMVS endows the trained model with significantly better generalization ability compared with other MVS datasets. The dataset and pretrained models are available at https://github.com/YoYo000/BlendedMVS.
Comment: Accepted to CVPR 2020
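One plausible reading of the blending step, sketched below (PyTorch; the box-filter frequency split and kernel size are assumptions, not the paper's exact recipe): keep the low-frequency content, which carries the ambient lighting, from the captured input, and take the high-frequency detail from the rendering:

    import torch
    import torch.nn.functional as F

    def blend(rendered, captured, k=21):
        """rendered, captured: (B, 3, H, W) images; k: assumed low-pass kernel size."""
        low = lambda x: F.avg_pool2d(x, kernel_size=k, stride=1, padding=k // 2)
        return low(captured) + (rendered - low(rendered))   # lighting + rendered detail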
KFNet: Learning Temporal Camera Relocalization using Kalman Filtering
Temporal camera relocalization estimates the pose with respect to each video
frame in sequence, as opposed to one-shot relocalization which focuses on a
still image. Even though the time dependency has been taken into account,
current temporal relocalization methods still generally underperform the
state-of-the-art one-shot approaches in terms of accuracy. In this work, we improve temporal relocalization using a network architecture that incorporates Kalman filtering (KFNet) for online camera relocalization. In particular, KFNet extends the scene coordinate regression problem to the time domain in order to recursively establish 2D and 3D correspondences for pose determination. The network architecture design and the loss formulation are based on Kalman filtering in the context of Bayesian learning. Extensive experiments on multiple relocalization benchmarks show that KFNet achieves top accuracy among both one-shot and temporal relocalization approaches. Our code is released at https://github.com/zlthinker/KFNet.
Comment: An oral paper of CVPR 2020
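While KFNet learns its process and measurement models, the underlying fusion is the standard Kalman update, sketched per pixel below (PyTorch; independent per-coordinate variances are an assumption):

    import torch

    def kalman_update(prior_mean, prior_var, meas_mean, meas_var):
        """All tensors share a shape, e.g. (B, 3, H, W) scene coordinates."""
        gain = prior_var / (prior_var + meas_var)        # Kalman gain
        post_mean = prior_mean + gain * (meas_mean - prior_mean)
        post_var = (1.0 - gain) * prior_var              # variance shrinks after fusion
        return post_mean, post_var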
Self-Supervised Learning of Depth and Motion Under Photometric Inconsistency
Self-supervised learning of depth and pose from monocular sequences provides an attractive solution, as it exploits the photometric consistency of nearby frames and depends far less on ground-truth data. In this paper, we address the cases where the assumptions of previous self-supervised approaches are violated due to the dynamic nature of real-world scenes. Rather than handling the noise as uncertainty, our key idea is to incorporate more robust geometric quantities and enforce internal consistency in the temporal image sequence. As demonstrated on commonly used benchmark datasets, the proposed method substantially improves on state-of-the-art methods for both depth and relative pose estimation on monocular image sequences, without adding inference overhead.
Comment: International Conference on Computer Vision (ICCV) Workshop 2019
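For reference, a minimal photometric-consistency objective of the kind these methods build on (PyTorch; the masked L1 form and names are illustrative, and robust terms such as SSIM are omitted):

    import torch

    def photometric_loss(target, source_warped, valid_mask):
        """target, source_warped: (B, 3, H, W); valid_mask: (B, 1, H, W) in {0, 1}."""
        diff = (target - source_warped).abs().mean(dim=1, keepdim=True)
        return (diff * valid_mask).sum() / (valid_mask.sum() + 1e-6)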
ASLFeat: Learning Local Features of Accurate Shape and Localization
This work focuses on mitigating two limitations in the joint learning of
local feature detectors and descriptors. First, the ability to estimate the
local shape (scale, orientation, etc.) of feature points is often neglected during dense feature extraction, even though shape-awareness is crucial for acquiring stronger geometric invariance. Second, the localization accuracy of detected keypoints is not sufficient to reliably recover camera geometry, which has become the bottleneck in tasks such as 3D reconstruction. In this paper, we present ASLFeat, with three lightweight yet effective modifications to mitigate the above issues. First, we resort to deformable convolutional networks to densely estimate and apply local transformations. Second, we take advantage of the inherent feature hierarchy to restore spatial resolution and low-level details for accurate keypoint localization. Finally, we use a peakiness measurement to relate feature responses and derive more indicative detection scores. The effect of each modification is thoroughly studied, and the evaluation is extensively conducted across a variety of practical scenarios. State-of-the-art results are reported, demonstrating the superiority of our method.
Comment: Accepted to CVPR 2020, supplementary materials included, code available
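A hedged sketch of a peakiness-style detection score (PyTorch; the softplus form and window size are assumptions): a response counts as a keypoint if it stands out both across channels and within its spatial neighborhood:

    import torch
    import torch.nn.functional as F

    def peakiness_score(feat, k=3):
        """feat: (B, C, H, W) dense feature map."""
        ch = F.softplus(feat - feat.mean(dim=1, keepdim=True))   # channel peakiness
        sp = F.softplus(feat - F.avg_pool2d(feat, k, stride=1, padding=k // 2))
        return (ch * sp).max(dim=1)[0]                           # (B, H, W) score map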