Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching
The deep multi-view stereo (MVS) and stereo matching approaches generally
construct 3D cost volumes to regularize and regress the output depth or
disparity. These methods are limited when high-resolution outputs are needed
since the memory and time costs grow cubically as the volume resolution
increases. In this paper, we propose a both memory and time efficient cost
volume formulation that is complementary to existing multi-view stereo and
stereo matching approaches based on 3D cost volumes. First, the proposed cost
volume is built upon a standard feature pyramid encoding geometry and context
at gradually finer scales. Then, we can narrow the depth (or disparity) range
of each stage by the depth (or disparity) map from the previous stage. With
gradually higher cost volume resolution and adaptive adjustment of depth (or
disparity) intervals, the output is recovered in a coarse-to-fine manner.
We apply the cascade cost volume to the representative MVSNet, and obtain a
23.1% improvement on DTU benchmark (1st place), with 50.6% and 74.2% reduction
in GPU memory and run-time. It is also the state-of-the-art learning-based
method on Tanks and Temples benchmark. The statistics of accuracy, run-time and
GPU memory on other representative stereo CNNs also validate the effectiveness
of our proposed method. Comment: Accepted by CVPR 2020, Oral.
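The cascade idea above can be sketched numerically. The following is a toy schedule, not the authors' code: the name `cascade_ranges`, the shrink factor, and the hypothesis count are illustrative assumptions.

```python
def cascade_ranges(d_min, d_max, stages, shrink=0.5, samples=8):
    """Illustrative coarse-to-fine schedule: each stage keeps `samples`
    depth hypotheses but shrinks the searched range around the previous
    stage's estimate, so the interval between hypotheses gets finer as
    the cost volume resolution grows."""
    lo, hi = d_min, d_max
    center = 0.5 * (lo + hi)  # stand-in for the previous stage's predicted depth
    schedule = []
    for _ in range(stages):
        interval = (hi - lo) / samples
        schedule.append((lo, hi, interval))
        # narrow the range around the current estimate for the next stage
        half = 0.5 * (hi - lo) * shrink
        lo, hi = center - half, center + half
    return schedule

# e.g. a DTU-like depth range of 425-935 mm, three cascade stages
sched = cascade_ranges(425.0, 935.0, stages=3)
```

Each successive stage halves the searched range, so the per-hypothesis interval shrinks while the hypothesis count stays fixed; this is what keeps memory from growing cubically with output resolution.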
Self-Supervised Learning for Stereo Matching with Self-Improving Ability
Existing deep-learning-based dense stereo matching methods often rely on
ground-truth disparity maps as the training signals, which are however not
always available in many situations. In this paper, we design a simple
convolutional neural network architecture that is able to learn to compute
dense disparity maps directly from the stereo inputs. Training is performed in
an end-to-end fashion without the need of ground-truth disparity maps. The idea
is to use image warping error (instead of disparity-map residuals) as the loss
function to drive the learning process, aiming to find a depth-map that
minimizes the warping error. While this is a simple concept well-known in
stereo matching, to make it work in a deep-learning framework, many non-trivial
challenges must be overcome, and in this work we provide effective solutions.
Our network is self-adaptive to different unseen imageries as well as to
different camera settings. Experiments on KITTI and Middlebury stereo benchmark
datasets show that our method outperforms many state-of-the-art stereo matching
methods by a margin, and at the same time is significantly faster. Comment: 13 pages, 11 figures.
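The warping-error idea can be made concrete on a single scanline. This is a minimal sketch, not the paper's network: it uses nearest-neighbour warping, whereas a real deep model would use differentiable bilinear sampling so the loss can be backpropagated.

```python
def warp_right_to_left(right, disparity):
    """Warp a 1-D right scanline to the left view: left pixel x should
    match right pixel x - d(x). Nearest-neighbour for simplicity."""
    w = len(right)
    warped = []
    for x in range(w):
        src = x - int(round(disparity[x]))
        warped.append(right[src] if 0 <= src < w else 0.0)
    return warped

def warping_loss(left, right, disparity):
    """Mean absolute photometric error between the left image and the
    right image warped by the predicted disparity; driving this down
    trains the network without ground-truth disparity maps."""
    warped = warp_right_to_left(right, disparity)
    return sum(abs(a - b) for a, b in zip(left, warped)) / len(left)

# synthetic scanline pair related by a constant true disparity of 3
left = [float(x % 7) for x in range(20)]
right = [left[y + 3] if y + 3 < 20 else 0.0 for y in range(20)]
```

With the correct disparity the warping error is near zero (up to occluded border pixels), while a wrong disparity inflates it, which is exactly the gradient signal self-supervised training exploits.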
Learning monocular depth estimation infusing traditional stereo knowledge
Depth estimation from a single image represents a fascinating, yet
challenging problem with countless applications. Recent works proved that this
task could be learned without direct supervision from ground truth labels
leveraging image synthesis on sequences or stereo pairs. Focusing on this
second case, in this paper we leverage stereo matching in order to improve
monocular depth estimation. To this aim we propose monoResMatch, a novel deep
architecture designed to infer depth from a single input image by synthesizing
features from a different point of view, horizontally aligned with the input
image, performing stereo matching between the two cues. In contrast to previous
works sharing this rationale, our network is the first trained end-to-end from
scratch. Moreover, we show how obtaining proxy ground truth annotation through
traditional stereo algorithms, such as Semi-Global Matching, enables more
accurate monocular depth estimation while still avoiding the need for expensive
depth labels, keeping the approach self-supervised. Exhaustive experimental
results prove how the synergy between i) the proposed monoResMatch architecture
and ii) proxy-supervision attains state-of-the-art for self-supervised
monocular depth estimation. The code is publicly available at
https://github.com/fabiotosi92/monoResMatch-Tensorflow. Comment: accepted at CVPR 2019.
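The proxy-supervision idea above can be sketched as a masked regression loss. The names here are hypothetical, not from the paper; in practice the validity mask would come from a left-right consistency check on the traditional algorithm's output.

```python
def proxy_supervised_loss(pred, proxy, valid):
    """L1 loss against proxy disparities produced by a traditional stereo
    method (e.g. Semi-Global Matching), computed only where the proxy is
    marked valid, so unreliable proxy pixels do not pollute training."""
    num = sum(abs(p - t) for p, t, v in zip(pred, proxy, valid) if v)
    den = max(1, sum(1 for v in valid if v))
    return num / den
```

Because the proxy labels are generated automatically from stereo pairs, this keeps the pipeline self-supervised while giving a denser, more direct signal than image warping alone.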
CFNet: Cascade and Fused Cost Volume for Robust Stereo Matching
Recently, the ever-increasing capacity of large-scale annotated datasets has
led to profound progress in stereo matching. However, most of these successes
are limited to a specific dataset and cannot generalize well to other datasets.
The main difficulties lie in the large domain differences and unbalanced
disparity distribution across a variety of datasets, which greatly limit the
real-world applicability of current deep stereo matching models. In this paper,
we propose CFNet, a Cascade and Fused cost volume based network to improve the
robustness of the stereo matching network. First, we propose a fused cost
volume representation to deal with the large domain difference. By fusing
multiple low-resolution dense cost volumes to enlarge the receptive field, we
can extract robust structural representations for initial disparity estimation.
Second, we propose a cascade cost volume representation to alleviate the
unbalanced disparity distribution. Specifically, we employ a variance-based
uncertainty estimation to adaptively adjust the next stage disparity search
space, driving the network to progressively prune out the space of
unlikely correspondences. By iteratively narrowing down the disparity search
space and improving the cost volume resolution, the disparity estimation is
gradually refined in a coarse-to-fine manner. When trained on the same training
images and evaluated on KITTI, ETH3D, and Middlebury datasets with the fixed
model parameters and hyperparameters, our proposed method achieves the
state-of-the-art overall performance and obtains the 1st place on the stereo
task of Robust Vision Challenge 2020. The code will be available at
https://github.com/gallenszl/CFNet. Comment: accepted by CVPR 2021.
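The variance-based uncertainty step can be illustrated for one pixel of a cost volume. This is a sketch of the general soft-argmin construction, with illustrative names; the paper's exact scaling of the search window may differ.

```python
import math

def disparity_stats(costs, disparities):
    """Soft-argmin statistics over one pixel's cost slice: softmax(-cost)
    gives a probability over disparity hypotheses, from which we take the
    expected disparity and a variance-based uncertainty."""
    m = min(costs)
    weights = [math.exp(-(c - m)) for c in costs]
    z = sum(weights)
    probs = [w / z for w in weights]
    mean = sum(p * d for p, d in zip(probs, disparities))
    var = sum(p * (d - mean) ** 2 for p, d in zip(probs, disparities))
    return mean, var

def next_search_range(mean, var, k=1.0):
    """Next-stage disparity window: wider where the match is uncertain,
    tighter where the cost slice is sharply peaked."""
    std = math.sqrt(var)
    return mean - k * std, mean + k * std
```

A sharply peaked cost slice yields low variance and hence a tight next-stage window, which is how the cascade adapts to the unbalanced disparity distributions across datasets.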
A Large RGB-D Dataset for Semi-supervised Monocular Depth Estimation
Current self-supervised methods for monocular depth estimation are largely
based on deeply nested convolutional networks that leverage stereo image pairs
or monocular sequences during a training phase. However, they often exhibit
inaccurate results around occluded regions and depth boundaries. In this paper,
we present a simple yet effective approach for monocular depth estimation using
stereo image pairs. The study aims to propose a student-teacher strategy in
which a shallow student network is trained with the auxiliary information
obtained from a deeper and more accurate teacher network. Specifically, we
first train the stereo teacher network by fully utilizing the binocular
perception of 3-D geometry and then use the depth predictions of the teacher
network to train the student network for monocular depth inference. This
enables us to exploit all available depth data from massive unlabeled stereo
pairs. We propose a data-ensemble strategy that merges the multiple depth
predictions of the teacher network, improving the training samples by
collecting non-trivial knowledge beyond a single prediction. To
refine the inaccurate depth estimation that is used when training the student
network, we further propose a stereo confidence-guided regression loss that
handles unreliable pseudo depth values in occlusions, texture-less regions,
and repetitive patterns. To complement the existing dataset comprising outdoor
driving scenes, we built a novel large-scale dataset consisting of one million
outdoor stereo images taken using hand-held stereo cameras. Finally, we
demonstrate that the monocular depth estimation network provides feature
representations that are suitable for high-level vision tasks. The experimental
results for various outdoor scenarios demonstrate the effectiveness and
flexibility of our approach, which outperforms state-of-the-art approaches. Comment: https://dimlrgbd.github.io
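The confidence-guided regression idea above reduces to a weighted loss. A minimal sketch with hypothetical names; in the paper the weights would come from a learned or hand-crafted stereo confidence measure.

```python
def confidence_guided_loss(student_pred, teacher_depth, confidence):
    """Regress the student's monocular depth onto the stereo teacher's
    pseudo-depth, down-weighting pixels the confidence measure flags as
    unreliable (occlusions, texture-less or repetitive regions)."""
    n = len(student_pred)
    return sum(c * abs(s - t)
               for s, t, c in zip(student_pred, teacher_depth, confidence)) / n
```

Setting a pixel's confidence to zero removes its (possibly wrong) pseudo-label from the gradient, which is what protects the student from the teacher's failure modes.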
Learning for Disparity Estimation through Feature Constancy
Stereo matching algorithms usually consist of four steps, including matching
cost calculation, matching cost aggregation, disparity calculation, and
disparity refinement. Existing CNN-based methods only adopt CNN to solve parts
of the four steps, or use different networks to deal with different steps,
making them difficult to obtain the overall optimal solution. In this paper, we
propose a network architecture to incorporate all steps of stereo matching. The
network consists of three parts. The first part calculates the multi-scale
shared features. The second part performs matching cost calculation, matching
cost aggregation and disparity calculation to estimate the initial disparity
using shared features. The initial disparity and the shared features are used
to calculate the feature constancy that measures the correctness of the
correspondence between two input images. The initial disparity and the feature
constancy are then fed to a sub-network to refine the initial disparity. The
proposed method has been evaluated on the Scene Flow and KITTI datasets. It
achieves the state-of-the-art performance on the KITTI 2012 and KITTI 2015
benchmarks while maintaining a very fast running time. Comment: Accepted by CVPR 2018, 10 pages, 3 figures.
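The feature constancy measure can be sketched in one dimension. This is an illustrative reconstruction, not the paper's implementation: real features are multi-channel CNN activations and the warp is bilinear.

```python
def feature_constancy(left_feat, right_feat, disparity):
    """Per-pixel feature reconstruction error: warp right-view features
    to the left view with the initial disparity and compare. Large values
    flag pixels whose disparity is likely wrong and should be refined by
    the subsequent sub-network."""
    w = len(left_feat)
    err = []
    for x in range(w):
        src = x - int(round(disparity[x]))
        if 0 <= src < w:
            err.append(abs(left_feat[x] - right_feat[src]))
        else:
            err.append(0.0)  # warped out of view: no evidence either way
    return err
```

Feeding this error map, together with the initial disparity, into a refinement sub-network is what lets a single architecture cover both disparity calculation and disparity refinement.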
Multi-Scale Cost Volumes Cascade Network for Stereo Matching
Stereo matching is essential for robot navigation. However, the accuracy of
current widely used traditional methods is low, while CNN-based methods incur
high computational cost and long running times, as the choice of cost volume
plays a crucial role in balancing speed and accuracy. Thus we propose
MSCVNet, which combines traditional methods and neural networks to improve the
quality of cost volume. Concretely, our network first generates multiple 3D
cost volumes with different resolutions and then uses 2D convolutions to
construct a novel cascade hourglass network for cost aggregation. Meanwhile, we
design an algorithm to distinguish and calculate the loss for discontinuous
areas of disparity result. According to the KITTI official website, our network
is much faster than most top-performing methods (24 times faster than CSPN, 44
times faster than GANet, etc.). Meanwhile, compared to traditional methods (SPS-St, SGM) and
other real-time stereo matching networks (Fast DS-CS, DispNetC, and RTSNet,
etc.), our network achieves a significant improvement in accuracy, demonstrating the
feasibility and capability of the proposed method.
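The special handling of discontinuous disparity areas can be sketched as a mask over sharp depth jumps. This is a guess at the spirit of the paper's algorithm, with an illustrative threshold, not its exact formulation.

```python
def discontinuity_mask(disparity, thresh=1.0):
    """Flag pixels whose disparity jumps sharply relative to the previous
    neighbour; such depth-discontinuity pixels can then be weighted
    separately when computing the training loss."""
    return [1 if i > 0 and abs(disparity[i] - disparity[i - 1]) > thresh else 0
            for i in range(len(disparity))]
```

Weighting these pixels differently in the loss counters the tendency of smooth regression losses to blur object boundaries.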
PatchmatchNet: Learned Multi-View Patchmatch Stereo
We present PatchmatchNet, a novel and learnable cascade formulation of
Patchmatch for high-resolution multi-view stereo. With high computation speed
and low memory requirement, PatchmatchNet can process higher resolution imagery
and is more suited to run on resource limited devices than competitors that
employ 3D cost volume regularization. For the first time we introduce an
iterative multi-scale Patchmatch in an end-to-end trainable architecture and
improve the Patchmatch core algorithm with a novel and learned adaptive
propagation and evaluation scheme for each iteration. Extensive experiments
show a very competitive performance and generalization for our method on DTU,
Tanks & Temples and ETH3D, but at a significantly higher efficiency than all
existing top-performing models: at least two and a half times faster than
state-of-the-art methods, with half the memory usage.
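The Patchmatch core loop can be sketched in one dimension. This is a classic PatchMatch skeleton under a toy cost function, not the learned, adaptive variant the paper proposes; names and the perturbation radius are illustrative.

```python
import random

def patchmatch_1d(cost, init, iters=3):
    """1-D PatchMatch sketch: each pixel keeps a depth hypothesis and
    repeatedly (a) propagates a neighbour's hypothesis if it scores
    better and (b) tries a random perturbation. cost(x, h) is any
    per-pixel matching cost; lower is better. Only improvements are
    ever accepted."""
    d = list(init)
    n = len(d)
    rng = random.Random(0)  # fixed seed for a reproducible sketch
    for it in range(iters):
        forward = (it % 2 == 0)  # alternate sweep direction
        order = range(n) if forward else range(n - 1, -1, -1)
        for x in order:
            # propagation from the previously visited neighbour
            nb = x - 1 if forward else x + 1
            if 0 <= nb < n and cost(x, d[nb]) < cost(x, d[x]):
                d[x] = d[nb]
            # random search around the current hypothesis
            cand = d[x] + rng.uniform(-1.0, 1.0)
            if cost(x, cand) < cost(x, d[x]):
                d[x] = cand
    return d

# toy cost: true depth is 5.0 everywhere, start all hypotheses at 0.0
depth = patchmatch_1d(lambda x, h: abs(h - 5.0), [0.0] * 8)
```

Because good hypotheses spread along each sweep instead of being searched exhaustively, no dense cost volume is ever built, which is the source of the low memory footprint.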
ActiveStereoNet: End-to-End Self-Supervised Learning for Active Stereo Systems
In this paper we present ActiveStereoNet, the first deep learning solution
for active stereo systems. Due to the lack of ground truth, our method is fully
self-supervised, yet it produces precise depth with a subpixel precision of
1/30th of a pixel; it does not suffer from the common over-smoothing issues;
it preserves the edges; and it explicitly handles occlusions. We introduce a
novel reconstruction loss that is more robust to noise and texture-less
patches, and is invariant to illumination changes. The proposed loss is
optimized using a window-based cost aggregation with an adaptive support weight
scheme. This cost aggregation is edge-preserving and smooths the loss function,
which is key to allow the network to reach compelling results. Finally we show
how the task of predicting invalid regions, such as occlusions, can be trained
end-to-end without ground-truth. This component is crucial to reduce blur and
particularly improves predictions along depth discontinuities. Extensive
quantitative and qualitative evaluations on real and synthetic data
demonstrate state-of-the-art results in many challenging scenes. Comment: Accepted by ECCV 2018, Oral Presentation, Main paper + Supplementary Material
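The adaptive-support-weight aggregation above can be illustrated in one dimension. A minimal sketch of the general technique (bilateral-style weights over space and intensity), with assumed parameter values, not the paper's exact scheme.

```python
import math

def asw_aggregate(loss, intensity, center, radius=2, sigma_c=10.0, sigma_s=2.0):
    """Window-based aggregation of a per-pixel loss with adaptive support
    weights: neighbours close in both position and intensity (likely the
    same surface) contribute more, smoothing the loss while preserving
    edges."""
    num = den = 0.0
    for x in range(max(0, center - radius), min(len(loss), center + radius + 1)):
        w = math.exp(-abs(intensity[x] - intensity[center]) / sigma_c
                     - abs(x - center) / sigma_s)
        num += w * loss[x]
        den += w
    return num / den
```

Across an intensity edge the far-side pixels receive nearly zero weight, so the aggregated loss stays faithful to the near side instead of averaging over a depth discontinuity, which is the edge-preserving property the abstract describes.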
LiStereo: Generate Dense Depth Maps from LIDAR and Stereo Imagery
An accurate depth map of the environment is critical to the safe operation of
autonomous robots and vehicles. Currently, either light detection and ranging
(LIDAR) or stereo matching algorithms are used to acquire such depth
information. However, a high-resolution LIDAR is expensive and produces sparse
depth maps at long range; stereo matching algorithms are able to generate
denser depth maps but are typically less accurate than LIDAR at long range.
This paper combines these approaches together to generate high-quality dense
depth maps. Unlike previous approaches that are trained using ground-truth
labels, the proposed model adopts a self-supervised training process.
Experiments show that the proposed method is able to generate high-quality
dense depth maps and performs robustly even with low-resolution inputs. This
shows the potential to reduce the cost by using LIDARs with lower resolution in
concert with stereo systems while maintaining high resolution. Comment: 14 pages, 3 figures, 5 tables.
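As a point of contrast with the learned model above, the simplest fusion baseline just overrides stereo depth wherever a LiDAR return exists. This naive sketch is purely illustrative, not the paper's method, which learns the fusion end-to-end.

```python
def fuse_depth(stereo_depth, lidar_depth):
    """Naive LiDAR-stereo fusion baseline: trust the sparse but accurate
    LiDAR return where one exists (None marks a missing point), and fall
    back to the dense stereo estimate elsewhere."""
    return [l if l is not None else s
            for s, l in zip(stereo_depth, lidar_depth)]
```

A learned model improves on this by letting the accurate sparse points also correct nearby stereo values rather than only the exact pixels they land on.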