StereoVoxelNet: Real-Time Obstacle Detection Based on Occupancy Voxels from a Stereo Camera Using Deep Neural Networks
Obstacle detection is a safety-critical problem in robot navigation, where
stereo matching is a popular vision-based approach. While deep neural networks
have shown impressive results in computer vision, most of the previous obstacle
detection works only leverage traditional stereo matching techniques to meet
the computational constraints for real-time feedback. This paper proposes a
computationally efficient method that leverages a deep neural network to detect
occupancy from stereo images directly. Instead of learning the point cloud
correspondence from the stereo data, our approach extracts the compact obstacle
distribution based on volumetric representations. In addition, we prune the
computation of safety-irrelevant spaces in a coarse-to-fine manner based on
octrees generated by the decoder. As a result, we achieve real-time performance
on the onboard computer (NVIDIA Jetson TX2). Our approach detects obstacles
accurately within a range of 32 meters and achieves better IoU (Intersection over
Union) and CD (Chamfer Distance) scores with only 2% of the computation cost of
the state-of-the-art stereo model. Furthermore, we validate our method's
robustness and real-world feasibility through autonomous navigation experiments
with a real robot. Hence, our work contributes toward closing the gap between
the stereo-based system in robot perception and state-of-the-art stereo models
in computer vision. To counter the scarcity of high-quality real-world indoor
stereo datasets, we collect a 1.36-hour stereo dataset with a Jackal robot,
which is used to fine-tune our model. The dataset, the code, and more
visualizations are available at https://lhy.xyz/stereovoxelnet
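The two scores named in this abstract, IoU and Chamfer Distance, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `voxel_iou` and `chamfer_distance` are hypothetical, occupied voxels are assumed to be given as integer index triples, and the Chamfer variant shown (unsquared distances, averaged per direction, then summed) is only one of several conventions in use.

```python
import math


def voxel_iou(pred, truth):
    """IoU between two collections of occupied voxel indices (i, j, k)."""
    pred, truth = set(pred), set(truth)
    union = pred | truth
    if not union:
        # Assumption: two empty grids count as perfect agreement.
        return 1.0
    return len(pred & truth) / len(union)


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Assumed convention: mean nearest-neighbour Euclidean distance
    from A to B, plus the same quantity from B to A.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


predicted = [(0, 0, 0), (1, 0, 0)]
ground_truth = [(1, 0, 0), (2, 0, 0)]
iou = voxel_iou(predicted, ground_truth)
cd = chamfer_distance(predicted, ground_truth)
```

Higher IoU and lower Chamfer Distance both indicate closer agreement between predicted and ground-truth occupancy. The brute-force nearest-neighbour search is quadratic and only suitable for small examples.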
WHU-Stereo: A Challenging Benchmark for Stereo Matching of High-Resolution Satellite Images
Stereo matching of high-resolution satellite images (HRSI) is still a
fundamental but challenging task in the field of photogrammetry and remote
sensing. Recently, deep learning (DL) methods, especially convolutional neural
networks (CNNs), have demonstrated tremendous potential for stereo matching on
public benchmark datasets. However, datasets for stereo matching of satellite
images are scarce. To facilitate further research, this paper creates and
publishes a challenging dataset, termed WHU-Stereo, for stereo matching DL
network training and testing. This dataset is created by using airborne LiDAR
point clouds and high-resolution stereo imageries taken from the Chinese
GaoFen-7 satellite (GF-7). The WHU-Stereo dataset contains more than 1700
epipolar rectified image pairs, which cover six areas in China and include
various kinds of landscapes. We have assessed the accuracy of the ground-truth
disparity maps and show that our dataset achieves precision comparable to
that of existing state-of-the-art stereo matching datasets. To verify its
feasibility, in experiments, the hand-crafted SGM stereo matching algorithm and
recent deep learning networks have been tested on the WHU-Stereo dataset.
Experimental results show that deep learning networks can be well trained and
achieve higher performance than the hand-crafted SGM algorithm, and that the dataset
has great potential in remote sensing applications. The WHU-Stereo dataset can
serve as a challenging benchmark for stereo matching of high-resolution
satellite images and for the performance evaluation of deep learning models. Our
dataset is available at https://github.com/Sheng029/WHU-Stere
Self-Supervised Intensity-Event Stereo Matching
Event cameras are novel bio-inspired vision sensors that output pixel-level
intensity changes with microsecond accuracy, high dynamic range, and low
power consumption. Despite these advantages, event cameras cannot be directly
applied to computational imaging tasks due to the inability to obtain
high-quality intensity and events simultaneously. This paper aims to connect a
standalone event camera and a modern intensity camera so that the applications
can take advantage of both sensors. We establish this connection through a
multi-modal stereo matching task. We first convert events to a reconstructed
image and extend the existing stereo networks to this multi-modality condition.
We propose a self-supervised method to train the multi-modal stereo network
without using ground truth disparity data. The structure loss calculated on
image gradients is used to enable self-supervised learning on such multi-modal
data. Exploiting the internal stereo constraint between views with different
modalities, we introduce general stereo loss functions, including disparity
cross-consistency loss and internal disparity loss, leading to improved
performance and robustness compared to existing approaches. The experiments
demonstrate the effectiveness of the proposed method, especially the proposed
general stereo loss functions, on both synthetic and real datasets. Finally, we
shed light on employing the aligned events and intensity images in downstream
tasks, e.g., video interpolation.
Comment: This paper has been accepted by the Journal of Imaging Science &
Technolog
The Monocular Depth Estimation Challenge
This paper summarizes the results of the first Monocular Depth Estimation
Challenge (MDEC) organized at WACV2023. This challenge evaluated the progress
of self-supervised monocular depth estimation on the challenging SYNS-Patches
dataset. The challenge was organized on CodaLab and received submissions from 4
valid teams. Participants were provided a devkit containing updated reference
implementations for 16 State-of-the-Art algorithms and 4 novel techniques. The
threshold for acceptance for novel techniques was to outperform every one of
the 16 SotA baselines. All participants outperformed the baseline in
traditional metrics such as MAE or AbsRel. However, pointcloud reconstruction
metrics were challenging to improve upon. We found predictions were
characterized by interpolation artefacts at object boundaries and errors in
relative object positioning. We hope this challenge is a valuable contribution
to the community and encourage authors to participate in future editions.
Comment: WACV-Workshops 202
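The traditional metrics mentioned in this abstract, MAE and AbsRel, can be sketched as follows. This is an illustrative example rather than the challenge devkit: the function names are hypothetical, and depths are assumed to be flat lists containing only valid pixels with strictly positive ground truth.

```python
def mae(pred, target):
    """Mean absolute error between predicted and ground-truth depths."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def abs_rel(pred, target):
    """Mean absolute relative error; assumes every target depth is > 0."""
    return sum(abs(p - t) / t for p, t in zip(pred, target)) / len(target)


predicted_depth = [2.0, 4.0]
true_depth = [1.0, 5.0]
mae_value = mae(predicted_depth, true_depth)
abs_rel_value = abs_rel(predicted_depth, true_depth)
```

AbsRel divides each error by the true depth, so the same absolute error counts for more on nearby surfaces than on distant ones.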
RomniStereo: Recurrent Omnidirectional Stereo Matching
Omnidirectional stereo matching (OSM) is an essential and reliable means for
depth sensing. However, following earlier works on conventional
stereo matching, prior state-of-the-art (SOTA) methods rely on a 3D
encoder-decoder block to regularize the cost volume, which makes the whole system
complicated and leads to sub-optimal results. Recently, the Recurrent All-pairs Field
Transforms (RAFT) based approach employs the recurrent update in 2D and has
efficiently improved image-matching tasks, i.e., optical flow and stereo
matching. To bridge the gap between OSM and RAFT, we mainly propose an opposite
adaptive weighting scheme to seamlessly transform the outputs of spherical
sweeping of OSM into the required inputs for the recurrent update, thus
creating a recurrent omnidirectional stereo matching (RomniStereo) algorithm.
Furthermore, we introduce two techniques, i.e., grid embedding and adaptive
context feature generation, which also contribute to RomniStereo's performance.
Our best model improves the average MAE metric by 40.7% over the previous SOTA
baseline across five datasets. When visualizing the results, our models
demonstrate clear advantages on both synthetic and realistic examples. The code
is available at https://github.com/HalleyJiang/RomniStereo.
Comment: accepted by IEEE RA-L, https://github.com/HalleyJiang/RomniStere
CVRecon: Rethinking 3D Geometric Feature Learning For Neural Reconstruction
Recent advances in neural reconstruction using posed image sequences have
made remarkable progress. However, due to the lack of depth information,
existing volumetric-based techniques simply duplicate 2D image features of the
object surface along the entire camera ray. We contend this duplication
introduces noise in empty and occluded spaces, posing challenges for producing
high-quality 3D geometry. Drawing inspiration from traditional multi-view
stereo methods, we propose an end-to-end 3D neural reconstruction framework
CVRecon, designed to exploit the rich geometric embedding in the cost volumes
to facilitate 3D geometric feature learning. Furthermore, we present
Ray-contextual Compensated Cost Volume (RCCV), a novel 3D geometric feature
representation that encodes view-dependent information with improved integrity
and robustness. Through comprehensive experiments, we demonstrate that our
approach significantly improves the reconstruction quality in various metrics
and recovers clear fine details of the 3D geometries. Our extensive ablation
studies provide insights into the development of effective 3D geometric feature
learning schemes. Project page: https://cvrecon.ziyue.cool
TemporalStereo: Efficient Spatial-Temporal Stereo Matching Network
We present TemporalStereo, a coarse-to-fine online stereo matching
network that is highly efficient and able to effectively exploit past
geometry and context information to boost matching accuracy. Our network
leverages a sparse cost volume and proves to be effective when a single stereo
pair is given. However, its peculiar ability to use spatio-temporal information
across frames allows TemporalStereo to alleviate problems such as occlusions
and reflective regions while remaining highly efficient in the case of
stereo sequences. Notably, our model, once trained with stereo videos, can run
in both single-pair and temporal modes seamlessly. Experiments show that our
network relying on camera motion is even robust to dynamic objects when running
on videos. We validate TemporalStereo through extensive experiments on
synthetic (SceneFlow, TartanAir) and real (KITTI 2012, KITTI 2015) datasets.
Detailed results show that our model achieves state-of-the-art performance on
all of these datasets. Code is available at
https://github.com/youmi-zym/TemporalStereo.git
Self-supervised monocular image depth learning and confidence estimation.
We present a novel self-supervised framework for monocular image depth learning and confidence estimation. Our framework reduces the amount of ground truth annotation data required for training Convolutional Neural Networks (CNNs), which is often a challenging problem for the fast deployment of CNNs in many computer vision tasks. Our DepthNet adopts a novel fully differentiable patch-based cost function based on the Zero-Mean Normalized Cross-Correlation (ZNCC), using multi-scale patches as its matching and learning strategy. This approach greatly increases the accuracy and robustness of the depth learning. Because the proposed patch-based cost function naturally provides a 0-to-1 confidence, it is used to self-supervise the training of a parallel network for confidence map learning and estimation, exploiting the fact that ZNCC is a normalised measure of similarity that can be treated as an approximation of the confidence of the depth estimation. The proposed confidence map learning and estimation therefore operates in a self-supervised manner, in a network parallel to the DepthNet. Evaluations on the KITTI depth prediction evaluation dataset and the Make3D dataset show that our method outperforms the state-of-the-art results.
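For readers unfamiliar with ZNCC, the similarity measure this abstract builds on can be sketched as below. This is not the paper's cost function: patches are assumed to be flattened lists of intensities, the `eps` stabiliser is an assumption added to avoid division by zero, and the linear rescaling in `zncc_confidence` from the [-1, 1] range to [0, 1] is a hypothetical mapping that the abstract does not specify.

```python
import math


def zncc(patch_a, patch_b, eps=1e-8):
    """Zero-mean normalised cross-correlation of two equal-length patches.

    Returns a value in [-1, 1]; eps (an assumption) guards against
    division by zero on constant patches.
    """
    n = len(patch_a)
    mean_a = sum(patch_a) / n
    mean_b = sum(patch_b) / n
    dev_a = [x - mean_a for x in patch_a]
    dev_b = [y - mean_b for y in patch_b]
    numerator = sum(x * y for x, y in zip(dev_a, dev_b))
    denominator = math.sqrt(sum(x * x for x in dev_a) * sum(y * y for y in dev_b))
    return numerator / (denominator + eps)


def zncc_confidence(patch_a, patch_b):
    """Hypothetical mapping of ZNCC from [-1, 1] to a 0-to-1 confidence."""
    return 0.5 * (zncc(patch_a, patch_b) + 1.0)


# Patches that differ only by gain and offset correlate almost perfectly.
similar = zncc([1.0, 2.0, 3.0], [12.0, 14.0, 16.0])
opposite = zncc([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Because the mean is subtracted and the result is normalised, ZNCC is insensitive to brightness and contrast changes between the two patches, which is why it is a common choice for photometric matching.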