SkipcrossNets: Adaptive Skip-cross Fusion for Road Detection
Multi-modal fusion is increasingly being used for autonomous driving tasks,
as images from different modalities provide unique information for feature
extraction. However, existing two-stream networks fuse the modalities only at a
specific network layer, which requires extensive manual trial and error to set up. As
the CNN goes deeper, the features of the two modalities become increasingly
abstract, and fusion at feature levels with a large gap between them
can easily hurt performance. In this study, we propose a novel fusion
architecture called skip-cross networks (SkipcrossNets), which adaptively
combines LiDAR point clouds and camera images without being bound to a
certain fusion epoch. Specifically, skip-cross connects each layer to every
layer of the other modality in a feed-forward manner: for each layer, the
feature maps of all preceding layers are used as input, and its own feature
maps are used as input to all subsequent layers of the other modality,
enhancing feature propagation and multi-modal feature fusion. This strategy
facilitates selection of the most
similar feature layers from two data pipelines, providing a complementary
effect for sparse point cloud features during fusion processes. The network is
also divided into several blocks to reduce the complexity of feature fusion and
the number of model parameters. The advantages of skip-cross fusion were
demonstrated through application to the KITTI and A2D2 datasets, achieving a
MaxF score of 96.85% on KITTI and an F1 score of 84.84% on A2D2. The model
requires only 2.33 MB of memory for its parameters and runs at 68.24 FPS,
making it viable for mobile terminals and embedded devices.
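The abstract describes the architecture only; as a rough sketch of the skip-cross connectivity it outlines, the following PyTorch-style snippet gives each layer of one stream the feature maps of all preceding layers of the other stream as additional input. The module names, channel sizes, and concatenation-based fusion are illustrative assumptions, not the authors' implementation.

```python
# Minimal, illustrative sketch of skip-cross fusion between two streams
# (e.g., camera and LiDAR). Hypothetical layer sizes and concatenation
# fusion; NOT the authors' implementation.
import torch
import torch.nn as nn


class SkipCrossBlock(nn.Module):
    """Two parallel streams; every layer consumes its own input plus the
    feature maps of all preceding layers of the *other* stream."""

    def __init__(self, channels=(16, 16, 16), in_ch=3):
        super().__init__()
        self.cam_layers = nn.ModuleList()
        self.lidar_layers = nn.ModuleList()
        cam_in, lidar_in = in_ch, in_ch
        cross_cam, cross_lidar = 0, 0  # channels accumulated from the other stream
        for ch in channels:
            self.cam_layers.append(
                nn.Conv2d(cam_in + cross_lidar, ch, 3, padding=1))
            self.lidar_layers.append(
                nn.Conv2d(lidar_in + cross_cam, ch, 3, padding=1))
            cross_cam += ch    # this layer's camera output feeds later LiDAR layers
            cross_lidar += ch  # and vice versa
            cam_in, lidar_in = ch, ch

    def forward(self, cam, lidar):
        cam_feats, lidar_feats = [], []
        for cam_layer, lidar_layer in zip(self.cam_layers, self.lidar_layers):
            cam_next = torch.relu(cam_layer(torch.cat([cam] + lidar_feats, dim=1)))
            lidar_next = torch.relu(lidar_layer(torch.cat([lidar] + cam_feats, dim=1)))
            cam_feats.append(cam_next)
            lidar_feats.append(lidar_next)
            cam, lidar = cam_next, lidar_next
        return cam, lidar


# Toy usage: both modalities assumed to share the same spatial resolution.
block = SkipCrossBlock()
rgb = torch.randn(1, 3, 64, 64)
lidar_proj = torch.randn(1, 3, 64, 64)  # e.g., projected point-cloud maps
out_cam, out_lidar = block(rgb, lidar_proj)
```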
Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution
Self-driving cars need to understand 3D scenes efficiently and accurately in
order to drive safely. Given the limited hardware resources, existing 3D
perception models are not able to recognize small instances (e.g., pedestrians,
cyclists) very well due to the low-resolution voxelization and aggressive
downsampling. To this end, we propose Sparse Point-Voxel Convolution (SPVConv),
a lightweight 3D module that equips the vanilla Sparse Convolution with a
high-resolution point-based branch. With negligible overhead, this point-based
branch is able to preserve the fine details even from large outdoor scenes. To
explore the spectrum of efficient 3D models, we first define a flexible
architecture design space based on SPVConv, and we then present 3D Neural
Architecture Search (3D-NAS) to search the optimal network architecture over
this diverse design space efficiently and effectively. Experimental results
validate that the resulting SPVNAS model is fast and accurate: it outperforms
the state-of-the-art MinkowskiNet by 3.3%, ranking 1st on the competitive
SemanticKITTI leaderboard. It also achieves 8x computation reduction and 3x
measured speedup over MinkowskiNet with higher accuracy. Finally, we transfer
our method to 3D object detection, and it achieves consistent improvements over
the one-stage detection baseline on KITTI.
Comment: ECCV 2020. The first two authors contributed equally to this work. Project page: http://spvnas.mit.edu
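As a loose illustration of the point-voxel idea described above (a coarse voxel branch complemented by a high-resolution point-wise branch), the sketch below averages point features into voxels, processes both branches, and fuses them per point. The real SPVConv relies on sparse 3D convolution kernels; here a per-voxel MLP stands in for that branch, and all names and sizes are assumptions for illustration only.

```python
# Conceptual sketch of a point-voxel block: a coarse voxel branch plus a
# high-resolution point-wise branch, fused per point. A per-voxel MLP is a
# stand-in for the actual sparse convolution; NOT the SPVConv implementation.
import torch
import torch.nn as nn


def voxelize(points, feats, voxel_size):
    """Average point features into voxels; also return the point-to-voxel
    index, which is reused to scatter voxel features back (devoxelize)."""
    coords = torch.floor(points / voxel_size).long()
    uniq, inv = torch.unique(coords, dim=0, return_inverse=True)
    vox = torch.zeros(uniq.shape[0], feats.shape[1])
    cnt = torch.zeros(uniq.shape[0], 1)
    vox.index_add_(0, inv, feats)
    cnt.index_add_(0, inv, torch.ones(feats.shape[0], 1))
    return vox / cnt.clamp(min=1), inv


class PointVoxelBlock(nn.Module):
    def __init__(self, in_ch=4, out_ch=32, voxel_size=0.2):
        super().__init__()
        self.voxel_size = voxel_size
        # Placeholder for the sparse-convolution branch (low resolution).
        self.voxel_mlp = nn.Sequential(nn.Linear(in_ch, out_ch), nn.ReLU())
        # High-resolution point-based branch (preserves fine details).
        self.point_mlp = nn.Sequential(nn.Linear(in_ch, out_ch), nn.ReLU())

    def forward(self, points, feats):
        vox_feats, inv = voxelize(points, feats, self.voxel_size)
        vox_out = self.voxel_mlp(vox_feats)   # coarse, voxel-level features
        point_out = self.point_mlp(feats)     # fine, per-point features
        # Devoxelize (gather each point's voxel feature) and fuse the branches.
        return point_out + vox_out[inv]


# Toy usage with random (x, y, z) points and 4-d per-point features.
pts = torch.rand(1024, 3) * 10.0
ft = torch.randn(1024, 4)
fused = PointVoxelBlock()(pts, ft)  # shape (1024, 32)
```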
RDFC-GAN: RGB-Depth Fusion CycleGAN for Indoor Depth Completion
The raw depth image captured by indoor depth sensors usually has an extensive
range of missing depth values due to inherent limitations such as the inability
to perceive transparent objects and the limited distance range. The incomplete
depth map with missing values burdens many downstream vision tasks, and a
rising number of depth completion methods have been proposed to alleviate this
issue. While most existing methods can generate accurate dense depth maps from
sparse and uniformly sampled depth maps, they are not suitable for completing
large contiguous regions of missing depth values, which are common and critical
in images captured in indoor environments. To overcome these
challenges, we design a novel two-branch end-to-end fusion network named
RDFC-GAN, which takes a pair of RGB and incomplete depth images as input to
predict a dense and completed depth map. The first branch employs an
encoder-decoder structure that adheres to the Manhattan world assumption and
uses normal maps derived from the RGB-D input as guidance to regress local
dense depth values from the raw depth map. In the other branch, we propose an
RGB-depth fusion CycleGAN to transfer the RGB image to the fine-grained
textured depth map. We adopt adaptive fusion modules named W-AdaIN to propagate
the features across the two branches, and we append a confidence fusion head to
fuse the two outputs of the branches for the final depth map. Extensive
experiments on NYU-Depth V2 and SUN RGB-D demonstrate that our proposed method
clearly improves the depth completion performance, especially in a more
realistic setting of indoor environments, with the help of our proposed pseudo
depth maps in training.
Comment: Haowen Wang and Zhengping Che contributed equally. Under review. An earlier version has been accepted by CVPR 2022 (arXiv:2203.10856 - …
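As a rough illustration of the confidence fusion head mentioned above, which blends the depth predictions of the two branches, the snippet below learns a per-pixel confidence map and uses it as a convex weight between the two outputs. The layer sizes and the extra feature input are assumptions for illustration; the W-AdaIN modules and the CycleGAN branch are not reproduced here.

```python
# Illustrative confidence fusion head: blend the two branches' depth maps
# with a learned per-pixel confidence. Layer sizes and the auxiliary feature
# input are assumptions, NOT the RDFC-GAN implementation.
import torch
import torch.nn as nn


class ConfidenceFusionHead(nn.Module):
    def __init__(self, feat_ch=64):
        super().__init__()
        # Takes both depth predictions (1 channel each) plus assumed shared
        # features, and outputs a confidence map in [0, 1].
        self.conf = nn.Sequential(
            nn.Conv2d(2 + feat_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, depth_local, depth_rgb, feats):
        w = self.conf(torch.cat([depth_local, depth_rgb, feats], dim=1))
        # Convex combination: w favours the encoder-decoder branch,
        # (1 - w) favours the RGB-to-depth translation branch.
        return w * depth_local + (1.0 - w) * depth_rgb


# Toy usage.
head = ConfidenceFusionHead()
d1 = torch.rand(1, 1, 120, 160)   # depth from the encoder-decoder branch
d2 = torch.rand(1, 1, 120, 160)   # depth from the RGB-depth fusion CycleGAN branch
f = torch.randn(1, 64, 120, 160)  # hypothetical shared intermediate features
depth_final = head(d1, d2, f)
```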