Multi-modal fusion is increasingly being used for autonomous driving tasks,
as images from different modalities provide unique information for feature
extraction. However, the existing two-stream networks are only fused at a
specific network layer, which requires a lot of manual attempts to set up. As
the CNN goes deeper, the two modal features become more and more advanced and
abstract, and the fusion occurs at the feature level with a large gap, which
can easily hurt the performance. In this study, we propose a novel fusion
architecture called skip-cross networks (SkipcrossNets), which combines
adaptively LiDAR point clouds and camera images without being bound to a
certain fusion epoch. Specifically, skip-cross connects each layer to each
layer in a feed-forward manner, and for each layer, the feature maps of all
previous layers are used as input and its own feature maps are used as input to
all subsequent layers for the other modality, enhancing feature propagation and
multi-modal features fusion. This strategy facilitates selection of the most
similar feature layers from two data pipelines, providing a complementary
effect for sparse point cloud features during fusion processes. The network is
also divided into several blocks to reduce the complexity of feature fusion and
the number of model parameters. The advantages of skip-cross fusion were
demonstrated through application to the KITTI and A2D2 datasets, achieving a
MaxF score of 96.85% on KITTI and an F1 score of 84.84% on A2D2. The model
parameters required only 2.33 MB of memory at a speed of 68.24 FPS, which could
be viable for mobile terminals and embedded devices