Scene parsing poses a significant challenge for real-time semantic segmentation.
Although traditional semantic segmentation networks have made remarkable
advances in segmentation accuracy, their inference speed remains
unsatisfactory. Moreover, this progress relies on fairly large networks and
powerful computational resources, and such large models are difficult to run
on edge computing devices with limited computing power, which poses a major
challenge for real-time semantic segmentation. In this
paper, we present the Cross-CBAM network, a novel lightweight network for
real-time semantic segmentation. Specifically, a Squeeze-and-Excitation Atrous
Spatial Pyramid Pooling Module (SE-ASPP) is proposed to capture variable
field-of-view and multi-scale information. We also propose a Cross Convolutional
Block Attention Module (CCBAM), in which a cross-multiply operation lets
high-level semantic information guide low-level detail information. Unlike
previous work, which uses attention to focus on the desired information in
the backbone, CCBAM uses cross-attention for feature fusion in the FPN
structure. Extensive experiments on the
Cityscapes and CamVid datasets demonstrate the effectiveness of the
proposed Cross-CBAM model by achieving a promising trade-off between
segmentation accuracy and inference speed. On the Cityscapes test set, we
achieve 73.4% mIoU at 240.9 FPS and 77.2% mIoU at 88.6 FPS on an NVIDIA GTX
1080Ti GPU.
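
The cross-multiply idea can be illustrated with a minimal sketch. The abstract does not specify the internal layers of CCBAM, so everything below is an assumption made for illustration: the helper names (`channel_gate`, `spatial_gate`, `cross_fuse`) are hypothetical, the gates are parameter-free stand-ins for the learned channel and spatial attention of a CBAM-style block, and the high-level feature map is assumed to be already upsampled and projected to the shape of the low-level one. Feature maps are plain nested lists indexed as [channel][row][column] so the example has no dependencies.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_gate(feat):
    # One weight per channel: sigmoid of that channel's global average.
    # (Assumption: a learned block would pass the pooled vector through an MLP.)
    return [sigmoid(sum(map(sum, ch)) / (len(ch) * len(ch[0]))) for ch in feat]


def spatial_gate(feat):
    # One weight per pixel: sigmoid of the mean over channels.
    # (Assumption: a learned block would apply a convolution here.)
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    return [[sigmoid(sum(feat[k][i][j] for k in range(c)) / c)
             for j in range(w)] for i in range(h)]


def cross_fuse(low, high):
    # Gates are computed from the HIGH-level map and multiplied onto the
    # LOW-level map, so semantic features decide which details are kept.
    # The gated details are then added back to the high-level map.
    cg = channel_gate(high)
    sg = spatial_gate(high)
    c, h, w = len(low), len(low[0]), len(low[0][0])
    return [[[low[k][i][j] * cg[k] * sg[i][j] + high[k][i][j]
              for j in range(w)] for i in range(h)] for k in range(c)]
```

In this sketch the attention weights and the features they modulate come from different pyramid levels, which is what distinguishes fusion-stage cross-attention from self-attention applied inside a backbone. The final addition is one possible way to merge the two branches; the source does not say how the gated features are combined.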