Integration transformer for ground-based cloud image segmentation

Abstract

Convolutional neural networks (CNNs) have recently dominated the ground-based cloud image segmentation task, but they neglect long-range dependencies because of the limited size of their filters. Although Transformer-based methods can overcome this limitation, they learn long-range dependencies at only a single scale and therefore fail to capture the multiscale information in cloud images. Multiscale information benefits ground-based cloud image segmentation because small-scale features tend to capture detailed information, while large-scale features capture global information. In this article, we propose a novel deep network named Integration Transformer (InTransformer), which builds long-range dependencies at different scales. To this end, we propose the hybrid multihead transformer block (HMTB) to learn multiscale long-range dependencies, and we hybridize CNNs and HMTBs as the encoder, which extracts multiscale representations that capture both local information and long-range dependencies at different scales. Meanwhile, to fuse patch tokens of different scales, we propose a mutual cross-attention module (MCAM) for the decoder of InTransformer, which lets multiscale patch tokens interact fully in a bidirectional way. We have conducted a series of experiments on two ground-based cloud detection databases: the large-scale TJNU Large Scale Cloud Detection Database (TLCDD) and the Singapore Whole sky IMaging SEGmentation Database (SWIMSEG). The experimental results show that our method outperforms other methods, demonstrating the effectiveness of the proposed InTransformer.
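The bidirectional token interaction mentioned above can be illustrated with a generic cross-attention sketch. The abstract does not specify the internals of MCAM, so the following is not the authors' module: it is a minimal single-head, scaled dot-product cross-attention applied in both directions between two token sets. The function names (`cross_attention`, `mutual_cross_attention`), the omission of learned projections, and the residual addition are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Each query token attends over all context tokens.

    Assumption: learned query/key/value projections are omitted
    (identity), so tokens in both sets must share one dimension.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in queries:
        # Scaled dot-product score of this query against every context token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        weights = softmax(scores)
        # Weighted sum of the context tokens.
        attended.append(
            [sum(w * tok[j] for w, tok in zip(weights, context)) for j in range(dim)]
        )
    return attended


def mutual_cross_attention(fine_tokens, coarse_tokens):
    """Hypothetical bidirectional fusion: fine tokens query coarse tokens
    and coarse tokens query fine tokens, each with a residual addition
    (the residual is an assumption, not taken from the abstract)."""
    fine_from_coarse = cross_attention(fine_tokens, coarse_tokens)
    coarse_from_fine = cross_attention(coarse_tokens, fine_tokens)
    fine_out = [[a + b for a, b in zip(t, u)]
                for t, u in zip(fine_tokens, fine_from_coarse)]
    coarse_out = [[a + b for a, b in zip(t, u)]
                  for t, u in zip(coarse_tokens, coarse_from_fine)]
    return fine_out, coarse_out
```

Both outputs keep the token count of their own scale, so each scale is enriched with information from the other without resampling. A practical implementation would add learned projections, multiple heads, and normalization.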
