Image dehazing is a representative low-level vision task that estimates
latent haze-free images from hazy images. In recent years, convolutional neural
network-based methods have dominated image dehazing. However, vision
Transformers, which has recently made a breakthrough in high-level vision
tasks, has not brought new dimensions to image dehazing. We start with the
popular Swin Transformer and find that several of its key designs are
unsuitable for image dehazing. To this end, we propose DehazeFormer, which
consists of various improvements, such as the modified normalization layer,
activation function, and spatial information aggregation scheme. We train
multiple variants of DehazeFormer on various datasets to demonstrate its
effectiveness. Specifically, on the most frequently used SOTS indoor set, our
small model outperforms FFA-Net with only 25% #Param and 5% computational cost.
To the best of our knowledge, our large model is the first method with the PSNR
over 40 dB on the SOTS indoor set, dramatically outperforming the previous
state-of-the-art methods. We also collect a large-scale realistic remote
sensing dehazing dataset for evaluating the method's capability to remove
highly non-homogeneous haze