Hybrid architectures that combine convolutional neural networks (CNNs) and
Transformers are very popular for medical image segmentation. However, they
suffer from two challenges. First, although a CNN branch can capture the
local image features using vanilla convolution, it cannot achieve adaptive
feature learning. Second, although a Transformer branch can capture the global
features, it ignores channel and cross-dimensional self-attention,
resulting in low segmentation accuracy on images with complex content. To address
these challenges, we propose a novel hybrid architecture of convolutional
neural networks hand in hand with vision Transformers (CiT-Net) for medical
image segmentation. Our network has two advantages. First, we design a dynamic
deformable convolution and apply it to the CNN branch, which overcomes the
weak feature extraction caused by fixed-size convolution kernels and the
rigid design of sharing kernel parameters across different inputs. Second, we
design a shifted-window adaptive complementary attention module and a compact
convolutional projection. We apply them to the Transformer branch to learn the
cross-dimensional long-term dependencies in medical images. Experimental results
show that our CiT-Net provides better medical image segmentation results than
popular SOTA methods. In addition, our CiT-Net requires fewer parameters and
lower computational cost and does not rely on pre-training. The code is publicly
available at https://github.com/SR0920/CiT-Net
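
To make the idea of not sharing kernel parameters across inputs concrete, the following is a minimal one-dimensional sketch of an input-dependent convolution kernel. It is not the CiT-Net implementation: the function names, the candidate kernels, and the gating scheme (a softmax over scores computed from the pooled input) are all assumptions chosen for illustration, and the deformable part (learned sampling offsets) is not shown.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_kernel(signal, candidate_kernels, gate_params):
    """Build one kernel per input by mixing candidate kernels.

    Hypothetical gating: each candidate gets a score w * mean(signal) + b,
    and the softmax of the scores gives the mixing weights.
    """
    pooled = sum(signal) / len(signal)  # global average pooling
    scores = [w * pooled + b for (w, b) in gate_params]
    alphas = softmax(scores)
    size = len(candidate_kernels[0])
    return [
        sum(a * kernel[i] for a, kernel in zip(alphas, candidate_kernels))
        for i in range(size)
    ]


def conv1d_valid(signal, kernel):
    """Sliding dot product without padding (cross-correlation, as in most DL code)."""
    size = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(size))
        for i in range(len(signal) - size + 1)
    ]


def dynamic_conv1d(signal, candidate_kernels, gate_params):
    """Convolve the signal with a kernel that depends on the signal itself."""
    kernel = dynamic_kernel(signal, candidate_kernels, gate_params)
    return conv1d_valid(signal, kernel)


# Illustrative values only: two candidate kernels and two gate parameter pairs.
CANDIDATES = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
GATES = [(1.0, 0.0), (-1.0, 0.0)]
```

With these illustrative values, an input with a positive mean is filtered mostly by the first candidate kernel and an input with a negative mean mostly by the second, whereas a vanilla convolution would apply the same fixed kernel to both.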