In the field of medical CT image processing, convolutional neural networks (CNNs) have been the dominant technique. Encoder-decoder CNNs utilise locality for efficiency, but they cannot properly model interactions between distant pixels. Recent research indicates that self-attention or transformer layers can be stacked to learn long-range dependencies efficiently. Transformers have been applied to computer vision by constructing image patches and processing them as embeddings. However, transformer-based architectures lack global semantic information interaction and require a large-scale training dataset, making them challenging to train with small data samples.
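To make the patch-embedding step concrete, below is a minimal PyTorch sketch of the standard ViT-style tokenisation that such architectures use; the class name, patch size, and dimensions are illustrative assumptions rather than details from this paper.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=1, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, N, D) token sequence

# Example: a single-channel 224x224 CT slice becomes 196 tokens of width 768.
tokens = PatchEmbed()(torch.randn(1, 1, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```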
To address these challenges, we present a hierarchical context attention transformer network (MHITNet) that combines multi-scale, transformer, and hierarchical context extraction modules in the skip connections. The multi-scale module captures deeper CT semantic information, enabling the transformers to encode feature maps of tokenized image patches from various CNN stages as input attention sequences more effectively.
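As an illustration of how feature maps from several CNN stages can be tokenized and encoded together by a transformer, the sketch below projects each stage to a common width, flattens the maps into token sequences, and encodes their concatenation. The class name, channel counts, and layer sizes are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiScaleTransformer(nn.Module):
    """Illustrative only: tokenize multi-stage CNN feature maps and
    encode them jointly with a transformer encoder."""
    def __init__(self, stage_channels=(64, 128, 256), dim=256, heads=8, layers=2):
        super().__init__()
        # 1x1 convolutions project every stage to a common token width.
        self.proj = nn.ModuleList(nn.Conv2d(c, dim, 1) for c in stage_channels)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, feats):            # feats: list of (B, C_i, H_i, W_i)
        tokens = [p(f).flatten(2).transpose(1, 2)   # each -> (B, H_i*W_i, dim)
                  for p, f in zip(self.proj, feats)]
        seq = torch.cat(tokens, dim=1)   # one sequence spanning all scales
        return self.encoder(seq)         # attention mixes tokens across stages

feats = [torch.randn(1, c, s, s) for c, s in [(64, 32), (128, 16), (256, 8)]]
print(MultiScaleTransformer()(feats).shape)  # torch.Size([1, 1344, 256])
```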
The hierarchical context attention module augments global information and reweights pixels to capture semantic context.
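The abstract does not detail this module's internals; one plausible reading, sketched below purely as an illustration, is a squeeze-and-excitation-style design in which pooled global context produces channel weights and a convolution produces per-pixel weights. Every name and hyperparameter here is hypothetical.

```python
import torch
import torch.nn as nn

class ContextReweight(nn.Module):
    """Hypothetical sketch: pool global context, then reweight the feature
    map channel-wise and pixel-wise (the paper's actual module may differ)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # global context vector per channel
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv2d(channels, 1, kernel_size=7, padding=3)

    def forward(self, x):                                  # x: (B, C, H, W)
        channel_w = torch.sigmoid(self.mlp(self.pool(x)))  # (B, C, 1, 1)
        x = x * channel_w                                  # channel reweighting
        spatial_w = torch.sigmoid(self.spatial(x))         # (B, 1, H, W)
        return x * spatial_w                               # per-pixel reweighting

out = ContextReweight(64)(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```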
Extensive experiments on three datasets show that the proposed MHITNet outperforms current state-of-the-art methods.