Recent advances in Vision Transformers (ViTs) have significantly enhanced
medical image segmentation by facilitating the learning of global
relationships. However, these methods struggle to capture diverse local and
global long-range sequential feature representations, a limitation that is
particularly evident in whole-body CT (WBCT) scans. To overcome this
limitation, we introduce Swin Soft Mixture Transformer (Swin SMT), a novel
architecture based on Swin UNETR. This model incorporates a Soft
Mixture-of-Experts (Soft MoE) to effectively handle complex and diverse
long-range dependencies. Soft MoE enables scaling up model parameters
while maintaining a balance between computational complexity and
segmentation performance in both training and inference modes. We evaluate Swin
SMT on the publicly available TotalSegmentator-V2 dataset, which includes 117
major anatomical structures in WBCT images. Comprehensive experimental results
demonstrate that Swin SMT outperforms several state-of-the-art methods in 3D
anatomical structure segmentation, achieving an average Dice Similarity
Coefficient of 85.09%. The code and pre-trained weights of Swin SMT are
publicly available at https://github.com/MI2DataLab/SwinSMT.

Comment: Accepted to MICCAI 2024 (early accept)
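For readers unfamiliar with the mechanism, the sketch below illustrates a generic Soft MoE layer in PyTorch, following Puigcerver et al. (2023): every slot is a soft, convex mixture of all input tokens (dispatch), and every token output is a soft mixture of all expert slot outputs (combine), so the layer remains dense and fully differentiable while adding expert capacity. The class name, hyperparameter defaults, and MLP expert design are illustrative assumptions and are not taken from the Swin SMT implementation.

```python
import torch
import torch.nn as nn


class SoftMoE(nn.Module):
    """Minimal Soft Mixture-of-Experts layer (Puigcerver et al., 2023).

    All hyperparameters and the MLP expert design here are illustrative
    assumptions, not the configuration used in Swin SMT.
    """

    def __init__(self, dim: int, num_experts: int = 4,
                 slots_per_expert: int = 1, hidden_mult: int = 4):
        super().__init__()
        # One learnable d-dimensional query per slot: (experts, slots, dim).
        self.slot_embeds = nn.Parameter(
            torch.randn(num_experts, slots_per_expert, dim) * dim ** -0.5)
        # Each expert is a standard transformer-style feed-forward MLP.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(dim, hidden_mult * dim),
                nn.GELU(),
                nn.Linear(hidden_mult * dim, dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); logits: (batch, tokens, experts, slots).
        logits = torch.einsum("bnd,esd->bnes", x, self.slot_embeds)
        # Dispatch weights: softmax over tokens, so each slot is a convex
        # mix of all tokens (no hard routing, fully differentiable).
        dispatch = logits.softmax(dim=1)
        # Combine weights: softmax over all expert slots for each token.
        combine = logits.flatten(start_dim=2).softmax(dim=-1).view_as(logits)
        # Form the slots and let each expert process its own slots.
        slots = torch.einsum("bnes,bnd->besd", dispatch, x)
        outs = torch.stack(
            [expert(slots[:, i]) for i, expert in enumerate(self.experts)],
            dim=1)
        # Mix expert outputs back into per-token representations.
        return torch.einsum("bnes,besd->bnd", combine, outs)


# Shape check: output matches input shape, while parameter count grows
# with the number of experts.
# layer = SoftMoE(dim=96, num_experts=4)
# assert layer(torch.randn(2, 1024, 96)).shape == (2, 1024, 96)
```

Because routing is soft rather than top-k, both dispatch and combine are dense matrix products, which is what lets parameter count grow with the number of experts without the training instabilities of hard token-to-expert assignment.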