In oncology research, accurate 3D segmentation of lesions from CT scans is
essential for the modeling of lesion growth kinetics. However, following the
RECIST criteria, radiologists routinely delineate each lesion only on the axial
slice showing the largest transverse area, and delineate only a small number of
lesions in 3D for research purposes. As a result, unlabeled 3D volumes and
labeled 2D images are plentiful while labeled 3D volumes are scarce, which makes
training a deep-learning 3D segmentation model a challenging task. In this
work, we propose a novel model, denoted a multi-dimension unified Swin
transformer (MDU-ST), for 3D lesion segmentation. The MDU-ST consists of a
Shifted-window transformer (Swin-transformer) encoder and a convolutional
neural network (CNN) decoder, allowing it to adapt to 2D and 3D inputs and
learn the corresponding semantic information in the same encoder. Based on this
model, we introduce a three-stage framework: 1) leveraging a large number of
unlabeled 3D lesion volumes through self-supervised pretext tasks to learn the
underlying pattern of lesion anatomy in the Swin-transformer encoder; 2)
fine-tuning the Swin-transformer encoder to perform 2D lesion segmentation with
2D RECIST slices to learn slice-level segmentation information; and 3) further
fine-tuning the Swin-transformer encoder to perform 3D lesion segmentation with
labeled 3D volumes. The network's performance is evaluated by the Dice
similarity coefficient (DSC) and Hausdorff distance (HD) using an internal 3D
lesion dataset with 593 lesions extracted from multiple anatomical locations.
The proposed MDU-ST demonstrates significant improvement over the competing
models. The proposed method can be used to conduct automated 3D lesion
segmentation to assist radiomics and tumor growth modeling studies. This paper
has been accepted by the IEEE International Symposium on Biomedical Imaging
(ISBI) 2023.
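
For readers unfamiliar with the two evaluation metrics named above, the following is a minimal sketch of how DSC and HD can be computed for a pair of binary masks. It is an illustration only, not the evaluation code used in this work: the masks are represented as sets of voxel coordinates, the function names `dice` and `hausdorff` are hypothetical, distances are in voxel units rather than millimetres, and the convention that two empty masks score a DSC of 1 is an assumption.

```python
import math


def dice(pred, truth):
    """Dice similarity coefficient between two sets of voxel coordinates."""
    if not pred and not truth:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))


def hausdorff(pred, truth):
    """Symmetric Hausdorff distance (voxel units) between two non-empty sets."""
    if not pred or not truth:
        raise ValueError("Hausdorff distance is undefined for an empty mask")

    def directed(a, b):
        # Largest distance from any point in a to its nearest point in b.
        return max(min(math.dist(p, q) for q in b) for p in a)

    return max(directed(pred, truth), directed(truth, pred))


# Toy example: two 2x2x1 blocks offset by one voxel along x.
truth = {(x, y, 0) for x in range(2) for y in range(2)}
pred = {(x + 1, y, 0) for x in range(2) for y in range(2)}
print(dice(pred, truth), hausdorff(pred, truth))
```

In this toy case the two blocks share half of their voxels and the farthest unmatched voxel lies one voxel from the other mask. The brute-force nearest-neighbour search is quadratic in the number of voxels, so a practical evaluation on full CT volumes would more likely restrict the computation to surface voxels and scale by the voxel spacing.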