Positron emission tomography (PET) is widely used in clinics and research due
to its quantitative merits and high sensitivity, but suffers from low
signal-to-noise ratio (SNR). Recently, convolutional neural networks (CNNs) have
been widely used to improve PET image quality. Though successful and efficient
in local feature extraction, CNNs cannot capture long-range dependencies well
due to their limited receptive fields. Global multi-head self-attention (MSA) is a
popular approach to capture long-range information. However, the calculation of
global MSA for 3D images has high computational costs. In this work, we
propose an efficient spatial and channel-wise encoder-decoder transformer,
Spach Transformer, that can leverage spatial and channel information based on
local and global MSAs. Experiments based on datasets of different PET tracers,
i.e., 18F-FDG, 18F-ACBC, 18F-DCFPyL, and 68Ga-DOTATATE,
were conducted to evaluate the proposed framework. Quantitative results show
that the proposed Spach Transformer can achieve better performance than other
reference methods.
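
To make the cost argument concrete, the following is a minimal sketch that counts attention-map entries (per head) for a 3D feature map under three layouts: global spatial attention over all voxels, local spatial attention inside non-overlapping windows, and channel-wise attention. The function name, the feature-map shape, the window size, and the channel count are illustrative assumptions, not values from the paper. It also assumes that channel-wise attention forms a C x C map across channels, which the text above does not specify.

```python
def attention_map_entries(shape, window, channels):
    """Count attention-score entries (per head) for three attention layouts.

    shape:    (D, H, W) of a hypothetical 3D feature map
    window:   (wd, wh, ww) of a hypothetical non-overlapping local window
    channels: hypothetical number of feature channels C
    """
    d, h, w = shape
    wd, wh, ww = window
    # The window is assumed to tile the volume exactly.
    assert d % wd == 0 and h % wh == 0 and w % ww == 0

    n = d * h * w                        # number of voxel tokens
    tokens_per_window = wd * wh * ww
    num_windows = n // tokens_per_window

    return {
        # Every voxel attends to every voxel: N x N map.
        "global_spatial": n * n,
        # Each window has its own small map: equals N * tokens_per_window.
        "local_spatial": num_windows * tokens_per_window ** 2,
        # Assumed C x C map across channels, independent of N.
        "channel_wise": channels * channels,
    }


# Illustrative numbers only.
counts = attention_map_entries(shape=(64, 64, 64), window=(4, 4, 4), channels=96)
for name, value in counts.items():
    print(f"{name}: {value:,}")
```

Under these assumed sizes the global map grows quadratically with the number of voxels, while the local map grows linearly and the channel-wise map does not depend on the volume size at all, which is the motivation for combining local and channel-wise attention instead of computing global spatial MSA directly.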