Multi-modal imaging is a key healthcare technology that is often
underutilized due to costs associated with multiple separate scans. This
limitation creates a need to synthesize unacquired modalities from the
subset of available modalities. In recent years, generative adversarial network
(GAN) models with superior depiction of structural details have been
established as state-of-the-art in numerous medical image synthesis tasks. GANs
are characteristically based on convolutional neural network (CNN) backbones
that perform local processing with compact filters. This inductive bias in turn
compromises learning of contextual features. Here, we propose a novel
generative adversarial approach for medical image synthesis, ResViT, to combine
local precision of convolution operators with contextual sensitivity of vision
transformers. ResViT employs a central bottleneck comprising novel aggregated
residual transformer (ART) blocks that synergistically combine convolutional
and transformer modules. Comprehensive demonstrations are performed for
synthesizing missing sequences in multi-contrast MRI, and CT images from MRI.
Our results indicate the superiority of ResViT over competing methods in terms
of qualitative observations and quantitative metrics.
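
As a rough illustration of pairing a global attention step with a local convolutional step inside a residual block, the following toy Python sketch operates on a short sequence of feature vectors. It is not the ResViT implementation: the function names (`self_attention`, `local_conv`, `toy_hybrid_block`), the fixed smoothing kernel, and the absence of learned weights are all simplifying assumptions made here for illustration.

```python
import math


def self_attention(tokens):
    """Global mixing: each token becomes a softmax-weighted average of all tokens.

    Assumption: queries, keys, and values are the raw tokens (no learned projections).
    """
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        peak = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * value[c] for w, value in zip(weights, tokens)) / total
                    for c in range(dim)])
    return out


def local_conv(tokens, kernel=(0.25, 0.5, 0.25)):
    """Local mixing: a 3-tap filter along the sequence, with replicated edges.

    Assumption: a fixed smoothing kernel stands in for learned compact filters.
    """
    n = len(tokens)
    dim = len(tokens[0])
    out = []
    for i in range(n):
        left = tokens[max(i - 1, 0)]
        mid = tokens[i]
        right = tokens[min(i + 1, n - 1)]
        out.append([kernel[0] * left[c] + kernel[1] * mid[c] + kernel[2] * right[c]
                    for c in range(dim)])
    return out


def toy_hybrid_block(tokens):
    """Residual block: inject global context, filter locally, add the skip path."""
    context = self_attention(tokens)
    mixed = [[x + g for x, g in zip(tok, ctx)] for tok, ctx in zip(tokens, context)]
    local = local_conv(mixed)
    return [[x + y for x, y in zip(tok, loc)] for tok, loc in zip(tokens, local)]


if __name__ == "__main__":
    features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    for row in toy_hybrid_block(features):
        print([round(v, 3) for v in row])
```

The output has the same shape as the input, so blocks of this kind can be stacked; the skip connection lets each block refine its input rather than replace it.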