ResViT: Residual vision transformers for multi-modal medical image synthesis

Abstract

Multi-modal imaging is a key healthcare technology that is often underutilized due to costs associated with multiple separate scans. This limitation yields the need for synthesis of unacquired modalities from the subset of available modalities. In recent years, generative adversarial network (GAN) models with superior depiction of structural details have been established as state-of-the-art in numerous medical image synthesis tasks. GANs are characteristically based on convolutional neural network (CNN) backbones that perform local processing with compact filters. This inductive bias in turn compromises learning of contextual features. Here, we propose a novel generative adversarial approach for medical image synthesis, ResViT, to combine local precision of convolution operators with contextual sensitivity of vision transformers. ResViT employs a central bottleneck comprising novel aggregated residual transformer (ART) blocks that synergistically combine convolutional and transformer modules. Comprehensive demonstrations are performed for synthesizing missing sequences in multi-contrast MRI, and CT images from MRI. Our results indicate superiority of ResViT against competing methods in terms of qualitative observations and quantitative metrics

    Similar works

    Full text

    thumbnail-image

    Available Versions