    Multimodal and disentangled representation learning for medical image analysis

    Automated medical image analysis is a growing research field with various applications in modern healthcare. Furthermore, a multitude of imaging techniques (or modalities) have been developed, such as Magnetic Resonance (MR) and Computed Tomography (CT), to accentuate different organ characteristics. Research on image analysis is predominantly driven by deep learning methods due to their demonstrated performance. In this thesis, we argue that their success and generalisation rely on learning good latent representations. We propose methods for learning spatial representations that are suitable for medical image data and can combine information coming from different modalities. Specifically, we aim to improve cardiac MR segmentation, a challenging task due to varied images and limited expert annotations, by considering complementary information present in (potentially unaligned) images of other modalities. In order to evaluate the benefit of multimodal learning, we initially consider a synthesis task on spatially aligned multimodal brain MR images. We propose a deep network of multiple encoders and decoders, which we demonstrate outperforms existing approaches. The encoders (one per input modality) map the multimodal images into modality-invariant spatial feature maps. Common and unique information is combined into a fused representation that is robust to missing modalities and can be decoded into synthetic images of the target modalities. Different experimental settings demonstrate the benefit of multimodal over unimodal synthesis, although input and output image pairs are required for training. The need for paired images can be overcome with the cycle consistency principle, which we use in conjunction with adversarial training to transform images from one modality (e.g. MR) to images in another (e.g. CT). This is especially useful in cardiac datasets, where different spatial and temporal resolutions make image pairing difficult, if not impossible.
Segmentation can also be considered as a form of image synthesis, if one modality consists of semantic maps. We consider the task of extracting segmentation masks for cardiac MR images, and aim to overcome the challenge of limited annotations by taking into account unannotated images, which are commonly ignored. We achieve this by defining suitable latent spaces, which represent the underlying anatomies (spatial latent variable), as well as the imaging characteristics (non-spatial latent variable). Anatomical information is required for tasks such as segmentation and regression, whereas imaging information can capture variability in intensity characteristics, for example due to different scanners. We propose two models that disentangle cardiac images at different levels: the first extracts the myocardium from the surrounding information, whereas the second fully separates the anatomical from the imaging characteristics. Experimental analysis confirms the utility of disentangled representations in semi-supervised segmentation, and in regression of cardiac indices, while maintaining robustness to intensity variations such as the ones induced by different modalities. Finally, our prior research is aggregated into one framework that encodes multimodal images into disentangled anatomical and imaging factors. Several challenges of multimodal cardiac imaging, such as input misalignments and the lack of expert annotations, are successfully handled in the shared anatomy space. Furthermore, we demonstrate that this approach can be used to combine complementary anatomical information for the purpose of multimodal segmentation. This can be achieved even when no annotations are provided for one of the modalities. This thesis creates new avenues for further research in the area of multimodal and disentangled learning with spatial representations, which we believe are key to more generalised deep learning solutions in healthcare.
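
The cycle consistency principle mentioned in this abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: `mr_to_ct` and `ct_to_mr` are hypothetical stand-ins (simple invertible intensity transforms) for learned generators, and the L1 form of the penalty is an assumption that is common in cycle-consistent models.

```python
import numpy as np

def mr_to_ct(x):
    # Hypothetical stand-in for a learned MR -> CT generator.
    return 2.0 * x + 1.0

def ct_to_mr(y):
    # Hypothetical stand-in for a learned CT -> MR generator (exact inverse here).
    return (y - 1.0) / 2.0

def cycle_consistency_loss(x, forward, backward):
    """Mean absolute error between x and its round-trip reconstruction."""
    reconstructed = backward(forward(x))
    return float(np.mean(np.abs(x - reconstructed)))

rng = np.random.default_rng(0)
mr_image = rng.random((8, 8))  # toy "MR" image, not real data

# A consistent pair of mappings reconstructs the input almost exactly.
loss_exact = cycle_consistency_loss(mr_image, mr_to_ct, ct_to_mr)
# Replacing the backward mapping with the identity breaks the cycle.
loss_broken = cycle_consistency_loss(mr_image, mr_to_ct, lambda y: y)
```

In this toy setting the loss is near zero only when the backward mapping undoes the forward one, which is what allows training without paired images.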

    Data synthesis and adversarial networks: A review and meta-analysis in cancer imaging

    Despite technological and medical advances, the detection, interpretation, and treatment of cancer based on imaging data continue to pose significant challenges. These include inter-observer variability, class imbalance, dataset shifts, inter- and intra-tumour heterogeneity, malignancy determination, and treatment effect uncertainty. Given the recent advancements in image synthesis, Generative Adversarial Networks (GANs), and adversarial training, we assess the potential of these technologies to address a number of key challenges of cancer imaging. We categorise these challenges into (a) data scarcity and imbalance, (b) data access and privacy, (c) data annotation and segmentation, (d) cancer detection and diagnosis, and (e) tumour profiling, treatment planning and monitoring. Based on our analysis of 164 publications that apply adversarial training techniques in the context of cancer imaging, we highlight multiple underexplored solutions with research potential. We further contribute the Synthesis Study Trustworthiness Test (SynTRUST), a meta-analysis framework for assessing the validation rigour of medical image synthesis studies. SynTRUST is based on 26 concrete measures of thoroughness, reproducibility, usefulness, scalability, and tenability. Based on SynTRUST, we analyse 16 of the most promising cancer imaging challenge solutions and observe a high validation rigour in general, but also several desirable improvements. With this work, we strive to bridge the gap between the needs of the clinical cancer imaging community and the current and prospective research on data synthesis and adversarial networks in the artificial intelligence community.

    Deep generative models for medical image synthesis and strategies to utilise them

    Medical imaging has revolutionised the diagnosis and treatments of diseases since the first medical image was taken using X-rays in 1895. As medical imaging became an essential tool in a modern healthcare system, more medical imaging techniques have been invented, such as Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), Computed Tomography (CT), Ultrasound, etc. With the advance of medical imaging techniques, the demand for processing and analysing these complex medical images is increasing rapidly. Efforts have been put on developing approaches that can automatically analyse medical images. With the recent success of deep learning (DL) in computer vision, researchers have applied and proposed many DL-based methods in the field of medical image analysis. However, one problem with data-driven DL-based methods is the lack of data. Unlike natural images, medical images are more expensive to acquire and label. One way to alleviate the lack of medical data is medical image synthesis. In this thesis, I first start with pseudo healthy synthesis, which is to create a ‘healthy’ looking medical image from a pathological one. The synthesised pseudo healthy images can be used for the detection of pathology, segmentation, etc. Several challenges exist with this task. The first challenge is the lack of ground-truth data, as a subject cannot be healthy and diseased at the same time. The second challenge is how to evaluate the generated images. In this thesis, I propose a deep learning method to learn to generate pseudo healthy images with adversarial and cycle consistency losses to overcome the lack of ground-truth data. I also propose several metrics to evaluate the quality of synthetic ‘healthy’ images. Pseudo healthy synthesis can be viewed as transforming images between discrete domains, e.g. from pathological domain to healthy domain. However, there are some changes in medical data that are continuous, e.g. brain ageing progression. 
The brain changes with age. With the ageing global population, research on brain ageing has attracted increasing attention. In this thesis, I propose a deep learning method that can simulate such brain ageing progression. Longitudinal brain data are not easy to acquire and, where they exist, they cover only several years. Thus, the proposed method focuses on learning subject-specific brain ageing progression without training on longitudinal data. As there are other factors, such as neurodegenerative diseases, that can affect brain ageing, the proposed model also considers health status, i.e. the existence of Alzheimer’s Disease (AD). Furthermore, to evaluate the quality of synthetic aged images, I define several metrics and conduct a series of experiments. Suppose we have a pre-trained deep generative model and a downstream task model, say a classifier. One question is how to make the best of the generative model to improve the performance of the classifier. In this thesis, I propose a simple procedure that can discover the ‘weakness’ of the classifier and guide the generator to synthesise counterfactuals (synthetic data) that are hard for the classifier. The proposed procedure constructs an adversarial game between generative factors of the generator and the classifier. We demonstrate the effectiveness of this proposed procedure through a series of experiments. Furthermore, we consider the application of generative models in a continual learning context and investigate their usefulness in alleviating spurious correlations. This thesis creates new avenues for further research in the area of medical image synthesis and how to utilise medical generative models, which we believe could be important for future studies in medical image analysis with deep learning.

    Generating virtual patients of high-resolution MR angiography from non-angiographic multi-contrast MRIs for In-silico trials

    Despite success on multi-contrast MR image synthesis, generating specific modalities remains challenging. Those include Magnetic Resonance Angiography (MRA) that highlights details of vascular anatomy using specialised imaging sequences for emphasising inflow effect. This work proposes an end-to-end generative adversarial network that can synthesise anatomically plausible, high-resolution 3D MRA images using commonly acquired multi-contrast MR images (e.g. T1/T2/PD-weighted MR images) for the same subject whilst preserving the continuity of vascular anatomy. A reliable technique for MRA synthesis would unleash the research potential of very few population databases with imaging modalities (such as MRA) that enable quantitative characterisation of whole-brain vasculature. Our work is motivated by the need to generate digital twins and virtual patients of cerebrovascular anatomy for in-silico studies and/or in-silico trials. We propose a dedicated generator and discriminator that leverage the shared and complementary features of multi-source images. We design a composite loss function for emphasising vascular properties by minimising the statistical difference between the feature representations of the target images and the synthesised outputs in both 3D volumetric and 2D projection domains. Experimental results show that the proposed method can synthesise high-quality MRA images and outperform the state-of-the-art generative models both qualitatively and quantitatively. The importance assessment reveals that T2 and PD-weighted images are better predictors of MRA images than T1; and PD-weighted images contribute to better visibility of small vessel branches towards the peripheral regions. In addition, the proposed approach can generalise to unseen data acquired at different imaging centres with different scanners, whilst synthesising MRAs and vascular geometries that maintain vessel continuity. 
The results show the potential of the proposed approach for generating digital twin cohorts of cerebrovascular anatomy at scale from structural MR images typically acquired in population imaging initiatives.
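
The idea of comparing target and synthesised images in both the 3D volumetric and 2D projection domains can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's loss: it compares raw intensities with an L1 distance rather than feature representations, it assumes axis-aligned maximum intensity projections as the 2D projections, and `projection_weight` is a hypothetical parameter.

```python
import numpy as np

def max_intensity_projections(volume):
    """Return the three axis-aligned maximum intensity projections of a 3D volume."""
    return [volume.max(axis=axis) for axis in range(3)]

def volume_and_projection_loss(target, synthesised, projection_weight=1.0):
    """L1 difference in the 3D domain plus mean L1 difference over the 2D projections."""
    volumetric_term = np.mean(np.abs(target - synthesised))
    projection_term = np.mean([
        np.mean(np.abs(t - s))
        for t, s in zip(max_intensity_projections(target),
                        max_intensity_projections(synthesised))
    ])
    return float(volumetric_term + projection_weight * projection_term)

# Toy example: a straight bright "vessel" and a copy with a four-voxel gap.
target = np.zeros((16, 16, 16))
target[8, 8, :] = 1.0
broken = target.copy()
broken[8, 8, 6:10] = 0.0

loss_identical = volume_and_projection_loss(target, target)
loss_gap = volume_and_projection_loss(target, broken)
```

In the toy example, the gap changes only 4 of 4096 voxels, so the volumetric term is small, while two of the three projections show a clearly broken line. This suggests why a projection term can emphasise thin structures such as vessels.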

    A systematic literature review: deep learning techniques for synthetic medical image generation and their applications in radiotherapy

    The aim of this systematic review is to determine whether Deep Learning (DL) algorithms can provide a clinically feasible alternative to classic algorithms for synthetic Computed Tomography (sCT). The following categories are presented in this study: (i) MR-based treatment planning and synthetic CT generation techniques; (ii) generation of synthetic CT images based on Cone Beam CT images; (iii) low-dose CT to high-dose CT generation; and (iv) attenuation correction for PET images. To perform appropriate database searches, we reviewed journal articles published between January 2018 and June 2023. Current methodology, study strategies, and results with relevant clinical applications were analyzed as we outlined the state of the art of deep learning based approaches to inter-modality and intra-modality image synthesis. This was accomplished by contrasting the provided methodologies with traditional research approaches. The key contributions of each category were highlighted, specific challenges were identified, and accomplishments were summarized. As a final step, the statistics of all the cited works were analyzed from various aspects, which revealed that DL-based sCTs have achieved considerable popularity, while also showing the potential of this technology. In order to assess the clinical readiness of the presented methods, we examined the current status of DL-based sCT generation.

    Ea-GANs: Edge-Aware Generative Adversarial Networks for Cross-Modality MR Image Synthesis

    Magnetic resonance (MR) imaging is a widely used medical imaging protocol that can be configured to provide different contrasts between the tissues in the human body. By setting different scanning parameters, each MR imaging modality reflects the unique visual characteristics of the scanned body part, benefiting the subsequent analysis from multiple perspectives. To utilize the complementary information from multiple imaging modalities, cross-modality MR image synthesis has aroused increasing research interest recently. However, most existing methods only focus on minimizing pixel/voxel-wise intensity difference but ignore the textural details of image content structure, which affects the quality of synthesized images. In this paper, we propose edge-aware generative adversarial networks (Ea-GANs) for cross-modality MR image synthesis. Specifically, we integrate edge information, which reflects the textural structure of image content and depicts the boundaries of different objects in images, to reduce this gap. Corresponding to different learning strategies, two frameworks are proposed, i.e., a generator-induced Ea-GAN (gEa-GAN) and a discriminator-induced Ea-GAN (dEa-GAN). The gEa-GAN incorporates the edge information via its generator, while the dEa-GAN further does this from both the generator and the discriminator so that the edge similarity is also adversarially learned. In addition, the proposed Ea-GANs are 3D-based and utilize hierarchical features to capture contextual information. The experimental results demonstrate that the proposed Ea-GANs, especially the dEa-GAN, outperform multiple state-of-the-art methods for cross-modality MR image synthesis in both qualitative and quantitative measures. Moreover, the dEa-GAN also shows excellent generality to generic image synthesis tasks on benchmark datasets about facades, maps, and cityscapes.
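
The difference between a purely intensity-based objective and an edge-aware one can be sketched in a few lines. The example below is an illustration under assumptions, not the Ea-GAN implementation: it works in 2D rather than 3D, it assumes a Sobel operator as the edge extractor (the abstract does not name one), it omits the adversarial terms entirely, and `edge_weight` is a hypothetical parameter.

```python
import numpy as np

def sobel_edge_map(image):
    """Sobel gradient magnitude on the interior pixels of a 2D image."""
    gx = (image[:-2, 2:] + 2 * image[1:-1, 2:] + image[2:, 2:]) \
       - (image[:-2, :-2] + 2 * image[1:-1, :-2] + image[2:, :-2])
    gy = (image[2:, :-2] + 2 * image[2:, 1:-1] + image[2:, 2:]) \
       - (image[:-2, :-2] + 2 * image[:-2, 1:-1] + image[:-2, 2:])
    return np.sqrt(gx ** 2 + gy ** 2)

def edge_aware_l1(target, synthesised, edge_weight=1.0):
    """Pixel-wise L1 term plus an L1 term between the two edge maps."""
    intensity_term = np.mean(np.abs(target - synthesised))
    edge_term = np.mean(np.abs(sobel_edge_map(target) - sobel_edge_map(synthesised)))
    return float(intensity_term + edge_weight * edge_term)

# Toy images: a vertical step edge, a brightness-shifted copy, and a flat image.
step = np.zeros((8, 8))
step[:, 4:] = 1.0
shifted = step + 0.5          # same structure, different brightness
flat = np.full((8, 8), 0.5)   # same mean intensity error, but no structure

loss_shifted = edge_aware_l1(step, shifted, edge_weight=10.0)
loss_flat_plain = edge_aware_l1(step, flat, edge_weight=0.0)
loss_flat_edge = edge_aware_l1(step, flat, edge_weight=1.0)
```

The shifted image and the flat image have the same pixel-wise error against the step image, yet only the flat image, which has lost the boundary, is penalised by the edge term.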

    Generative AI for computational pathology

    This thesis investigates generative AI algorithms for enhancing diagnostic accuracy and research in computational histopathology, covering image generation techniques including high-resolution tissue images, synthetic histology data, and challenges in large-scale image manipulation. In the field of computational pathology, where histology images are large in size and visual context is crucial, synthesis of large high-resolution images via generative modeling is a challenging task due to memory and computational constraints. To address this challenge, we propose a novel framework called SAFRON (Stitching Across the FROntiers Network). We use the proposed framework for generating, to the best of our knowledge, the largest-sized synthetic histology images to date (up to 11K×8K pixels). We quantitatively assess these images through Frechet Inception Distance and pathologist realism scores. We also propose SynCLay, an interactive framework that generates histology images from cellular layouts, aiding the study of cell roles present in the tumor microenvironment. Coupled with a parametric model, it generates colon images and cellular counts based on the grade of differentiation and cellularities of different cells. We also show that augmenting limited real data with the synthetic data generated by our framework can significantly boost prediction performance of the cellular composition prediction task. Next, we propose TIGGLay (Tissue Image Generation from Glandular Layout) framework that generates realistic colorectal cancer histology images and masks from glandular layouts. We show the appearance of glands can be controlled by user inputs such as number of glands, their locations and sizes. Moreover, the thesis provides a methodology for constructing realistic glandular masks using vector quantized variational autoencoder based frameworks such as latent diffusion models. Finally, we discuss limitations and future research directions. 
These involve extending the frameworks to generate synthetic images for varied carcinoma types and tissues, as well as generating whole-slide images based on known parameters.
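
For readers unfamiliar with the Frechet Inception Distance used to assess the synthetic images, the underlying Frechet distance between two Gaussians fitted to feature vectors can be sketched as below. This is a generic sketch, not the thesis's evaluation code: in practice the features come from a pretrained Inception network, whereas here random vectors stand in for them, and the function name is hypothetical.

```python
import numpy as np

def frechet_distance(features_a, features_b):
    """Frechet distance between Gaussians fitted to two sets of feature vectors.

    Computes ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2)).
    """
    mu_a, mu_b = features_a.mean(axis=0), features_b.mean(axis=0)
    cov_a = np.cov(features_a, rowvar=False)
    cov_b = np.cov(features_b, rowvar=False)
    # The trace of the matrix square root equals the sum of the square roots
    # of the eigenvalues of C_a C_b, which are real and non-negative for
    # covariance matrices (up to numerical error, hence the clipping).
    eigenvalues = np.linalg.eigvals(cov_a @ cov_b).real
    trace_sqrt = np.sqrt(np.clip(eigenvalues, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real_features = rng.normal(size=(200, 4))   # stand-in for Inception features
shifted_features = real_features + 3.0      # same covariance, shifted mean

distance_same = frechet_distance(real_features, real_features)
distance_shifted = frechet_distance(real_features, shifted_features)
```

Identical feature sets give a distance of approximately zero, and shifting every one of the 4 feature dimensions by 3 adds 4 × 3² = 36 through the mean term alone.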

    Multimodal Adversarial Learning

    Deep Convolutional Neural Networks (DCNN) have proven to be an exceptional tool for object recognition, generative modelling, and multi-modal learning in various computer vision applications. However, recent findings have shown that such state-of-the-art models can be easily deceived by inserting slight imperceptible perturbations to key pixels in the input. A good target detection system can accurately identify targets by localizing their coordinates on the input image of interest. This is ideally achieved by labeling each pixel in an image as a background or a potential target pixel. However, prior research confirms that such state-of-the-art target detection models are susceptible to adversarial attacks. In the case of generative models, facial sketches drawn by artists, which are mostly used by law enforcement agencies, depend on the ability of the artist to clearly replicate all the key facial features that aid in capturing the true identity of a subject. Recent works have attempted to synthesize these sketches into plausible visual images to improve visual recognition and identification. However, synthesizing photo-realistic images from sketches proves to be an even more challenging task, especially for sensitive applications such as suspect identification. Nevertheless, hybrid discriminators, which perform attribute classification of multiple target attributes, and a quality-guided encoder, which minimizes the perceptual dissimilarity of the latent space embeddings of the synthesized and real images at different layers in the network, have been shown to be powerful tools towards better multi-modal learning techniques. In general, our overall approach was aimed at improving target detection systems and the visual appeal of synthesized images while incorporating multiple attribute assignment to the generator without compromising the identity of the synthesized image.
We synthesized sketches using the XDoG filter for the CelebA, Multi-modal and CelebA-HQ datasets and from an auxiliary generator trained on sketches from the CUHK, IIT-D and FERET datasets. Overall, our results across the different model applications compare favourably with the current state of the art.

    Deep Learning for Music Information Retrieval in Limited Data Scenarios.

    PhD Thesis. While deep learning (DL) models have achieved impressive results in settings where large amounts of annotated training data are available, overfitting often degrades performance when data is more limited. To improve the generalisation of DL models, we investigate "data-driven priors" that exploit additional unlabelled data or labelled data from related tasks. Unlike techniques such as data augmentation, these priors are applicable across a range of machine listening tasks, since their design does not rely on problem-specific knowledge. We first consider scenarios in which parts of samples can be missing, aiming to make more datasets available for model training. In an initial study focusing on audio source separation (ASS), we exploit additionally available unlabelled music and solo source recordings by using generative adversarial networks (GANs), resulting in higher separation quality. We then present a fully adversarial framework for learning generative models with missing data. Our discriminator consists of separately trainable components that can be combined to train the generator with the same objective as in the original GAN framework. We apply our framework to image generation, image segmentation and ASS, demonstrating superior performance compared to the original GAN. To improve performance on any given MIR task, we also aim to leverage datasets which are annotated for similar tasks. We use multi-task learning (MTL) to perform singing voice detection and singing voice separation with one model, improving performance on both tasks. Furthermore, we employ meta-learning on a diverse collection of ten MIR tasks to find a weight initialisation for a "universal MIR model" so that training the model on any MIR task with this initialisation quickly leads to good performance.
Since our data-driven priors encode knowledge shared across tasks and datasets, they are suited for high-dimensional, end-to-end models, instead of small models relying on task-specific feature engineering, such as fixed spectrogram representations of audio commonly used in machine listening. To this end, we propose "Wave-U-Net", an adaptation of the U-Net, which can perform ASS directly on the raw waveform while performing favourably to its spectrogram-based counterpart. Finally, we derive "Seq-U-Net" as a causal variant of Wave-U-Net, which performs comparably to Wavenet and Temporal Convolutional Network (TCN) on a variety of sequence modelling tasks, while being more computationally efficient.
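
What "causal" means for a convolutional sequence model can be shown with a small sketch. This is not Seq-U-Net or Wave-U-Net code; it only illustrates, under the assumption of a single 1D convolution with left zero-padding, the property that each output sample depends on current and past inputs only. The function name is hypothetical.

```python
import numpy as np

def causal_conv1d(signal, kernel):
    """output[t] = sum_k kernel[k] * signal[t - k], with zeros before the start."""
    width = len(kernel)
    # Left-pad so that the window ending at time t never reaches into the future.
    padded = np.concatenate([np.zeros(width - 1), signal])
    return np.array([
        np.dot(kernel, padded[t:t + width][::-1])
        for t in range(len(signal))
    ])

signal = np.arange(6, dtype=float)
kernel = np.array([0.5, 0.5])  # average of the current and previous sample

output = causal_conv1d(signal, kernel)

# Changing only the last (most "future") sample must leave earlier outputs intact.
perturbed = signal.copy()
perturbed[-1] = 100.0
output_perturbed = causal_conv1d(perturbed, kernel)
```

Because only left padding is used, perturbing the final input sample alters the final output sample and nothing before it, which is the property that makes such a model usable for step-by-step sequence prediction.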

    Image Data Augmentation from Small Training Datasets Using Generative Adversarial Networks (GANs)

    The scarcity of labelled data is a serious problem since deep models generally require a large amount of training data to achieve desired performance. Data augmentation is widely adopted to enhance the diversity of original datasets and further improve the performance of deep learning models. Learning-based methods, compared to traditional techniques, are specialized in feature extraction, which enhances the effectiveness of data augmentation. Generative adversarial networks (GANs), one of the learning-based generative models, have made remarkable advances in data synthesis. However, GANs still face many challenges in generating high-quality augmented images from small datasets because learning-based generative methods struggle to produce reliable outcomes without sufficient training data. This difficulty limits data augmentation applications that use learning-based methods. In this thesis, to tackle the problem of labelled data scarcity and the training difficulty of augmenting image data from small datasets, three novel GAN models suitable for training with a small number of training samples are proposed, based on three different mapping relationships between the input and output images: one-to-many mapping, one-to-one mapping, and many-to-many mapping. The proposed GANs employ limited training data, such as a small number of images and limited conditional features, and the synthetic images they generate are expected to offer not only high generative quality but also desirable data diversity. To evaluate the effectiveness of the augmented images generated by the proposed models, inception distances and human perception methods are adopted. Additionally, different image classification tasks were carried out, and the accuracies obtained using the original datasets and the augmented datasets were compared.
Experimental results show that the image classification performance of convolutional neural networks, i.e., AlexNet, GoogLeNet, ResNet and VGGNet, is consistently enhanced, and that the scale of improvement is significant when a small number of training samples are involved.