Unsupervised Controllable Generation with Self-Training
Recent generative adversarial networks (GANs) are able to generate impressive
photo-realistic images. However, controllable generation with GANs remains a
challenging research problem. Achieving controllable generation requires
semantically interpretable and disentangled factors of variation. It is
challenging to achieve this goal using simple fixed distributions such as the
Gaussian distribution. Instead, we propose an unsupervised framework to learn a
distribution of latent codes that control the generator through self-training.
Self-training provides iterative feedback during GAN training, from the
discriminator to the generator, and progressively improves the proposed
latent codes as training proceeds. The latent codes are sampled from a latent
variable model that is learned in the feature space of the discriminator. We
consider a normalized independent component analysis model and learn its
parameters through tensor factorization of the higher-order moments. Our
framework exhibits better disentanglement compared to other variants such as
the variational autoencoder, and is able to discover semantically meaningful
latent codes without any supervision. We demonstrate empirically on both cars
and faces datasets that each group of elements in the learned code controls a
mode of variation with a semantic meaning, e.g., pose or background change. We
also demonstrate with quantitative metrics that our method generates better
results than other approaches.
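As a rough illustration of the self-training loop described above, the sketch below fits a simple latent-variable model on discriminator features of generated samples and resamples latent codes from it each round. This is a minimal sketch, not the authors' implementation: it substitutes a plain whitening/PCA step for the normalized-ICA model learned via tensor factorization of higher-order moments, and the names (`fit_latent_model`, `sample_codes`, the commented training loop) are assumptions.

```python
# Minimal sketch of the self-training feedback loop (assumed names; PCA stands
# in for the paper's normalized-ICA / higher-order-moment factorization step).
import torch

def fit_latent_model(features, k=10):
    """Toy stand-in for the latent-variable model fitted in the
    discriminator's feature space: whiten and keep the top-k directions."""
    mu = features.mean(0, keepdim=True)
    x = features - mu
    # SVD gives principal directions; the paper instead factorizes
    # higher-order moment tensors to obtain independent components.
    _, s, v = torch.linalg.svd(x, full_matrices=False)
    return mu, v[:k], s[:k] / (len(x) ** 0.5)

def sample_codes(mu, directions, scales, n):
    """Draw latent codes with independent coefficients along the
    learned directions."""
    coeffs = torch.randn(n, directions.shape[0]) * scales
    return mu + coeffs @ directions

# Self-training loop (schematic):
# for round in range(num_rounds):
#     z = sample_codes(mu, directions, scales, batch_size)
#     ... train the GAN (G, D) for a few steps with codes z ...
#     feats = D.features(G(z))           # discriminator feature space
#     mu, directions, scales = fit_latent_model(feats)
```

The essential point is the feedback path: latent codes are proposed by a model fitted in the discriminator's feature space, and that model is refitted as GAN training proceeds.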
Audio-Visual Learning for Scene Understanding
Multimodal deep learning aims at combining the complementary information of different modalities. Among all modalities, audio and video are the predominant ones that humans use to explore the world. In this thesis, we focus on audio-visual deep learning so that our networks mimic how humans perceive the world.
Our research involves images, audio signals, and acoustic images. The latter provide spatial audio information and are obtained from a planar microphone array whose raw signals are combined with a beamforming algorithm. Acoustic images better approximate the human auditory system, which cannot be replicated with a single microphone, since one microphone alone gives no spatial sound cues.
However, since microphone arrays are not widespread, we also study how to handle the missing spatialized audio modality at test time.
As a solution, we propose to distill acoustic-image content into audio features during training in order to handle their absence at test time. This is done for supervised audio classification using the generalized distillation framework, which we also extend to self-supervised learning.
Next, we devise a method for reconstructing acoustic images given a single microphone signal and an RGB frame. Therefore, even when only a standard video is available, we are able to synthesize spatial audio, which is useful for many audio-visual tasks, including sound localization.
Lastly, as another example of restoring one modality from the available ones, we inpaint degraded images with the help of audio features, reconstructing the missing region so that it is not only visually plausible but also semantically consistent with the related sound. This also covers cross-modal generation in the limit case of a completely missing or hidden visual modality: our method deals with it naturally, being able to generate images from sound alone.
In summary, we show how audio can help visual learning and vice versa by transferring knowledge between the two modalities at training time, in order to distill, reconstruct, or restore the missing modality at test time.
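A minimal sketch of a generalized-distillation-style objective of the kind described above, assuming a teacher network trained on acoustic images and an audio-only student; the function name, weighting `alpha`, and temperature `T` are illustrative choices rather than the thesis' exact setup.

```python
# Generalized-distillation-style loss: hard labels plus soft teacher targets.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """Weighted sum of hard-label cross-entropy and a soft-label KL term."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft

# Usage (shapes only): with a frozen teacher on acoustic images and a student
# operating on audio features, per batch:
#   with torch.no_grad():
#       t_logits = teacher(acoustic_images)
#   s_logits = student(audio)
#   loss = distillation_loss(s_logits, t_logits, labels)
```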
ESTformer: Transformer Utilizing Spatiotemporal Dependencies for EEG Super-resolution
In practical applications of electroencephalography (EEG), lightweight
acquisition devices equipped with only a few electrodes leave analysis methods
with EEG data of extremely low spatial resolution. Recent methods mainly rely
on mathematical interpolation or convolutional neural networks for EEG
super-resolution (SR), but they suffer from high computational cost, introduce
extra bias, and offer little insight into spatiotemporal dependency modeling.
To address this, we propose
the ESTformer, an EEG SR framework utilizing spatiotemporal dependencies based
on the Transformer. The ESTformer applies positional encoding methods and the
Multi-head Self-attention mechanism to the space and time dimensions, enabling
it to learn spatial structural information and temporal functional variation.
Using a fixed masking strategy, the ESTformer adopts a mask token to up-sample
the low-resolution (LR) EEG data, avoiding the disturbance introduced by
mathematical interpolation methods. On this basis, we design various
Transformer blocks to
construct the Spatial Interpolation Module (SIM) and the Temporal
Reconstruction Module (TRM). Finally, the ESTformer cascades the SIM and the
TRM to capture and model spatiotemporal dependencies for EEG SR with fidelity.
Extensive experimental results on two EEG datasets show the effectiveness of
the ESTformer against previous state-of-the-art methods and verify the
superiority of the SR data to the LR data in EEG-based downstream tasks of
person identification and emotion recognition. The proposed ESTformer
demonstrates the versatility of the Transformer for EEG SR tasks.
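The sketch below illustrates the general idea of attending separately over the electrode (spatial) and time axes, with missing electrode positions filled by a learnable mask token before a spatial positional encoding is added. It is a minimal sketch, not the released ESTformer code; the module names, dimensions, and block layout are assumptions.

```python
# Toy spatiotemporal attention for EEG up-sampling (assumed sizes and names).
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.spatial = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, time, channels, d_model)
        b, t, c, d = x.shape
        x = self.spatial(x.reshape(b * t, c, d)).reshape(b, t, c, d)   # attend across electrodes
        x = x.transpose(1, 2).reshape(b * c, t, d)
        x = self.temporal(x).reshape(b, c, t, d).transpose(1, 2)       # attend across time
        return x

class ToyEEGSR(nn.Module):
    """Up-sample LR EEG by placing mask tokens at missing electrode positions."""
    def __init__(self, hr_channels=64, d_model=64):
        super().__init__()
        self.embed = nn.Linear(1, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, 1, d_model))
        self.pos = nn.Parameter(torch.zeros(1, 1, hr_channels, d_model))  # spatial positional encoding
        self.block = SpatioTemporalBlock(d_model)
        self.head = nn.Linear(d_model, 1)

    def forward(self, lr, present_idx):
        # lr: (batch, time, n_lr_channels); present_idx: indices of observed electrodes
        b, t, _ = lr.shape
        tokens = self.mask_token.expand(b, t, self.pos.shape[2], -1).clone()
        tokens[:, :, present_idx] = self.embed(lr.unsqueeze(-1))
        out = self.block(tokens + self.pos)
        return self.head(out).squeeze(-1)   # (batch, time, hr_channels)

# Usage (shapes only): up-sample 16 observed electrodes to 64 positions.
#   model = ToyEEGSR(hr_channels=64)
#   hr = model(lr_eeg, present_idx=torch.arange(0, 64, 4))
```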
Pathology Synthesis of 3D-Consistent Cardiac MR Images using 2D VAEs and GANs
We propose a method for synthesizing cardiac magnetic resonance (MR) images
with plausible heart pathologies and realistic appearances for the purpose of
generating labeled data for supervised deep-learning (DL) training. The image
synthesis consists of label deformation and label-to-image
translation tasks. The former is achieved via latent space interpolation in a
VAE model, while the latter is accomplished via a label-conditional GAN model.
We devise three approaches for label manipulation in the latent space of the
trained VAE model: i) \textbf{intra-subject synthesis} aiming to interpolate
the intermediate slices of a subject to increase the through-plane resolution,
ii) \textbf{inter-subject synthesis} aiming to interpolate the geometry and
appearance of intermediate images between two dissimilar subjects acquired with
different scanner vendors, and iii) \textbf{pathology synthesis} aiming to
synthesize a series of pseudo-pathological synthetic subjects with
characteristics of a desired heart disease. Furthermore, we propose to model
the relationship between 2D slices in the latent space of the VAE prior to
reconstruction for generating 3D-consistent subjects from stacking up 2D
slice-by-slice generations. We demonstrate that such an approach could provide
a solution to diversify and enrich an available database of cardiac MR images
and to pave the way for the development of generalizable DL-based image
analysis algorithms. We quantitatively evaluate the quality of the synthesized
data in an augmentation scenario to achieve generalization and robustness to
multi-vendor and multi-disease data for image segmentation. Our code is
available at https://github.com/sinaamirrajab/CardiacPathologySynthesis.
Comment: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA), https://www.melba-journal.org/2023:01
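As a rough illustration of the latent-space interpolation that drives the label deformation step, the sketch below encodes two label maps with a trained VAE, linearly interpolates their posterior means, and decodes the intermediates. `vae.encode` and `vae.decode` are assumed interfaces, and the label-conditional GAN that maps decoded labels to images is omitted.

```python
# Latent-space interpolation between two encoded label maps (assumed VAE API).
import torch

@torch.no_grad()
def interpolate_labels(vae, label_a, label_b, steps=5):
    """Return `steps` decoded label maps morphing from label_a to label_b."""
    z_a, _ = vae.encode(label_a)   # assumed to return (mean, logvar); use the means
    z_b, _ = vae.encode(label_b)
    outputs = []
    for w in torch.linspace(0.0, 1.0, steps):
        z = (1.0 - w) * z_a + w * z_b   # linear interpolation in latent space
        outputs.append(vae.decode(z))
    return outputs

# Intra-subject synthesis would interpolate between adjacent slices of the same
# heart; inter-subject and pathology synthesis interpolate between different
# subjects, or toward a pathological direction, in the same way.
```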
Recent Advances in Image Restoration with Applications to Real World Problems
In the past few decades, imaging hardware has improved tremendously in terms of resolution, enabling the widespread use of images in many diverse applications in Earth and planetary missions. However, practical issues associated with image acquisition still affect image quality. Some of these issues, such as blurring, measurement noise, mosaicing artifacts, and low spatial or spectral resolution, can seriously affect the accuracy of the aforementioned applications. This book intends to provide the reader with a glimpse of the latest developments and recent advances in image restoration, which include image super-resolution; image fusion to enhance spatial, spectral, and temporal resolution; and the generation of synthetic images using deep learning techniques. Some practical applications are also included.