Morpho-MNIST: Quantitative Assessment and Diagnostics for Representation Learning
Revealing latent structure in data is an active field of research, having brought exciting new models such as variational autoencoders and generative adversarial networks, and is essential to push machine learning towards unsupervised knowledge discovery. However, a major challenge is the lack of suitable benchmarks for an objective and quantitative evaluation of learned representations. To address this issue, we introduce Morpho-MNIST. We extend the popular MNIST dataset by adding a morphometric analysis enabling quantitative comparison of different models, identification of the roles of latent variables, and characterisation of sample diversity. We further propose a set of quantifiable perturbations to assess the performance of unsupervised and supervised methods on challenging tasks such as outlier detection and domain adaptation.
InfoNCE is a variational autoencoder
We show that a popular self-supervised learning method, InfoNCE, is a special
case of a new family of unsupervised learning methods, the self-supervised
variational autoencoder (SSVAE). SSVAEs circumvent the usual VAE requirement to
reconstruct the data by using a carefully chosen implicit decoder. The InfoNCE
objective was motivated as a simplified parametric mutual information
estimator. Under one choice of prior, the SSVAE objective (i.e. the ELBO) is
exactly equal to the mutual information (up to constants). Under an alternative
choice of prior, the SSVAE objective is exactly equal to the simplified
parametric mutual information estimator used in InfoNCE (up to constants).
Importantly, the use of simplified parametric mutual information estimators is
believed to be critical to obtain good high-level representations, and the
SSVAE framework naturally provides a principled justification for using prior
information to choose these estimators.
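For readers unfamiliar with the objective this abstract refers to, the following is a minimal NumPy sketch of the commonly used batch form of the InfoNCE loss. It is not the paper's SSVAE formulation. The function name `info_nce_loss`, the cosine-similarity critic, and the `temperature` value are illustrative assumptions, not details taken from the abstract.

```python
import numpy as np


def info_nce_loss(z_a, z_b, temperature=0.1):
    """Batch InfoNCE loss (illustrative sketch, not the paper's code).

    Row i of z_a and row i of z_b are treated as a positive pair;
    every other row of z_b serves as a negative for z_a[i].
    """
    # Assumption: cosine similarity as the critic, so normalise each row.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # Pairwise similarity scores, shape (N, N).
    logits = (z_a @ z_b.T) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy with the positive pair (the diagonal) as the target.
    return -np.mean(np.diag(log_probs))


rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_aligned = info_nce_loss(z, z)             # positives match exactly
loss_shuffled = info_nce_loss(z, z[::-1].copy())  # positives mismatched
```

In this toy usage, the aligned pairing should yield a lower loss than the reversed pairing, since the loss rewards scoring each positive pair above the in-batch negatives.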
Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion
Research on deep learning-powered voice conversion (VC) in speech-to-speech
scenarios is getting increasingly popular. Although many of the works in the
field of voice conversion share a common global pipeline, there is a
considerable diversity in the underlying structures, methods, and neural
sub-blocks used across research efforts. Thus, obtaining a comprehensive
understanding of the reasons behind the choice of the different methods in the
voice conversion pipeline can be challenging, and the actual hurdles in the
proposed solutions are often unclear. To shed light on these aspects, this
paper presents a scoping review that explores the use of deep learning in
speech analysis, synthesis, and disentangled speech representation learning
within modern voice conversion systems. We screened 621 publications from more
than 38 different venues between the years 2017 and 2023, followed by an
in-depth review of a final database consisting of 123 eligible studies. Based
on the review, we summarise the most frequently used approaches to voice
conversion based on deep learning and highlight common pitfalls within the
community. Lastly, we condense the knowledge gathered, identify main challenges
and provide recommendations for future research directions.
Disentangling Content and Motion for Text-Based Neural Video Manipulation
Giving machines the ability to imagine possible new objects or scenes from
linguistic descriptions and produce their realistic renderings is arguably one
of the most challenging problems in computer vision. Recent advances in deep
generative models have led to new approaches that give promising results
towards this goal. In this paper, we introduce a new method called DiCoMoGAN
for manipulating videos with natural language, aiming to perform local and
semantic edits on a video clip to alter the appearances of an object of
interest. Our GAN architecture allows for better utilization of multiple
observations by disentangling content and motion to enable controllable
semantic edits. To this end, we introduce two tightly coupled networks: (i) a
representation network for constructing a concise understanding of motion
dynamics and temporally invariant content, and (ii) a translation network that
exploits the extracted latent content representation to actuate the
manipulation according to the target description. Our qualitative and
quantitative evaluations demonstrate that DiCoMoGAN significantly outperforms
existing frame-based methods, producing temporally coherent and semantically
more meaningful results.