Disentanglement via Latent Quantization
In disentangled representation learning, a model is asked to tease apart a
dataset's underlying sources of variation and represent them independently of
one another. Since the model is provided with no ground truth information about
these sources, inductive biases take a paramount role in enabling
disentanglement. In this work, we construct an inductive bias towards
compositionally encoding and decoding data by enforcing a harsh communication
bottleneck. Concretely, we do this by (i) quantizing the latent space into
learnable discrete codes with a separate scalar codebook per dimension and (ii)
applying strong model regularization via an unusually high weight decay.
Intuitively, the quantization forces the encoder to use a small number of
latent values across many datapoints, which in turn enables the decoder to
assign a consistent meaning to each value. Regularization then serves to drive
the model towards this parsimonious strategy. We demonstrate the broad
applicability of this approach by adding it to both basic data-reconstructing
(vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models. In
order to reliably assess these models, we also propose InfoMEC, new metrics for
disentanglement that are cohesively grounded in information theory and fix
well-established shortcomings in previous metrics. Together with
regularization, latent quantization dramatically improves the modularity and
explicitness of learned representations on a representative suite of benchmark
datasets. In particular, our quantized-latent autoencoder (QLAE) consistently
outperforms strong methods from prior work in these key disentanglement
properties without compromising data reconstruction.Comment: 20 pages, 8 figures, code available at
https://github.com/kylehkhsu/disentangl
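The quantization step the abstract describes can be pictured as nearest-neighbor rounding with a separate scalar codebook per latent dimension. The following is a minimal numpy sketch under that reading; the function name, shapes, and toy values are illustrative assumptions, not the authors' code.

```python
import numpy as np

def quantize_latents(z, codebooks):
    """Snap each latent dimension to the nearest entry of its own scalar codebook.

    z: (batch, d) continuous encoder outputs
    codebooks: (d, k) learnable scalar code values, one codebook per dimension
    Returns the quantized latents and the chosen code indices.
    """
    # pairwise |z - code value| per dimension: (batch, d, k)
    dists = np.abs(z[:, :, None] - codebooks[None, :, :])
    idx = dists.argmin(axis=-1)                          # (batch, d)
    z_q = codebooks[np.arange(codebooks.shape[0]), idx]  # (batch, d)
    # During training, gradients would typically pass through the
    # non-differentiable argmin via a straight-through estimator,
    # i.e. z_q = z + stop_gradient(z_q - z).
    return z_q, idx

# toy example: two latent dimensions, two code values each
codebooks = np.array([[0.0, 1.0],
                      [-1.0, 2.0]])
z = np.array([[0.2, 1.8]])
z_q, idx = quantize_latents(z, codebooks)
# z_q -> [[0.0, 2.0]], idx -> [[0, 1]]
```

Because every datapoint must route through the same small set of scalar values per dimension, the decoder is pressured to give each value a consistent meaning, which is the compositional bottleneck the abstract argues for.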
Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge
With the growing demand for autonomous control and personalized speech
generation, style control and transfer in Text-to-Speech (TTS) are becoming
increasingly important. In this paper, we propose a new TTS system that can perform
style transfer with interpretability and high fidelity. Firstly, we design a
TTS system that combines variational autoencoder (VAE) and diffusion refiner to
get refined mel-spectrograms. Specifically, a two-stage and a one-stage system
are designed respectively, to improve the audio quality and the performance of
style transfer. Secondly, a diffusion bridge of quantized VAE is designed to
efficiently learn complex discrete style representations and improve the
performance of style transfer. To further strengthen style transfer, we
introduce ControlVAE, which improves reconstruction quality while maintaining
good interpretability. Experiments on the LibriTTS dataset demonstrate
that our method is more effective than baseline models.
Comment: Accepted at Interspeech202
Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion
Research on deep learning-powered voice conversion (VC) in speech-to-speech
scenarios has become increasingly popular. Although many of the works in the
field of voice conversion share a common global pipeline, there is a
considerable diversity in the underlying structures, methods, and neural
sub-blocks used across research efforts. Thus, obtaining a comprehensive
understanding of the reasons behind the choice of the different methods in the
voice conversion pipeline can be challenging, and the actual hurdles in the
proposed solutions are often unclear. To shed light on these aspects, this
paper presents a scoping review that explores the use of deep learning in
speech analysis, synthesis, and disentangled speech representation learning
within modern voice conversion systems. We screened 621 publications from more
than 38 different venues between the years 2017 and 2023, followed by an
in-depth review of a final database consisting of 123 eligible studies. Based
on the review, we summarise the most frequently used approaches to voice
conversion based on deep learning and highlight common pitfalls within the
community. Lastly, we condense the knowledge gathered, identify the main
challenges, and provide recommendations for future research directions.