Robust model training and generalisation with Studentising flows
Normalising flows are tractable probabilistic models that leverage the power
of deep learning to describe a wide parametric family of distributions, all
while remaining trainable using maximum likelihood. We discuss how these
methods can be further improved based on insights from robust (in particular,
resistant) statistics. Specifically, we propose to endow flow-based models with
fat-tailed latent distributions such as multivariate Student's t, as a simple
drop-in replacement for the Gaussian distribution used by conventional
normalising flows. While robustness brings many advantages, this paper explores
two of them: 1) We describe how using fatter-tailed base distributions can give
benefits similar to gradient clipping, but without compromising the asymptotic
consistency of the method. 2) We also discuss how robust ideas lead to models
with reduced generalisation gap and improved held-out data likelihood.
Experiments on several different datasets confirm the efficacy of the proposed
approach in both regards.
Comment: 9 pages, 8 figures, accepted for publication at INNF+ 2020 (Second
ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit
Likelihood Models)
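One way to see the link between fat-tailed base distributions and gradient clipping is to compare how strongly each base density pulls on a far-away latent point. The sketch below is an illustration rather than the authors' code: it assumes a zero-mean, identity-scale base distribution, and the function names and the choice of nu = 5 degrees of freedom are hypothetical. For the Gaussian, the gradient of the log-density with respect to the latent z grows linearly with the norm of z; for the multivariate Student's t it stays bounded.

```python
import math


def gaussian_logpdf(z):
    """Log-density of a standard multivariate Gaussian at z (list of floats)."""
    d = len(z)
    sq = sum(v * v for v in z)
    return -0.5 * d * math.log(2.0 * math.pi) - 0.5 * sq


def student_t_logpdf(z, nu):
    """Log-density of a zero-mean, identity-scale multivariate Student's t.

    nu is the degrees of freedom (assumed value, chosen by the user).
    """
    d = len(z)
    sq = sum(v * v for v in z)
    return (
        math.lgamma(0.5 * (nu + d))
        - math.lgamma(0.5 * nu)
        - 0.5 * d * math.log(nu * math.pi)
        - 0.5 * (nu + d) * math.log1p(sq / nu)
    )


def gaussian_grad_norm(z):
    """Norm of d/dz log N(z; 0, I) = -z, which grows without bound."""
    return math.sqrt(sum(v * v for v in z))


def student_t_grad_norm(z, nu):
    """Norm of d/dz log t(z) = -(nu + d) z / (nu + ||z||^2), which is bounded."""
    d = len(z)
    sq = sum(v * v for v in z)
    return (nu + d) * math.sqrt(sq) / (nu + sq)


# Compare the pull on latent points at increasing distance from the origin.
for r in (1.0, 10.0, 100.0):
    z = [r, 0.0]
    print(r, gaussian_grad_norm(z), student_t_grad_norm(z, nu=5.0))
```

Under these assumptions the Student's t gradient norm peaks at a radius of sqrt(nu) and decays beyond it, so an outlier contributes a limited gradient signal, whereas the Gaussian term keeps growing with distance.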
Full-Glow: Fully conditional Glow for more realistic image generation
Autonomous agents, such as driverless cars, require large amounts of labeled
visual data for their training. A viable approach for acquiring such data is
training a generative model with collected real data, and then augmenting the
collected real dataset with synthetic images from the model, generated with
control of the scene layout and ground truth labeling. In this paper we propose
Full-Glow, a fully conditional Glow-based architecture for generating plausible
and realistic images of novel street scenes given a semantic segmentation map
indicating the scene layout. Benchmark comparisons show our model to outperform
recent works in terms of the semantic segmentation performance of a pretrained
PSPNet. This indicates that images from our model are, to a higher degree than
from other models, similar to real images of the same kinds of scenes and
objects, making them suitable as training data for a visual semantic
segmentation or object recognition system.
Comment: 17 pages, 12 figures
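The abstract does not detail the architecture, so as general background only, here is a minimal sketch of a conditional affine coupling step of the kind used in Glow-style flows. It is not the Full-Glow model: `toy_conditioner` is a hypothetical stand-in for a neural network, and `cond` stands in for features derived from a segmentation map. The point it shows is that the transform stays exactly invertible, with a cheap log-determinant, however the conditioning signal enters.

```python
import math


def toy_conditioner(x_a, cond, n_out):
    """Hypothetical stand-in for a network: returns (log_scale, shift)."""
    h = sum(x_a) + sum(cond)
    log_scale = [math.tanh(0.1 * h + 0.05 * i) for i in range(n_out)]
    shift = [0.5 * h - 0.1 * i for i in range(n_out)]
    return log_scale, shift


def coupling_forward(x, cond):
    """Transform the second half of x given the first half and the condition."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_scale, shift = toy_conditioner(x_a, cond, len(x_b))
    y_b = [v * math.exp(s) + t for v, s, t in zip(x_b, log_scale, shift)]
    log_det = sum(log_scale)  # log |det Jacobian| of the affine map
    return x_a + y_b, log_det


def coupling_inverse(y, cond):
    """Undo coupling_forward; the first half is unchanged, so the
    conditioner can be re-evaluated with the same inputs."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_scale, shift = toy_conditioner(y_a, cond, len(y_b))
    x_b = [(v - t) * math.exp(-s) for v, s, t in zip(y_b, log_scale, shift)]
    return y_a + x_b


x = [0.2, -1.0, 0.7, 1.5]
cond = [1.0, 0.0, 0.0]  # hypothetical one-hot class feature
y, log_det = coupling_forward(x, cond)
x_rec = coupling_inverse(y, cond)
print(y, log_det, x_rec)
```

Because the first half of the input passes through untouched, the inverse never needs to invert the conditioner itself, which is why the conditioner can be an arbitrary function of the input half and the condition.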
A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS
Recent work has explored using self-supervised learning (SSL) speech
representations such as wav2vec2.0 as the representation medium in standard
two-stage TTS, in place of conventionally used mel-spectrograms. It is, however,
unclear which speech SSL is the better fit for TTS, and whether or not the
performance differs between read and spontaneous TTS, the latter of which is
arguably more challenging. This study aims at addressing these questions by
testing several speech SSLs, including different layers of the same SSL, in
two-stage TTS on both read and spontaneous corpora, while maintaining constant
TTS model architecture and training settings. Results from listening tests show
that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other
tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work
sheds light on both how speech SSL can readily improve current TTS systems, and
how SSLs compare in the challenging generative task of TTS. Audio examples can
be found at https://www.speech.kth.se/tts-demos/ssr_tts
Comment: 5 pages, 2 figures. ICASSP Workshop SASB (Self-Supervision in Audio,
Speech and Beyond)202
On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis
Self-supervised learning (SSL) speech representations learned from large
amounts of diverse, mixed-quality speech data without transcriptions are
gaining ground in many speech technology applications. Prior work has shown
that SSL is an effective intermediate representation in two-stage
text-to-speech (TTS) for both read and spontaneous speech. However, it is still
not clear which SSL and which layer from each SSL model is most suited for
spontaneous TTS. We address this shortcoming by extending the scope of
comparison for SSL in spontaneous TTS to 6 different SSLs and 3 layers within
each SSL. Furthermore, SSL has also shown potential in predicting the mean
opinion scores (MOS) of synthesized speech, but this has only been done in
read-speech MOS prediction. We extend an SSL-based MOS prediction framework
previously developed for scoring read speech synthesis and evaluate its
performance on synthesized spontaneous speech. All experiments are conducted
twice on two different spontaneous corpora in order to find generalizable
trends. Overall, we present comprehensive experimental results on the use of
SSL in spontaneous TTS and MOS prediction to further quantify and understand
how SSL can be used in spontaneous TTS. Audio samples:
https://www.speech.kth.se/tts-demos/sp_ssl_tts
Comment: 7 pages, 2 figures. 12th ISCA Speech Synthesis Workshop (SSW) 202