Parallel WaveNet: Fast High-Fidelity Speech Synthesis
The recently developed WaveNet architecture is the current state of the art
in realistic speech synthesis, consistently rated as more natural sounding
than any previous system across many different languages. However, because WaveNet
relies on sequential generation of one audio sample at a time, it is poorly
suited to today's massively parallel computers, and therefore hard to deploy in
a real-time production setting. This paper introduces Probability Density
Distillation, a new method for training a parallel feed-forward network from a
trained WaveNet with no significant difference in quality. The resulting system
is capable of generating high-fidelity speech samples more than 20 times
faster than real time, and is deployed online by Google Assistant, including
serving multiple English and Japanese voices.
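The core of Probability Density Distillation is to draw samples from the fast parallel student and score them under the frozen autoregressive teacher, minimizing the reverse KL divergence between the two. Below is a minimal sketch of that objective on toy 1-D Gaussians standing in for the student flow and the teacher WaveNet; the distributions, sample count and learning rate are illustrative assumptions, not the paper's implementation.

    # Probability Density Distillation on toy 1-D Gaussians (PyTorch).
    import torch

    torch.manual_seed(0)

    # Frozen "teacher": a fixed density we can evaluate, standing in for a
    # trained WaveNet.
    teacher = torch.distributions.Normal(loc=2.0, scale=0.5)

    # "Student": a reparameterised Gaussian whose samples are differentiable,
    # standing in for the inverse-autoregressive-flow student network.
    mu = torch.zeros(1, requires_grad=True)
    log_sigma = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)

    for step in range(2000):
        student = torch.distributions.Normal(mu, log_sigma.exp())
        x = student.rsample((256,))  # rsample keeps gradients flowing
        # KL(student || teacher) = E_x[log q(x) - log p(x)], Monte Carlo estimate.
        kl = (student.log_prob(x) - teacher.log_prob(x)).mean()
        opt.zero_grad()
        kl.backward()
        opt.step()

    print(mu.item(), log_sigma.exp().item())  # approaches (2.0, 0.5)

Because all samples come from the student, the teacher is only ever evaluated, never sampled, which is what removes the sequential bottleneck at generation time.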
Neural Style Transfer: A Review
The seminal work of Gatys et al. demonstrated the power of Convolutional
Neural Networks (CNNs) in creating artistic imagery by separating and
recombining image content and style. This process of using CNNs to render a
content image in different styles is referred to as Neural Style Transfer
(NST). Since then, NST has become a trending topic both in academic literature
and industrial applications. It is receiving increasing attention, and a
variety of approaches have been proposed to either improve or extend the
original NST algorithm. In this paper, we aim to provide a comprehensive
overview of the current progress in NST. We first propose a taxonomy of current algorithms
in the field of NST. Then, we present several evaluation methods and compare
different NST algorithms both qualitatively and quantitatively. The review
concludes with a discussion of various applications of NST and open problems
for future research. A list of papers discussed in this review, corresponding
codes, pre-trained models and more comparison results are publicly available at
https://github.com/ycjing/Neural-Style-Transfer-Papers.
Comment: Project page: https://github.com/ycjing/Neural-Style-Transfer-Papers
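For readers coming to the review cold, the original Gatys et al. objective optimizes the pixels of an output image so that deep CNN activations match a content image while feature Gram matrices match a style image. The sketch below uses torchvision's VGG-19; the layer indices, loss weight and random stand-in images are common choices assumed here, not settings prescribed by the review.

    # Gatys-style neural style transfer, minimal form (PyTorch/torchvision).
    import torch
    import torchvision

    vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features.eval()
    for p in vgg.parameters():
        p.requires_grad_(False)

    STYLE_LAYERS, CONTENT_LAYER = {0, 5, 10, 19, 28}, 21  # conv1_1..conv5_1, conv4_2

    def features(x):
        style, content = [], None
        for i, layer in enumerate(vgg):
            x = layer(x)
            if i in STYLE_LAYERS:
                style.append(x)
            if i == CONTENT_LAYER:
                content = x
        return style, content

    def gram(f):
        # Channel-by-channel feature correlations, the "style" statistic.
        _, c, h, w = f.shape
        f = f.view(c, h * w)
        return f @ f.t() / (c * h * w)

    content_img = torch.rand(1, 3, 224, 224)  # stand-ins for real images
    style_img = torch.rand(1, 3, 224, 224)
    target_style, _ = features(style_img)
    _, target_content = features(content_img)

    img = content_img.clone().requires_grad_(True)
    opt = torch.optim.Adam([img], lr=0.05)
    for step in range(200):
        style, content = features(img)
        loss = torch.nn.functional.mse_loss(content, target_content)
        for s, t in zip(style, target_style):
            loss = loss + 1e3 * torch.nn.functional.mse_loss(gram(s), gram(t))
        opt.zero_grad()
        loss.backward()
        opt.step()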
Statistical Parametric Speech Synthesis Using Generative Adversarial Networks Under A Multi-task Learning Framework
In this paper, we aim at improving the performance of synthesized speech in
statistical parametric speech synthesis (SPSS) based on a generative
adversarial network (GAN). In particular, we propose a novel architecture
combining the traditional acoustic loss function and the GAN's discriminative
loss under a multi-task learning (MTL) framework. The mean squared error (MSE)
usually used to estimate the parameters of deep neural networks considers
only the numerical difference between the raw audio and the synthesized one.
To mitigate this problem, we introduce the GAN as a second task that
determines whether the input is natural speech under the given conditions. In this
MTL framework, the MSE optimization improves the stability of GAN, and at the
same time GAN produces samples with a distribution closer to natural speech.
Listening tests show that the multi-task architecture generates speech that
is perceived as more natural than that of the conventional methods.
Comment: Submitted to the Automatic Speech Recognition and Understanding (ASRU) 2017 Workshop
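A minimal sketch of the multi-task objective described above: the acoustic model (generator) is trained with the usual MSE loss against target acoustic features plus an adversarial loss from a discriminator that sees the features together with the conditions. The network shapes and the weight adv_weight are illustrative assumptions.

    # MSE + GAN multi-task training step for an acoustic model (PyTorch).
    import torch
    import torch.nn as nn

    feat_dim, cond_dim, adv_weight = 60, 32, 0.1

    G = nn.Sequential(nn.Linear(cond_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
    D = nn.Sequential(nn.Linear(feat_dim + cond_dim, 128), nn.ReLU(), nn.Linear(128, 1))
    g_opt = torch.optim.Adam(G.parameters(), lr=1e-4)
    d_opt = torch.optim.Adam(D.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()

    def train_step(cond, natural):
        fake = G(cond)
        ones, zeros = torch.ones(len(cond), 1), torch.zeros(len(cond), 1)
        # Task 2: the discriminator judges natural vs. synthesized features.
        d_loss = bce(D(torch.cat([natural, cond], -1)), ones) + \
                 bce(D(torch.cat([fake.detach(), cond], -1)), zeros)
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()
        # Generator: numerical (MSE) loss plus the adversarial loss.
        g_loss = nn.functional.mse_loss(fake, natural) + \
                 adv_weight * bce(D(torch.cat([fake, cond], -1)), ones)
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()

    train_step(torch.randn(8, cond_dim), torch.randn(8, feat_dim))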
Probability density distillation with generative adversarial networks for high-quality parallel waveform generation
This paper proposes an effective probability density distillation (PDD)
algorithm for WaveNet-based parallel waveform generation (PWG) systems.
Recently proposed teacher-student frameworks in the PWG system have
successfully achieved real-time generation of speech signals. However, the
difficulty of optimizing the PDD criterion without auxiliary losses results in
quality degradation of the synthesized speech. To generate more natural speech
signals within the teacher-student framework, we propose a novel optimization
criterion based on generative adversarial networks (GANs). In the proposed
method, the inverse autoregressive flow-based student model is incorporated as
a generator in the GAN framework, and jointly optimized by the PDD mechanism
with the proposed adversarial learning method. As this process encourages the
student to model the distribution of realistic speech waveforms, the perceptual
quality of the synthesized speech becomes much more natural. Our experimental
results verify that PWG systems with the proposed method outperform both
those using conventional approaches and autoregressive generation systems
with a well-trained teacher WaveNet.
Comment: Accepted to INTERSPEECH 2019
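Concretely, the student's loss in this framework combines the PDD term against the teacher with an adversarial term from a waveform discriminator. A minimal sketch of the joint objective, where student, teacher, D and the weight lam are all hypothetical stand-ins:

    # Joint PDD + adversarial loss for the flow-based student (PyTorch).
    import torch

    def student_loss(student, teacher, D, lam=4.0):
        x = student.rsample((256,))  # waveform samples drawn from the student
        # PDD term: reverse KL between student and teacher densities.
        kl = (student.log_prob(x) - teacher.log_prob(x)).mean()
        logits = D(x)  # discriminator judges the realism of the samples
        adv = torch.nn.functional.binary_cross_entropy_with_logits(
            logits, torch.ones_like(logits))  # adversarial term: fool D
        return kl + lam * adv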
Capacity allocation analysis of neural networks: A tool for principled architecture design
Designing neural network architectures is a task that lies somewhere between
science and art. For a given task, some architectures are eventually preferred
over others, based on a mix of intuition, experience, experimentation and luck.
For many tasks, the final word belongs to the loss function, while for
some others a further perceptual evaluation is necessary to assess and compare
performance across models. In this paper, we introduce the concept of capacity
allocation analysis, with the aim of shedding some light on what network
architectures focus their modelling capacity on, when used on a given task. We
focus more particularly on spatial capacity allocation, which analyzes a
posteriori the effective number of parameters that a given model has allocated
for modelling dependencies on a given point or region in the input space, in
linear settings. We use this framework to perform a quantitative comparison
between some classical architectures on various synthetic tasks. Finally, we
consider how capacity allocation might translate to non-linear settings.
Comment: 25 pages, 15 figures
Improving Unsupervised Sparsespeech Acoustic Models with Categorical Reparameterization
The Sparsespeech model is an unsupervised acoustic model that can generate
discrete pseudo-labels for untranscribed speech. We extend the Sparsespeech
model to allow for sampling over a random discrete variable, yielding
pseudo-posteriorgrams. The degree of sparsity in this posteriorgram can be
fully controlled after the model has been trained. We use the Gumbel-Softmax
trick to approximately sample from a discrete distribution in the neural
network and this allows us to train the network efficiently with standard
backpropagation. The new and improved model is trained and evaluated on the
Libri-Light corpus, a benchmark for ASR with limited or no supervision. The
model is trained on 600h and 6000h of English read speech. We evaluate the
improved model using the ABX error measure and a semi-supervised setting with
10h of transcribed speech. We observe a relative improvement of up to 31.4% on
ABX error rates across speakers on the test set with the improved Sparsespeech
model on 600h of speech data, and further improvements when we scale the
model to 6000h.
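The Gumbel-Softmax trick replaces a non-differentiable draw from a categorical distribution with a temperature-controlled softmax over Gumbel-perturbed logits, so gradients can flow through the sampling step. A minimal sketch, with the class count and temperature as illustrative assumptions:

    # Gumbel-Softmax sampling, differentiable end to end (PyTorch).
    import torch
    import torch.nn.functional as F

    def gumbel_softmax_sample(logits, tau=1.0):
        # Add Gumbel(0, 1) noise, then take a tempered softmax.
        u = torch.rand_like(logits)
        gumbel = -torch.log(-torch.log(u + 1e-20) + 1e-20)
        return F.softmax((logits + gumbel) / tau, dim=-1)

    logits = torch.randn(4, 42, requires_grad=True)  # 42 pseudo-label classes
    y = gumbel_softmax_sample(logits, tau=0.5)       # near-one-hot posteriorgram
    y.sum().backward()                               # gradients reach the logits

PyTorch also ships this as torch.nn.functional.gumbel_softmax. Lowering tau sharpens the output toward one-hot, which is consistent with the abstract's point that the degree of sparsity in the posteriorgram can be adjusted after the model has been trained.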
A New Multilabel System for Automatic Music Emotion Recognition
Advancing the automatic recognition of emotions that music can induce
requires considering the multiplicity and simultaneity of emotions.
Comparison of different machine learning algorithms performing multilabel and
multiclass classification is the core of our work. The study analyzes the
implementation of the Geneva Emotional Music Scale 9 in the Emotify music
dataset and investigates its adoption from a machine-learning perspective. We
approach the scenario of emotion expression/induction through music as a
multilabel and multiclass problem, where multiple emotion labels can be adopted
for the same music track by each annotator (multilabel), and each emotion can
be identified or not in the music (multiclass). The aim is the automatic
recognition of induced emotions through music.
Comment: 2 tables. Research supported by the EU through the MUSICAL-MOODS project, funded by the Marie Sklodowska-Curie Actions Individual Fellowships Global Fellowships (MSCA-IF-GF) of the Horizon 2020 Programme (H2020/2014-2020), REA grant agreement n.65943
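As a concrete picture of the multilabel formulation, each track carries one binary indicator per GEMS-9 emotion and a one-vs-rest classifier predicts all nine at once. The sketch below uses random stand-ins for the audio features and Emotify annotations, and the classifier choice is an assumption, not the study's algorithm.

    # Multilabel emotion classification over 9 GEMS-9 labels (scikit-learn).
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.multiclass import OneVsRestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 20))           # stand-in audio features per track
    Y = rng.integers(0, 2, size=(400, 9))    # one binary column per emotion

    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf.fit(X[:300], Y[:300])
    pred = clf.predict(X[300:])              # a 0/1 matrix: labels per track
    print(f1_score(Y[300:], pred, average="micro"))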
Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis
Generating versatile and appropriate synthetic speech requires control over
the output expression separate from the spoken text. Important non-textual
speech variation is seldom annotated, in which case output control must be
learned in an unsupervised fashion. In this paper, we perform an in-depth study
of methods for unsupervised learning of control in statistical speech
synthesis. For example, we show that popular unsupervised training heuristics
can be interpreted as variational inference in certain autoencoder models. We
additionally connect these models to VQ-VAEs, another recently proposed class
of deep variational autoencoders, which we show can be derived from a very
similar mathematical argument. The implications of these new probabilistic
interpretations are discussed. We illustrate the utility of the various
approaches with an application to acoustic modelling for emotional speech
synthesis, where the unsupervised methods for learning expression control
(without access to emotional labels) are found to give results that in many
aspects match or surpass the previous best supervised approach.
Comment: 17 pages, 4 figures
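The autoencoder view can be made concrete with a minimal VAE-style model: an encoder compresses acoustic features into a small latent vector under a KL penalty, and that latent then serves as an unsupervised control input to the decoder. The sizes, Gaussian prior and beta weight below are illustrative assumptions, not the paper's exact models.

    # Minimal VAE bottleneck over acoustic features (PyTorch).
    import torch
    import torch.nn as nn

    feat_dim, latent_dim = 80, 2

    enc = nn.Sequential(nn.Linear(feat_dim, 64), nn.Tanh(),
                        nn.Linear(64, 2 * latent_dim))
    dec = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh(),
                        nn.Linear(64, feat_dim))
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

    def step(x, beta=0.1):
        mu, log_var = enc(x).chunk(2, dim=-1)
        z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)  # reparameterisation
        recon = nn.functional.mse_loss(dec(z), x)
        kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).mean()
        loss = recon + beta * kl
        opt.zero_grad()
        loss.backward()
        opt.step()
        return z  # the learned latent doubles as an expression control knob

    z = step(torch.randn(16, feat_dim))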
Single-sided Real-time PESQ Score Estimation
For several years now, the ITU-T's Perceptual Evaluation of Speech Quality
(PESQ) has been the reference for objective speech quality assessment. It is
widely deployed in commercial QoE measurement products, and it has been well
studied in the literature. While PESQ does provide reasonably good correlation
with subjective scores for VoIP applications, the algorithm itself is not
usable in a real-time context, since it requires a reference signal, which is
usually not available in normal conditions. In this paper we provide an
alternative technique for estimating PESQ scores in a single-sided fashion,
based on the Pseudo Subjective Quality Assessment (PSQA) technique.
Comment: In Proceedings of Measurement of Speech, Audio and Video Quality in Networks (MESAQIN'09), Prague, Czech Republic, June 2009, pp. 94-9
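The single-sided idea can be sketched as follows: rather than comparing a degraded signal to a reference, learn a mapping from observable parameters (e.g. packet loss rate, jitter, codec bitrate) to PESQ scores collected offline. The sketch substitutes a random forest for PSQA's random neural network and uses synthetic data, so every value in it is an illustrative assumption.

    # Reference-free PESQ score estimation from observable parameters.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(500, 3))  # e.g. loss rate, jitter, bitrate (fake)
    pesq = 4.5 - 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)  # fake scores

    model = RandomForestRegressor(n_estimators=100).fit(X[:400], pesq[:400])
    print(model.predict(X[400:405]))  # single-sided, real-time score estimates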
Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis
Recent studies have shown that text-to-speech synthesis quality can be
improved by using glottal vocoding. This refers to vocoders that parameterize
speech into two parts, the glottal excitation and the vocal tract, mirroring
the human speech production apparatus. Current glottal vocoders generate the
glottal excitation waveform by using deep neural networks (DNNs). However, the
squared error-based training of the present glottal excitation models is
limited to generating conditional average waveforms, which fails to capture the
stochastic variation of the waveforms. As a result, shaped noise is added as
post-processing. In this study, we propose a new method for predicting glottal
waveforms by generative adversarial networks (GANs). GANs are generative models
that aim to embed the data distribution in a latent space, enabling generation
of new instances very similar to the original by randomly sampling the latent
distribution. The glottal pulses generated by GANs show a stochastic component
similar to natural glottal pulses. In our experiments, we compare synthetic
speech generated using glottal waveforms produced by both DNNs and GANs. The
results show that the newly proposed GANs achieve synthesis quality comparable
to that of widely-used DNNs, without using an additive noise component.
Comment: Accepted to Interspeech
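A minimal sketch of the setup described above: a generator maps a random latent vector to a glottal pulse waveform (conditioning on acoustic features is omitted here), so repeated sampling reproduces the stochastic variation that squared-error training averages away. All sizes and hyperparameters are illustrative assumptions.

    # GAN over glottal pulse waveforms (PyTorch).
    import torch
    import torch.nn as nn

    latent_dim, pulse_len = 16, 400

    G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                      nn.Linear(256, pulse_len), nn.Tanh())
    D = nn.Sequential(nn.Linear(pulse_len, 256), nn.ReLU(), nn.Linear(256, 1))
    g_opt = torch.optim.Adam(G.parameters(), lr=2e-4)
    d_opt = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCEWithLogitsLoss()

    def gan_step(real_pulses):
        n = len(real_pulses)
        fake = G(torch.randn(n, latent_dim))
        d_loss = bce(D(real_pulses), torch.ones(n, 1)) + \
                 bce(D(fake.detach()), torch.zeros(n, 1))
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()
        g_loss = bce(D(fake), torch.ones(n, 1))  # generator tries to fool D
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()

    gan_step(torch.randn(32, pulse_len))  # stand-in for real glottal pulses
    # Two draws from the same generator differ: the stochastic component that
    # additive-noise post-processing previously had to approximate.
    p1, p2 = G(torch.randn(1, latent_dim)), G(torch.randn(1, latent_dim))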