Whispered-to-voiced Alaryngeal Speech Conversion with Generative Adversarial Networks
Most voice restoration methods for patients suffering from aphonia produce either
whispered or monotone speech. Intelligibility aside, this type of speech lacks
expressiveness and naturalness due to the absence of pitch (whispered speech) or
its artificial generation (monotone speech). Existing
techniques to restore prosodic information typically combine a vocoder, which
parameterises the speech signal, with machine learning techniques that predict
prosodic information. In contrast, this paper describes an end-to-end neural
approach for estimating a fully-voiced speech waveform from whispered
alaryngeal speech. By adapting our previous work in speech enhancement with
generative adversarial networks, we develop a speaker-dependent model to
perform whispered-to-voiced speech conversion. Preliminary qualitative results
show that the approach is effective at regenerating voiced speech and creating
realistic pitch contours.
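As an illustration of the kind of adversarial objective such an end-to-end conversion model might use, here is a minimal sketch of a least-squares GAN training step on raw waveforms. It is not the authors' SEGAN-derived system; the toy network shapes, the L1 weight of 100, and the pairing of input and output waveforms for the discriminator are all assumptions.

```python
# Minimal sketch (not the authors' model): a least-squares adversarial training
# step for mapping whispered waveforms to voiced ones. Shapes and loss weights
# are illustrative assumptions.
import torch
import torch.nn as nn

G = nn.Sequential(                      # waveform -> waveform (toy generator)
    nn.Conv1d(1, 32, 31, padding=15), nn.PReLU(),
    nn.Conv1d(32, 1, 31, padding=15), nn.Tanh())
D = nn.Sequential(                      # (output, input) waveform pair -> realness score
    nn.Conv1d(2, 32, 31, stride=4), nn.LeakyReLU(0.2),
    nn.Conv1d(32, 1, 31, stride=4))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(whispered, voiced):
    """One LSGAN step; `whispered` and `voiced` are tensors of shape (batch, 1, samples)."""
    fake = G(whispered)

    # Discriminator: push real pairs toward 1 and generated pairs toward 0.
    d_real = D(torch.cat([voiced, whispered], dim=1))
    d_fake = D(torch.cat([fake.detach(), whispered], dim=1))
    loss_d = ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: fool D while staying close to the voiced target (L1 term).
    d_fake = D(torch.cat([fake, whispered], dim=1))
    loss_g = ((d_fake - 1) ** 2).mean() + 100 * (fake - voiced).abs().mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```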
Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis
In this paper we propose Flowtron: an autoregressive flow-based generative
network for text-to-speech synthesis with control over speech variation and
style transfer. Flowtron borrows insights from IAF and revamps Tacotron in
order to provide high-quality and expressive mel-spectrogram synthesis.
Flowtron is optimized by maximizing the likelihood of the training data, which
makes training simple and stable. Flowtron learns an invertible mapping of data
to a latent space that can be manipulated to control many aspects of speech
synthesis (pitch, tone, speech rate, cadence, accent). Our mean opinion scores
(MOS) show that Flowtron matches state-of-the-art TTS models in terms of speech
quality. In addition, we provide results on control of speech variation,
interpolation between samples and style transfer between speakers seen and
unseen during training. Code and pre-trained models will be made publicly
available at https://github.com/NVIDIA/flowtron.
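To make the likelihood objective concrete, the sketch below shows an affine autoregressive flow over mel-spectrogram frames with its exact negative log-likelihood under a standard-normal prior. It is not Flowtron's architecture (no attention, text conditioning, or stacked flow steps); the LSTM conditioning network and dimensions are assumptions.

```python
# Minimal sketch of an affine autoregressive flow objective: an RNN predicts a
# shift and log-scale for each frame from the previous frames, z = (x - mu) * exp(-s),
# and the exact NLL under a standard-normal prior is 0.5*z^2 + s + 0.5*log(2*pi) per dim.
import math
import torch
import torch.nn as nn

class AffineARFlow(nn.Module):
    def __init__(self, n_mel=80, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_mel, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, 2 * n_mel)     # -> (mu, log_scale)

    def forward(self, mels):
        """mels: (batch, frames, n_mel). Returns the latent z and the mean NLL."""
        # Condition frame t on frames < t by shifting the input right by one step.
        prev = torch.cat([torch.zeros_like(mels[:, :1]), mels[:, :-1]], dim=1)
        h, _ = self.rnn(prev)
        mu, log_s = self.proj(h).chunk(2, dim=-1)
        z = (mels - mu) * torch.exp(-log_s)          # invertible given the past frames
        nll = 0.5 * (z ** 2) + log_s + 0.5 * math.log(2 * math.pi)
        return z, nll.mean()

flow = AffineARFlow()
z, nll = flow(torch.randn(4, 120, 80))               # train by minimizing nll
```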
WaveCycleGAN: Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks
We propose a learning-based filter that allows us to directly modify a
synthetic speech waveform into a natural speech waveform. Speech-processing
systems built on a vocoder framework, such as statistical parametric speech
synthesis and voice conversion, are convenient, especially when only a limited
amount of data is available, because they can represent and process interpretable
acoustic features over a compact space, such as the fundamental frequency (F0) and
mel-cepstrum. However, a well-known problem that leads to the quality
degradation of generated speech is an over-smoothing effect that eliminates
some detailed structure of generated/converted acoustic features. To address
this issue, we propose a synthetic-to-natural speech waveform conversion
technique that uses cycle-consistent adversarial networks and does not require
any explicit assumptions about the speech waveform in adversarial learning.
In contrast to current techniques, since our modification is performed at the
waveform level, we expect that the proposed method will also make it possible
to generate 'vocoder-less'-sounding speech even if the input speech is
synthesized using a vocoder framework. The experimental results demonstrate
that our proposed method can 1) alleviate the over-smoothing effect of the
acoustic features even though the modification is applied directly to the
waveform, and 2) greatly improve the naturalness of the generated speech.
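The cycle-consistency part of such a framework can be written down compactly. The sketch below is an assumption-laden illustration (toy convolutional "filters" G and F, an arbitrary weight of 10), not the paper's model; it only shows how two unpaired waveform domains constrain each other through round-trip reconstruction.

```python
# Minimal sketch of a cycle-consistency objective for unpaired waveform conversion:
# G maps synthetic -> natural, F maps natural -> synthetic, and each round trip
# must reconstruct its input, so no paired or time-aligned data is required.
import torch
import torch.nn as nn

def make_filter():
    # Toy waveform-to-waveform "filter"; stands in for the paper's generators.
    return nn.Sequential(nn.Conv1d(1, 64, 15, padding=7), nn.LeakyReLU(0.2),
                         nn.Conv1d(64, 1, 15, padding=7))

G = make_filter()   # synthetic -> natural
F = make_filter()   # natural   -> synthetic

def cycle_loss(synthetic, natural, lam=10.0):
    """synthetic, natural: unpaired batches of shape (batch, 1, samples)."""
    fake_nat = G(synthetic)
    fake_syn = F(natural)
    recon_syn = F(fake_nat)                  # synthetic -> natural -> synthetic
    recon_nat = G(fake_syn)                  # natural -> synthetic -> natural
    cyc = (recon_syn - synthetic).abs().mean() + (recon_nat - natural).abs().mean()
    return lam * cyc                         # added to the adversarial losses
```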
CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network
The prosodic aspects of speech signals produced by current text-to-speech
systems are typically averaged over training material, and as such lack the
variety and liveliness found in natural speech. To avoid monotony and averaged
prosody contours, it is desirable to have a way of modeling the variation in
the prosodic aspects of speech, so audio signals can be synthesized in multiple
ways for a given text. We present a new, hierarchically structured conditional
variational autoencoder to generate prosodic features (fundamental frequency,
energy and duration) suitable for use with a vocoder or a generative model like
WaveNet. At inference time, an embedding representing the prosody of a sentence
may be sampled from the variational layer to allow for prosodic variation. To
efficiently capture the hierarchical nature of the linguistic input (words,
syllables and phones), both the encoder and decoder parts of the auto-encoder
are hierarchical, in line with the linguistic structure, with layers being
clocked dynamically at the respective rates. We show in our experiments that
our dynamic hierarchical network outperforms a non-hierarchical
state-of-the-art baseline, and, additionally, that prosody transfer across
sentences is possible by employing the prosody embedding of one sentence to
generate the speech signal of another.
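The variational layer at the heart of such a model can be illustrated in isolation. The sketch below shows only a reparameterized sentence-level prosody bottleneck with its KL term; the clocked hierarchical encoder and decoder are omitted, and the dimensions are assumptions.

```python
# Minimal sketch of a variational prosody bottleneck: the encoder output is mapped
# to a mean and log-variance, a sentence-level prosody embedding is sampled with the
# reparameterization trick, and the KL term regularizes it. At inference the
# embedding can be sampled from N(0, I) to obtain prosodic variation.
import torch
import torch.nn as nn

class ProsodyBottleneck(nn.Module):
    def __init__(self, enc_dim=256, z_dim=64):
        super().__init__()
        self.to_mu = nn.Linear(enc_dim, z_dim)
        self.to_logvar = nn.Linear(enc_dim, z_dim)

    def forward(self, sentence_encoding):
        mu = self.to_mu(sentence_encoding)
        logvar = self.to_logvar(sentence_encoding)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)     # reparameterize
        kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(-1).mean()
        return z, kl        # z conditions the prosody decoder (F0, energy, duration)
```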
Prosody Transfer in Neural Text to Speech Using Global Pitch and Loudness Features
This paper presents a simple yet effective method to achieve prosody transfer
from a reference speech signal to synthesized speech. The main idea is to
incorporate well-known acoustic correlates of prosody such as pitch and
loudness contours of the reference speech into a modern neural text-to-speech
(TTS) synthesizer such as Tacotron2 (TC2). More specifically, a small set of
acoustic features is extracted from the reference audio and then used to condition
a TC2 synthesizer. The trained model is evaluated using subjective listening
tests and a novel objective evaluation of prosody transfer is proposed.
Listening tests show that the synthesized speech is rated as highly natural and
that prosody is successfully transferred from the reference speech signal to
the synthesized signal.
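As an illustration, the sketch below extracts pitch and loudness contours from a reference recording with librosa and reduces them to a small global feature vector. The exact feature set and the point at which it conditions Tacotron2 are assumptions, not the paper's recipe.

```python
# Minimal sketch: pitch and loudness contours of a reference utterance summarized
# into a small global prosody vector (feature choice is an assumption).
import numpy as np
import librosa

def global_prosody_features(path, sr=22050):
    y, sr = librosa.load(path, sr=sr)
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz('C2'),
                                 fmax=librosa.note_to_hz('C7'), sr=sr)
    loudness = librosa.feature.rms(y=y)[0]
    f0 = f0[voiced]                                   # keep voiced frames only
    return np.array([np.log(f0).mean(),               # pitch level
                     np.log(f0).std(),                # pitch range
                     loudness.mean(),                 # overall energy
                     loudness.std()])                 # energy variation

# One plausible conditioning point: broadcast this 4-dim vector over time and
# concatenate it to the Tacotron2 encoder outputs before attention.
```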
Multi-task WaveNet: A Multi-task Generative Model for Statistical Parametric Speech Synthesis without Fundamental Frequency Conditions
This paper introduces an improved generative model for statistical parametric
speech synthesis (SPSS) based on WaveNet under a multi-task learning framework.
Unlike the original WaveNet model, the proposed Multi-task WaveNet employs
frame-level acoustic feature prediction as a secondary task, so the external
fundamental frequency prediction model required by the original WaveNet can be
removed. The improved WaveNet can therefore generate high-quality speech
waveforms conditioned only on linguistic features. Multi-task WaveNet produces
more natural and expressive speech by addressing the accumulation of pitch
prediction errors, and it has a simpler inference procedure than the original
WaveNet. Experimental results show that the proposed SPSS method outperforms the
state-of-the-art approach based on the original WaveNet in both objective and
subjective preference tests.
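The multi-task objective can be sketched independently of the WaveNet stack itself. In the illustration below, the primary loss is the usual categorical loss over mu-law samples and the secondary loss regresses frame-level acoustic features; the feature choice and the weighting factor are assumptions.

```python
# Minimal sketch of a multi-task loss: sample-level mu-law classification (primary)
# plus frame-level acoustic feature regression (secondary), so no external F0
# predictor is needed at synthesis time.
import torch
import torch.nn.functional as F

def multitask_loss(sample_logits, target_samples, frame_preds, target_frames, alpha=0.1):
    """
    sample_logits : (batch, 256, samples)  logits over mu-law classes (primary task)
    target_samples: (batch, samples)       mu-law class indices
    frame_preds   : (batch, frames, dims)  predicted frame-level acoustic features
    target_frames : (batch, frames, dims)  e.g. mel-cepstra extracted from the data
    """
    primary = F.cross_entropy(sample_logits, target_samples)
    secondary = F.mse_loss(frame_preds, target_frames)
    return primary + alpha * secondary
```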
Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data
Emotional voice conversion aims to convert the spectrum and prosody to change
the emotional patterns of speech, while preserving the speaker identity and
linguistic content. Many studies require parallel speech data between different
emotional patterns, which is not practical in real life. Moreover, they often
model the conversion of fundamental frequency (F0) with a simple linear
transform. As F0 is a key aspect of intonation that is hierarchical in nature,
we believe it is more appropriate to model F0 at different temporal scales using
a wavelet transform. We propose a CycleGAN network to find an optimal
pseudo pair from non-parallel training data by learning forward and inverse
mappings simultaneously using adversarial and cycle-consistency losses. We also
study the use of the continuous wavelet transform (CWT) to decompose F0 into ten
temporal scales that describe speech prosody at different time resolutions, for
effective F0 conversion. Experimental results show that our proposed framework
outperforms the baselines in both objective and subjective evaluations.
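A CWT decomposition of an F0 contour into ten temporal scales can be sketched as follows. The dyadic scales and Mexican-hat wavelet are assumptions standing in for the paper's exact CWT settings.

```python
# Minimal sketch: interpolate and log-normalize a frame-level F0 contour, then
# decompose it into ten temporal scales with a continuous wavelet transform.
import numpy as np
import pywt

def f0_to_cwt(f0_hz):
    """f0_hz: 1-D array of frame-level F0 values, 0 where unvoiced."""
    f0 = f0_hz.astype(float).copy()
    voiced = f0 > 0
    # Interpolate through unvoiced frames, then z-normalize in the log domain.
    f0[~voiced] = np.interp(np.flatnonzero(~voiced), np.flatnonzero(voiced), f0[voiced])
    logf0 = np.log(f0)
    logf0 = (logf0 - logf0.mean()) / (logf0.std() + 1e-8)
    scales = 2.0 ** np.arange(1, 11)                  # ten dyadic scales (assumption)
    coefs, _ = pywt.cwt(logf0, scales, 'mexh')
    return coefs                                      # shape: (10, n_frames)
```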
Deep Layered Learning in MIR
Deep learning has boosted the performance of many music information retrieval
(MIR) systems in recent years. Yet, the complex hierarchical arrangement of
music makes end-to-end learning hard for some MIR tasks - a very deep and
flexible processing chain is necessary to model some aspect of music audio.
Representations involving tones, chords, and rhythm are fundamental building
blocks of music. This paper discusses how these can be used as intermediate
targets and priors in MIR to deal with structurally complex learning problems,
with learning modules connected in a directed acyclic graph. It is suggested
that this strategy for inference, referred to as deep layered learning (DLL),
can help generalization by (1) enforcing the validity and invariance of
intermediate representations during processing, and by (2) letting the
inferred representations establish the musical organization to support
higher-level invariant processing. A background to modular music processing is
provided together with an overview of previous publications. Relevant concepts
from information processing, such as pruning, skip connections, and performance
supervision are reviewed within the context of DLL. A test is finally
performed, showing how layered learning affects pitch tracking. The results
indicate that offsets in particular are easier to detect when guided by extracted
framewise fundamental frequencies.
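As a toy illustration of how framewise F0 estimates can guide offset detection, the sketch below places a note's offset where the pitch track leaves the note's neighbourhood or voicing ends. It is a hypothetical example, not the module evaluated in the paper.

```python
# Minimal sketch: use a framewise F0 track as a prior for note offset detection.
import numpy as np

def refine_offset(f0_hz, onset_frame, note_hz, tol_semitones=1.0):
    """f0_hz: framewise F0 (0 = unvoiced); returns the estimated offset frame."""
    for t in range(onset_frame, len(f0_hz)):
        if f0_hz[t] <= 0:                             # voicing ended
            return t
        semitone_dev = 12.0 * np.abs(np.log2(f0_hz[t] / note_hz))
        if semitone_dev > tol_semitones:              # pitch left the note
            return t
    return len(f0_hz)
```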
AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms
This paper describes a method based on a sequence-to-sequence learning
(Seq2Seq) with attention and context preservation mechanism for voice
conversion (VC) tasks. Seq2Seq has been outstanding at numerous tasks involving
sequence modeling such as speech synthesis and recognition, machine
translation, and image captioning. In contrast to current VC techniques, our
method 1) stabilizes and accelerates the training procedure by considering
guided attention and proposed context preservation losses, 2) allows not only
spectral envelopes but also fundamental frequency contours and durations of
speech to be converted, 3) requires no context information such as phoneme
labels, and 4) requires no time-aligned source and target speech data in
advance. In our experiment, the proposed VC framework can be trained in only
one day using a single NVIDIA Tesla K80 GPU, while the quality of the
synthesized speech is higher than that of speech converted by Gaussian mixture
model-based VC and is comparable to that of speech generated by recurrent
neural network-based text-to-speech synthesis, which can be regarded as an
upper limit on VC performance.
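The guided attention term mentioned above is commonly implemented as a diagonal prior on the attention matrix. The sketch below follows that common formulation; the width parameter and exact weighting are assumptions about details the abstract does not spell out.

```python
# Minimal sketch of a guided attention loss: attention mass far from the diagonal
# is penalized, which stabilizes and speeds up Seq2Seq training.
import torch

def guided_attention_loss(attn, g=0.2):
    """attn: (batch, target_frames, source_frames) attention weights."""
    _, T, N = attn.shape
    t = torch.arange(T, dtype=torch.float32).unsqueeze(1) / T
    n = torch.arange(N, dtype=torch.float32).unsqueeze(0) / N
    w = 1.0 - torch.exp(-((t - n) ** 2) / (2 * g * g))   # small near the diagonal
    return (attn * w.unsqueeze(0)).mean()
```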
FastPitch: Parallel Text-to-speech with Pitch Prediction
We present FastPitch, a fully-parallel text-to-speech model based on
FastSpeech, conditioned on fundamental frequency contours. The model predicts
pitch contours during inference. By altering these predictions, the generated
speech can be more expressive, better match the semantics of the utterance, and
ultimately be more engaging to the listener. Uniformly increasing or decreasing
pitch with FastPitch generates speech that resembles the voluntary modulation
of voice. Conditioning on frequency contours improves the overall quality of
synthesized speech, making it comparable to the state of the art. It introduces
no overhead, and FastPitch retains the favorable, fully-parallel Transformer
architecture, with an over 900x real-time factor for mel-spectrogram synthesis of
a typical utterance.
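Inference-time pitch manipulation of the kind described can be sketched as a simple transform applied to the predicted contour before it conditions the decoder. The function below is illustrative, not FastPitch's API; how the modified pitch re-enters the model is an assumption.

```python
# Minimal sketch: transpose or flatten a predicted pitch contour before it is
# embedded and added back to the symbol encodings that drive the decoder.
import torch

def modify_pitch(pitch_pred, shift_semitones=0.0, flatten=0.0):
    """pitch_pred: (batch, symbols) predicted F0 in Hz; returns the modified contour."""
    pitch = pitch_pred * 2.0 ** (shift_semitones / 12.0)        # uniform transpose
    mean = pitch.mean(dim=1, keepdim=True)
    return mean + (pitch - mean) * (1.0 - flatten)              # flatten=1.0 -> monotone

# e.g. raise the whole utterance by two semitones and halve its pitch range
# (decoder and pitch_embedding are hypothetical names for illustration):
# mel = decoder(encodings + pitch_embedding(modify_pitch(f0, 2.0, 0.5)))
```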