14 research outputs found
Model-based Parametric Prosody Synthesis with Deep Neural Network
Conventional statistical parametric speech synthesis (SPSS) captures only frame-wise acoustic observations and computes probability densities at HMM state level to obtain statistical acoustic models combined with decision trees, which is therefore a purely statistical data-driven approach without explicit integration of any articulatory mechanisms found in speech production research. The present study explores an alternative paradigm, namely, model-based parametric prosody synthesis (MPPS), which integrates dynamic mechanisms of human speech production as a core component of F0 generation. In this paradigm, contextual variations in prosody are processed in two separate yet integrated stages: linguistic to motor, and motor to acoustic. Here the motor model is target approximation (TA), which generates syllable-sized F0 contours with only three motor parameters that are associated with linguistic functions. In this study, we simulate this two-stage process by linking the TA model to a deep neural network (DNN), which learns the "linguistic-motor" mapping given the "motor-acoustic" mapping provided by TA-based syllable-wise F0 production. The proposed prosody modeling system outperforms the HMM-based baseline system in both objective and subjective evaluations.
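The three motor parameters can be made concrete with a small sketch of the quantitative TA (qTA) formulation, in which F0 approaches a linear pitch target (slope m, height b) at rate lam as a third-order critically damped response; the parameter values below are illustrative, not taken from the study.

```python
import numpy as np

def qta_contour(b, m, lam, dur, y0=0.0, v0=0.0, a0=0.0, fs=200):
    """Syllable F0 contour under the qTA target-approximation model.

    The three motor parameters are the pitch target's height b, slope m,
    and the approximation rate lam; y0/v0/a0 carry over the F0, velocity
    and acceleration at the syllable onset (cross-syllable continuity).
    """
    t = np.arange(0, dur, 1.0 / fs)
    # Coefficients of the critically damped transient, fixed by the
    # initial F0, velocity and acceleration.
    c1 = y0 - b
    c2 = v0 + c1 * lam - m
    c3 = (a0 + 2 * c2 * lam - c1 * lam ** 2) / 2.0
    return (m * t + b) + (c1 + c2 * t + c3 * t ** 2) * np.exp(-lam * t)

# Illustrative syllable: start at 90 Hz, approach a flat 100 Hz target.
f0 = qta_contour(b=100.0, m=0.0, lam=20.0, dur=0.3, y0=90.0)
```

Carrying the final F0, velocity and acceleration of one syllable into y0/v0/a0 of the next is what keeps the generated contour continuous across syllable boundaries.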
Fundamental frequency modelling: an articulatory perspective with target approximation and deep learning
Current statistical parametric speech synthesis (SPSS) approaches typically aim at state/frame-level acoustic modelling, which leads to a problem of frame-by-frame independence. Besides that, whichever learning technique is used, hidden Markov model (HMM), deep neural network (DNN) or recurrent neural network (RNN), the fundamental idea is to set up a direct mapping from linguistic to acoustic features. Although progress is frequently reported, this idea is questionable in terms of biological plausibility. This thesis aims at addressing the above issues by integrating dynamic mechanisms of human speech production as a core component of F0 generation and thus developing a more human-like F0 modelling paradigm. By introducing an articulatory F0 generation model, target approximation (TA), between text and speech that controls syllable-synchronised F0 generation, contextual F0 variations are processed in two separate yet integrated stages: linguistic to motor, and motor to acoustic. With the goal of demonstrating that human speech movement can be considered as a dynamic process of target approximation and that the TA model is a valid F0 generation model to be used at the motor-to-acoustic stage, a TA-based pitch control experiment is conducted first to simulate the subtle human behaviour of online compensation for pitch-shifted auditory feedback. Then, the TA parameters are collectively controlled by linguistic features via a deep or recurrent neural network (DNN/RNN) at the linguistic-to-motor stage. We trained the systems on a Mandarin Chinese dataset consisting of both statements and questions. The TA-based systems generally outperformed the baseline systems in both objective and subjective evaluations. Furthermore, the number of required linguistic features was reduced first to syllable level only (with DNN) and then with all positional information removed (with RNN).
Fewer linguistic features as input with a limited number of TA parameters as output led to less training data and lower model complexity, which in turn led to more efficient training and faster synthesis.
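The linguistic-to-motor stage reduces to a small regression problem: a handful of syllable-level features in, three TA parameters out. A minimal sketch with purely synthetic data and an illustrative feature set (not the thesis's actual inputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 8 syllable-level linguistic features (tone identity,
# stress, phrase position, ...) mapped to 3 TA motor parameters (target
# height, slope, approximation rate). Data are illustrative placeholders.
X = rng.normal(size=(200, 8))
Y = rng.normal(size=(200, 3))

# One-hidden-layer network trained by plain gradient descent on MSE:
# a minimal "linguistic-to-motor" regressor.
W1 = rng.normal(scale=0.1, size=(8, 32)); b1 = np.zeros(32)
W2 = rng.normal(scale=0.1, size=(32, 3)); b2 = np.zeros(3)

def forward(X):
    H = np.tanh(X @ W1 + b1)
    return H, H @ W2 + b2

_, P0 = forward(X)
loss_start = ((P0 - Y) ** 2).sum(axis=1).mean()

lr = 0.01
for _ in range(500):
    H, P = forward(X)
    G = 2.0 * (P - Y) / len(X)          # gradient of the loss w.r.t. P
    gW2, gb2 = H.T @ G, G.sum(0)
    GH = (G @ W2.T) * (1.0 - H ** 2)    # backprop through tanh
    gW1, gb1 = X.T @ GH, GH.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

_, P1 = forward(X)
loss_end = ((P1 - Y) ** 2).sum(axis=1).mean()
ta_params = forward(X[:5])[1]           # (5, 3): one motor triple per syllable
```

With only three outputs per syllable, the model is far smaller than a frame-level acoustic network, which is the efficiency point the abstract makes.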
Suprasegmental representations for the modeling of fundamental frequency in statistical parametric speech synthesis
Statistical parametric speech synthesis (SPSS) has seen improvements over
recent years, especially in terms of intelligibility. Synthetic speech is often clear
and understandable, but it can also be bland and monotonous. Proper generation
of natural speech prosody is still a largely unsolved problem. This is relevant
especially in the context of expressive audiobook speech synthesis, where speech
is expected to be fluid and captivating.
In general, prosody can be seen as a layer that is superimposed on the segmental
(phone) sequence. Listeners can perceive the same melody or rhythm
in different utterances, and the same segmental sequence can be uttered with a
different prosodic layer to convey a different message. For this reason, prosody
is commonly accepted to be inherently suprasegmental. It is governed by longer
units within the utterance (e.g. syllables, words, phrases) and beyond the utterance
(e.g. discourse). However, common techniques for the modeling of speech
prosody - and speech in general - operate mainly on very short intervals, either at
the state or frame level, in both hidden Markov model (HMM) and deep neural
network (DNN) based speech synthesis.
This thesis presents contributions supporting the claim that stronger representations
of suprasegmental variation are essential for the natural generation of
fundamental frequency for statistical parametric speech synthesis. We conceptualize
the problem by dividing it into three sub-problems: (1) representations of
acoustic signals, (2) representations of linguistic contexts, and (3) the mapping
of one representation to another. The contributions of this thesis provide novel
methods and insights relating to these three sub-problems.
In terms of sub-problem 1, we propose a multi-level representation of f0 using
the continuous wavelet transform and the discrete cosine transform, as well
as a wavelet-based decomposition strategy that is linguistically and perceptually
motivated. In terms of sub-problem 2, we investigate additional linguistic
features such as text-derived word embeddings and a syllable bag-of-phones, and
we propose a novel method for learning word vector representations based on
acoustic counts. Finally, considering sub-problem 3, insights are given regarding
hierarchical models such as parallel and cascaded deep neural networks.
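As a minimal illustration of the DCT side of this multi-level representation (the wavelet decomposition is analogous but multi-scale), a smooth contour can be compressed to a handful of low-order coefficients; the synthetic contour below is illustrative:

```python
import numpy as np
from scipy.fft import dct, idct

# A synthetic voiced log-F0 contour for one phrase (values illustrative):
# roughly log(100 Hz) with a slow rise-fall and a gentle declination.
t = np.linspace(0, 1, 100)
lf0 = 4.6 + 0.2 * np.sin(2 * np.pi * t) - 0.1 * t

# Keep only the first K DCT coefficients: a low-dimensional, smooth
# parameterization of the contour (higher orders add finer detail).
K = 5
coefs = dct(lf0, norm='ortho')
coefs[K:] = 0.0
lf0_hat = idct(coefs, norm='ortho')

err = np.max(np.abs(lf0_hat - lf0))   # truncation error of the 5-coef model
```

Predicting a few such coefficients per unit, rather than raw frame values, is one way to encode contour shape at a level above the frame.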
Prosody generation for text-to-speech synthesis
The absence of convincing intonation makes current parametric speech
synthesis systems sound dull and lifeless, even when trained on expressive
speech data. Typically, these systems use regression techniques to predict the
fundamental frequency (F0) frame-by-frame. This approach leads to overly smooth
pitch contours and fails to construct an appropriate prosodic structure
across the full utterance. In order to capture and reproduce larger-scale
pitch patterns, we propose a template-based approach for automatic F0 generation,
where per-syllable pitch-contour templates (from a small, automatically
learned set) are predicted by a recurrent neural network (RNN). The use of
syllable templates mitigates the over-smoothing problem and is able to reproduce
pitch patterns observed in the data. The use of an RNN, paired with connectionist
temporal classification (CTC), enables the prediction of structure in
the pitch contour spanning the entire utterance. This novel F0 prediction system
is used alongside separate LSTMs for predicting phone durations and the
other acoustic features, to construct a complete text-to-speech system. Later,
we investigate the benefits of including long-range dependencies in duration
prediction at frame-level using uni-directional recurrent neural networks.
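One way the small template inventory could be obtained, as an illustrative stand-in for the automatic learning described above, is by clustering length-normalized syllable contours; the synthetic contours and the choice of k-means here are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Synthetic per-syllable pitch contours, resampled to a fixed length and
# mean-normalized; in practice these would come from aligned speech, and
# the four underlying shapes here are purely illustrative.
n_syl, T = 300, 20
t = np.linspace(0, 1, T)
shapes = [np.sin(np.pi * t), -np.sin(np.pi * t), t - 0.5, 0.5 - t]
X = np.stack([shapes[i % 4] + 0.05 * rng.normal(size=T)
              for i in range(n_syl)])

# The small template inventory: cluster centroids serve as the
# per-syllable pitch-contour templates that the RNN then learns to select.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
templates = km.cluster_centers_      # (4, T) learned template set
labels = km.labels_                  # template index chosen per syllable
```

Because each syllable is reduced to a class label, prediction becomes a classification problem, which is what makes the CTC formulation over the utterance possible.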
Since prosody is a supra-segmental property, we consider an alternate approach
to intonation generation which exploits long-term dependencies of
F0 by effective modelling of linguistic features using recurrent neural networks.
For this purpose, we propose a hierarchical encoder-decoder and a
multi-resolution parallel encoder, in which the encoder takes word- and
higher-level linguistic features as input and upsamples them to the phone
level through a series of hidden layers. This model is integrated into a
hybrid system that was submitted to the Blizzard Challenge workshop. We then
highlight some of the issues in current approaches and outline a plan for
future directions of investigation, along with on-going work.
Anonymizing Speech: Evaluating and Designing Speaker Anonymization Techniques
The growing use of voice user interfaces has led to a surge in the collection
and storage of speech data. While data collection allows for the development of
efficient tools powering most speech services, it also poses serious privacy
issues for users as centralized storage makes private personal speech data
vulnerable to cyber threats. With the increasing use of voice-based digital
assistants like Amazon's Alexa, Google's Home, and Apple's Siri, and with the
increasing ease with which personal speech data can be collected, the risk of
malicious voice cloning and of unwanted inference of speaker identity, gender,
pathological condition, and other personal attributes has increased.
This thesis proposes solutions for anonymizing speech and evaluating the
degree of the anonymization. In this work, anonymization refers to making
personal speech data unlinkable to an identity while maintaining the usefulness
(utility) of the speech signal (e.g., access to linguistic content). We start
by identifying several challenges that evaluation protocols need to consider to
evaluate the degree of privacy protection properly. We clarify how
anonymization systems must be configured for evaluation purposes and highlight
that many practical deployment configurations do not permit privacy evaluation.
Furthermore, we study and examine the most common voice conversion-based
anonymization system and identify its weak points before suggesting new methods
to overcome some limitations. We isolate all components of the anonymization
system to evaluate the degree of speaker personally identifiable information
(PPI) associated with each of them. Then, we propose several transformation
methods for each component to reduce speaker PPI as much as possible while
maintaining utility. We promote
anonymization algorithms based on quantization-based transformation as an
alternative to the most-used and well-known noise-based approach. Finally, we
propose a new attack method to invert anonymization.
Comment: PhD Thesis, Pierre Champion | Université de Lorraine - INRIA Nancy |
for associated source code, see https://github.com/deep-privacy/SA-toolki
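As a deliberately simplified illustration of the quantization idea (the thesis applies transformations to the components of a full anonymization pipeline, not literally to a raw embedding as below), uniform scalar quantization of a speaker embedding bounds how much fine-grained detail survives:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in speaker embedding (e.g. an x-vector); the size is illustrative.
xvec = rng.normal(size=192)

def quantize(v, step):
    """Uniform scalar quantization: snap each component onto a grid of
    width `step`. Coarser grids discard more of the fine-grained detail
    that carries speaker identity, at some cost in utility."""
    return np.round(v / step) * step

anon = quantize(xvec, step=1.0)
residual = np.abs(anon - xvec)   # what quantization removed; at most step/2
```

Unlike additive noise, the information loss here is deterministic and bounded, which is one reason a quantization-based transformation can be easier to reason about than a noise-based one.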
Learning Feature Representation for Automatic Speech Recognition
Feature extraction in automatic speech recognition (ASR) can be regarded
as learning representations from lower-level to more abstract higher-level features.
Lower-level features can be viewed as features from the signal domain,
such as perceptual linear predictive (PLP) and Mel-frequency cepstral
coefficient (MFCC) features. Higher-level feature representations can be considered
as bottleneck features (BNFs) learned using deep neural networks
(DNNs). In this thesis, we focus on improving feature extraction at different
levels mainly for ASR.
The first part of this thesis focuses on learning features from the signal
domain that help ASR. Hand-crafted spectral and cepstral features such as
MFCC are the main features used in most conventional ASR systems; all are
inspired by physiological models of the human auditory system. However, some
aspects of the signal such as pitch cannot be easily extracted from spectral
features, but are found to be useful for ASR. We explore a new algorithm to extract
a pitch feature directly from the signal for ASR and show that this feature, appended to the other features, gives consistent improvements in various languages,
especially tonal languages.
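The thesis's pitch extractor is more robust than this, but the core idea of reading F0 directly off the waveform can be sketched with a plain autocorrelation estimate (the frame length, search band, and synthetic test signal are all illustrative):

```python
import numpy as np

def estimate_f0(frame, fs, fmin=60.0, fmax=400.0):
    """Crude autocorrelation pitch estimate for one voiced frame: the lag
    of the autocorrelation peak inside the plausible pitch range gives
    the period, hence F0 = fs / lag."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag

fs = 16000
t = np.arange(int(0.04 * fs)) / fs        # one 40 ms analysis frame
frame = np.sin(2 * np.pi * 200.0 * t)     # synthetic 200 Hz "voiced" frame
f0 = estimate_f0(frame, fs)
```

Running such an estimator per frame and appending the value (plus a voicing measure) to the spectral features is the shape of the pitch-feature augmentation the abstract describes.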
We then investigate replacing the conventional features with features learned
jointly with the acoustic model directly from the signal, using both
time-domain and frequency-domain approaches. The results show that our
time-domain joint feature learning setup achieves performance on par with
MFCC-based systems, while our frequency-domain
setup outperforms them on various datasets.
Joint feature extraction results in learning data- or language-dependent filter
banks that can degrade performance in unseen noise and channel conditions
or in other languages. To tackle this, we investigate joint universal feature
learning across different languages using the proposed direct-from-signal
setups. We then investigate the filter banks learned in this setup and propose
a new set of features as an extension to conventional Mel filter banks. The results
show consistent word error rate (WER) improvements, especially in clean
conditions.
The second part of this thesis focuses on learning higher-level feature embeddings.
We investigate learning and transferring deep feature representations
across different domains using multi-task learning and weight transfer
approaches. These are adopted to explicitly learn intermediate-level features
that are useful for several different tasks.
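The multi-task idea can be sketched as a shared trunk whose output is the intermediate-level feature consumed by several task heads; all layer sizes and task names below are illustrative, not the thesis's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Shared trunk: maps acoustic input frames to an intermediate representation
# that every task consumes (a bottleneck-style feature). Sizes illustrative.
d_in, d_shared, d_task1, d_task2 = 40, 64, 10, 3
W_shared = rng.normal(scale=0.1, size=(d_in, d_shared))
W_head1 = rng.normal(scale=0.1, size=(d_shared, d_task1))  # e.g. phone classes
W_head2 = rng.normal(scale=0.1, size=(d_shared, d_task2))  # e.g. auxiliary task

x = rng.normal(size=(5, d_in))            # a batch of 5 input frames
h = np.tanh(x @ W_shared)                 # shared intermediate features
y1, y2 = h @ W_head1, h @ W_head2         # per-task outputs
```

Training both heads against their own targets forces gradients from both tasks through W_shared, which is what makes the intermediate feature transferable; weight transfer simply copies W_shared into a new model for a different domain.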