14 research outputs found

    Model-based Parametric Prosody Synthesis with Deep Neural Network

    Conventional statistical parametric speech synthesis (SPSS) captures only frame-wise acoustic observations and computes probability densities at the HMM state level to obtain statistical acoustic models combined with decision trees. It is therefore a purely statistical, data-driven approach without explicit integration of any articulatory mechanisms found in speech production research. The present study explores an alternative paradigm, namely, model-based parametric prosody synthesis (MPPS), which integrates dynamic mechanisms of human speech production as a core component of F0 generation. In this paradigm, contextual variations in prosody are processed in two separate yet integrated stages: linguistic to motor, and motor to acoustic. Here the motor model is target approximation (TA), which generates syllable-sized F0 contours with only three motor parameters that are associated with linguistic functions. In this study, we simulate this two-stage process by linking the TA model to a deep neural network (DNN), which learns the “linguistic-motor” mapping given the “motor-acoustic” mapping provided by TA-based syllable-wise F0 production. The proposed prosody modeling system outperforms the HMM-based baseline system in both objective and subjective evaluations.

    Fundamental frequency modelling: an articulatory perspective with target approximation and deep learning

    Current statistical parametric speech synthesis (SPSS) approaches typically aim at state/frame-level acoustic modelling, which leads to a problem of frame-by-frame independence. Besides that, whichever learning technique is used, hidden Markov model (HMM), deep neural network (DNN) or recurrent neural network (RNN), the fundamental idea is to set up a direct mapping from linguistic to acoustic features. Although progress is frequently reported, this idea is questionable in terms of biological plausibility. This thesis aims at addressing the above issues by integrating dynamic mechanisms of human speech production as a core component of F0 generation and thus developing a more human-like F0 modelling paradigm. By introducing an articulatory F0 generation model, target approximation (TA), between text and speech to control syllable-synchronised F0 generation, contextual F0 variations are processed in two separate yet integrated stages: linguistic to motor, and motor to acoustic. With the goal of demonstrating that human speech movement can be considered as a dynamic process of target approximation and that the TA model is a valid F0 generation model to be used at the motor-to-acoustic stage, a TA-based pitch control experiment is conducted first to simulate the subtle human behaviour of online compensation for pitch-shifted auditory feedback. Then, the TA parameters are collectively controlled by linguistic features via a deep or recurrent neural network (DNN/RNN) at the linguistic-to-motor stage. We trained the systems on a Mandarin Chinese dataset consisting of both statements and questions. The TA-based systems generally outperformed the baseline systems in both objective and subjective evaluations. Furthermore, the number of required linguistic features was reduced first to syllable level only (with DNN) and then with all positional information removed (with RNN). Fewer linguistic features as input with a limited number of TA parameters as output led to less training data and lower model complexity, which in turn led to more efficient training and faster synthesis.

    Suprasegmental representations for the modeling of fundamental frequency in statistical parametric speech synthesis

    Statistical parametric speech synthesis (SPSS) has seen improvements over recent years, especially in terms of intelligibility. Synthetic speech is often clear and understandable, but it can also be bland and monotonous. Proper generation of natural speech prosody is still a largely unsolved problem. This is relevant especially in the context of expressive audiobook speech synthesis, where speech is expected to be fluid and captivating. In general, prosody can be seen as a layer that is superimposed on the segmental (phone) sequence. Listeners can perceive the same melody or rhythm in different utterances, and the same segmental sequence can be uttered with a different prosodic layer to convey a different message. For this reason, prosody is commonly accepted to be inherently suprasegmental. It is governed by longer units within the utterance (e.g. syllables, words, phrases) and beyond the utterance (e.g. discourse). However, common techniques for the modeling of speech prosody - and speech in general - operate mainly on very short intervals, either at the state or frame level, in both hidden Markov model (HMM) and deep neural network (DNN) based speech synthesis. This thesis presents contributions supporting the claim that stronger representations of suprasegmental variation are essential for the natural generation of fundamental frequency for statistical parametric speech synthesis. We conceptualize the problem by dividing it into three sub-problems: (1) representations of acoustic signals, (2) representations of linguistic contexts, and (3) the mapping of one representation to another. The contributions of this thesis provide novel methods and insights relating to these three sub-problems. In terms of sub-problem 1, we propose a multi-level representation of f0 using the continuous wavelet transform and the discrete cosine transform, as well as a wavelet-based decomposition strategy that is linguistically and perceptually motivated. 
In terms of sub-problem 2, we investigate additional linguistic features such as text-derived word embeddings and syllable bag-of-phones, and we propose a novel method for learning word vector representations based on acoustic counts. Finally, considering sub-problem 3, insights are given regarding hierarchical models such as parallel and cascaded deep neural networks.

    Prosody generation for text-to-speech synthesis

    The absence of convincing intonation makes current parametric speech synthesis systems sound dull and lifeless, even when trained on expressive speech data. Typically, these systems use regression techniques to predict the fundamental frequency (F0) frame-by-frame. This approach leads to overly smooth pitch contours and fails to construct an appropriate prosodic structure across the full utterance. In order to capture and reproduce larger-scale pitch patterns, we propose a template-based approach for automatic F0 generation, where per-syllable pitch-contour templates (from a small, automatically learned set) are predicted by a recurrent neural network (RNN). The use of syllable templates mitigates the over-smoothing problem and is able to reproduce pitch patterns observed in the data. The use of an RNN, paired with connectionist temporal classification (CTC), enables the prediction of structure in the pitch contour spanning the entire utterance. This novel F0 prediction system is used alongside separate LSTMs for predicting phone durations and the other acoustic features, to construct a complete text-to-speech system. Later, we investigate the benefits of including long-range dependencies in duration prediction at frame level using uni-directional recurrent neural networks. Since prosody is a supra-segmental property, we consider an alternative approach to intonation generation which exploits long-term dependencies of F0 by effective modelling of linguistic features using recurrent neural networks. For this purpose, we propose a hierarchical encoder-decoder and a multi-resolution parallel encoder, where the encoder takes word and higher-level linguistic features at the input and upsamples them to phone level through a series of hidden layers. The encoder is integrated into a hybrid system that was then submitted to the Blizzard Challenge workshop. We then highlight some of the issues in current approaches and outline a plan for future directions of investigation, along with ongoing work.

    Autoregressive neural F0 model for statistical parametric speech synthesis

    Anonymizing Speech: Evaluating and Designing Speaker Anonymization Techniques

    The growing use of voice user interfaces has led to a surge in the collection and storage of speech data. While data collection allows for the development of efficient tools powering most speech services, it also poses serious privacy issues for users as centralized storage makes private personal speech data vulnerable to cyber threats. With the increasing use of voice-based digital assistants like Amazon's Alexa, Google's Home, and Apple's Siri, and with the increasing ease with which personal speech data can be collected, the risk of malicious use of voice-cloning and speaker/gender/pathological/etc. recognition has increased. This thesis proposes solutions for anonymizing speech and evaluating the degree of the anonymization. In this work, anonymization refers to making personal speech data unlinkable to an identity while maintaining the usefulness (utility) of the speech signal (e.g., access to linguistic content). We start by identifying several challenges that evaluation protocols need to consider to evaluate the degree of privacy protection properly. We clarify how anonymization systems must be configured for evaluation purposes and highlight that many practical deployment configurations do not permit privacy evaluation. Furthermore, we examine the most common voice conversion-based anonymization system and identify its weak points before suggesting new methods to overcome some limitations. We isolate all components of the anonymization system to evaluate the degree of speaker PPI associated with each of them. Then, we propose several transformation methods for each component to reduce speaker PPI as much as possible while maintaining utility. We promote anonymization algorithms based on quantization-based transformation as an alternative to the most-used and well-known noise-based approach. Finally, we explore a new attack method to invert anonymization.
    Comment: PhD Thesis Pierre Champion | Université de Lorraine - INRIA Nancy | for associated source code, see https://github.com/deep-privacy/SA-toolki

    Learning Feature Representation for Automatic Speech Recognition

    Feature extraction in automatic speech recognition (ASR) can be regarded as learning representations from lower-level to more abstract higher-level features. Lower-level features can be viewed as features from the signal domain, such as perceptual linear predictive (PLP) and Mel-frequency cepstral coefficient (MFCC) features. Higher-level feature representations can be considered as bottleneck features (BNFs) learned using deep neural networks (DNNs). In this thesis, we focus on improving feature extraction at different levels, mainly for ASR. The first part of this thesis focuses on learning features from the signal domain that help ASR. Hand-crafted spectral and cepstral features such as MFCC are the main features used in most conventional ASR systems; all are inspired by physiological models of the human auditory system. However, some aspects of the signal, such as pitch, cannot be easily extracted from spectral features but are found to be useful for ASR. We explore a new algorithm to extract a pitch feature directly from a signal for ASR and show that this feature, appended to the other features, gives consistent improvements in various languages, especially tonal languages. We then investigate replacing the conventional features with features learned jointly from the signal domain, using time-domain and frequency-domain approaches. The results show that our time-domain joint feature learning setup achieves the state-of-the-art performance obtained using MFCC, while our frequency-domain setup outperforms it on various datasets. Joint feature extraction results in learning data- or language-dependent filter banks, which can degrade performance in unseen noise and channel conditions or in other languages. To tackle this, we investigate joint universal feature learning across different languages using the proposed direct-from-signal setups. We then investigate the filter banks learned in this setup and propose a new set of features as an extension to conventional Mel filter banks. The results show consistent word error rate (WER) improvement, especially in clean conditions. The second part of this thesis focuses on learning higher-level feature embeddings. We investigate learning and transferring deep feature representations across different domains using multi-task learning and weight transfer approaches. They have been adopted to explicitly learn intermediate-level features that are useful for several different tasks.