
    A glottal chink model for the synthesis of voiced fricatives

    This paper presents a simulation framework that enables a glottal chink model to be integrated into a time-domain continuous speech synthesizer along with self-oscillating vocal folds. The glottis is then made up of two separate components: a self-oscillating part and a constantly open chink. This feature allows the simulation of voiced fricatives: the self-oscillating vocal folds generate the voiced source, while the constant glottal opening provides the airflow necessary to generate the frication noise. Numerical simulations show the accuracy of the model in simulating voiced fricatives, as well as phonetic assimilation phenomena such as sonorization and devoicing. The simulation framework is also used to show that the phonatory/articulatory space for generating voiced fricatives differs according to the desired sound: for instance, the minimal glottal opening for generating frication noise is shorter for /z/ than for /Z/.
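    As a minimal illustration of the two-component glottis described above, the total glottal area can be written as a vibrating contribution plus a constant chink term. The area values and the sinusoidal half-wave standing in for a full self-oscillating vocal-fold model are illustrative assumptions, not the paper's implementation:

```python
import math

def glottal_area(t, f0=120.0, a_vib_max=0.15, a_chink=0.05):
    """Total glottal area (cm^2) at time t (s): a self-oscillating
    vibrating part plus a constantly open posterior chink."""
    # Vibrating part: open during half of the cycle, clamped at closure.
    a_vib = max(0.0, a_vib_max * math.sin(2 * math.pi * f0 * t))
    # The chink never closes, so the total area never reaches zero;
    # this residual opening sustains the frication noise source.
    return a_vib + a_chink

# Even at full closure of the vibrating part, an opening persists:
print(min(glottal_area(n / 1000.0) for n in range(100)))  # = a_chink
```

    Because the minimum area stays at `a_chink`, airflow through the glottis is never interrupted, which is the property that lets frication noise coexist with voicing.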

    Glottal Opening and Strategies of Production of Fricatives

    This work investigates the influence of the gradual opening of the glottis along its length during the production of fricatives in intervocalic contexts. Acoustic simulations reveal the existence of a transient zone in the articulatory space where the frication noise level is very sensitive to small perturbations of the glottal opening. This zone corresponds to configurations where both frication noise and voiced contributions are present in the speech signal. To avoid this instability, speakers may adopt different strategies to ensure the voiced/voiceless contrast of fricatives. This is evidenced by experimental data combining simultaneous glottal opening measurements, performed with ePGG, and audio recordings of vowel-fricative-vowel pseudowords. Voiceless fricatives are usually longer, in order to maximize the number of voiceless time frames over the voiced frames caused by crossing the transient regime. For voiced fricatives, the speaker may avoid the unstable regime either by keeping the frication noise level low, thus favoring the voicing characteristic, or by making very short crossings into the unstable regime. It is also shown that when speakers are asked to sustain voiced fricatives longer than in natural speech, they adopt the strategy of keeping the frication noise level low to avoid the unstable regime.

    Extension of the single-matrix formulation of the vocal tract: consideration of bilateral channels and connection of self-oscillating models of the vocal folds with a glottal chink

    This paper presents extensions of the single-matrix formulation (Mokhtari et al., 2008, Speech Comm. 50(3), 179–190) that enable self-oscillating models of the vocal folds, including a glottal chink, to be connected to the vocal tract. The extensions also handle a local division of the main air path into two lateral channels, as may occur during the production of lateral consonants. They are detailed through a reformulation of the acoustic conditions at the glottis and at the upstream and downstream connections of the bilateral channels. The simulation framework is validated through numerical simulations. The introduction of an antiresonance in the transfer function due to the presence of asymmetric bilateral channels is confirmed by the simulations, and the frequency of the antiresonance agrees with theoretical predictions. Simulations of static vowels reveal that the behavior of the vocal folds is qualitatively similar whether they are connected to the single-matrix formulation or to the classic reflection-type line analog model. Finally, the simulations highlight the acoustic effect of the glottal chink on the production of vowels: shortening the vibrating part of the vocal folds lowers the amplitude of the glottal flow, and therefore the overall acoustic level radiated at the lips, and introduces an offset in the glottal flow waveform.
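    The antiresonance created by asymmetric bilateral channels can be located numerically even in a crude sketch. Treating the two channels as lossless uniform tubes that share both end junctions, the pressure transfer has a zero where the branch chain-matrix terms sin(kL)/A cancel. The dimensions, the speed of sound, and the lossless-tube simplification below are illustrative assumptions, not the paper's model:

```python
import math

C = 350.0                 # speed of sound in warm moist air, m/s
L1, L2 = 0.04, 0.05       # channel lengths (m), deliberately asymmetric
A1, A2 = 2.0e-4, 2.0e-4   # channel cross-sections (m^2)

def transfer_zero_condition(f):
    """Zero condition for two lossless tubes in parallel between common
    junctions: sin(k*L1)/A1 + sin(k*L2)/A2 = 0, with k = 2*pi*f/C."""
    k = 2 * math.pi * f / C
    return math.sin(k * L1) / A1 + math.sin(k * L2) / A2

def first_antiresonance(fmax=8000.0, df=1.0):
    """Scan upward in frequency for the first sign change."""
    f = df
    prev = transfer_zero_condition(f)
    while f < fmax:
        f += df
        cur = transfer_zero_condition(f)
        if prev > 0 >= cur or prev < 0 <= cur:
            return f
        prev = cur
    return None

# For equal areas the first zero falls at f = C / (L1 + L2):
print(first_antiresonance())  # close to 350 / 0.09, about 3889 Hz
```

    For symmetric channels (equal lengths and areas) the condition has no solution away from the poles, which matches the observation that the antiresonance appears only with asymmetric channels.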

    Copy synthesis of running speech based on vocal tract imaging and audio recording

    This study presents a simulation framework to synthesize running speech from simultaneous vocal tract imaging and audio recording. The aim is to numerically simulate the acoustic and mechanical phenomena that occur during speech production given the actual articulatory gestures of the speaker, so that the simulated speech reproduces the original acoustic features (formant trajectories, prosody, segmental phonation, etc.). The result is intended to be a copy of the original speech signal, hence the name copy synthesis. The shape of the vocal tract is extracted from 2D midsagittal views acquired at a framerate high enough to provide a few images per produced phone. The resulting area functions of the vocal tract are anatomically realistic and also account for side cavities. The acoustic simulation framework uses an extended version of the single-matrix formulation that enables a self-oscillating model of the vocal folds with a glottal chink to be connected to the time-varying waveguide network that models the vocal tract. Copy synthesis of a few French sentences shows the accuracy of the simulation framework in reproducing acoustic cues of natural phrase-level utterances containing most French natural classes while taking into account the real geometric shape of the speaker. The framework is intended as a tool to relate the acoustic features of speech to their articulatory or phonatory origins.

    Acoustic impact of the gradual glottal abduction on the production of fricatives: A numerical study

    This paper presents a numerical study of the acoustic impact of the glottal chink opening on the production of fricatives. Sustained fricatives are simulated using classic lumped-circuit-element methods to compute the propagation of the acoustic wave along the vocal tract. A recent glottis model is connected to the wave solver to simulate a partial abduction of the vocal folds during their self-oscillating cycles. Area functions of fricatives at the three places of articulation of French (palato-alveolar, alveolar, and labiodental) were extracted from static MRI acquisitions. The simulations highlight the existence of three distinct regimes, named A, B, and C, depending on the chink opening. They are characterized by the frication noise level: A exhibits a low frication noise level, B is a mixed noise/voice signal, and C contains only frication noise. The regimes have significant impacts on the first spectral moments. Their boundaries are defined in terms of minimal abduction of the vocal folds, and the simulations show that they depend on articulatory and glottal configurations. Regime B is shown to be unstable: it requires very specific configurations compared with the other regimes, and acoustic features are very sensitive within it.
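    The first spectral moments on which the three regimes leave their signature can be computed from a short-time power spectrum. The Hann window and magnitude-squared weighting below are conventional choices for fricative analysis, assumed here rather than taken from the paper:

```python
import numpy as np

def spectral_moments(signal, fs):
    """First two spectral moments of a segment: centroid and
    standard deviation, both in Hz."""
    windowed = signal * np.hanning(len(signal))
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    p = spectrum / spectrum.sum()            # normalize to a distribution
    centroid = float(np.sum(freqs * p))      # 1st moment
    sd = float(np.sqrt(np.sum((freqs - centroid) ** 2 * p)))  # 2nd moment
    return centroid, sd

# White noise (frication-like, regime C) has a far higher centroid than
# a low-frequency tone (voicing-dominated, regime A):
fs = 16000
rng = np.random.default_rng(0)
noise = rng.standard_normal(1024)
tone = np.sin(2 * np.pi * 120 * np.arange(1024) / fs)
print(spectral_moments(noise, fs)[0] > spectral_moments(tone, fs)[0])  # True
```

    A mixed noise/voice signal (regime B) would fall between these extremes, which is one reason small perturbations move its moments so much.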

    Copy synthesis of phrase-level utterances

    This paper presents a simulation framework for synthesizing speech from anatomically realistic data of the vocal tract. The acoustic propagation paradigm is chosen so that it can deal with complex geometries and a time-varying length of the vocal tract. The glottal source model designed in this paper allows partial closure of the glottis by branching a posterior chink in parallel with a classic lumped mass-spring model of the vocal folds. Temporal scenarios for the dynamic shapes of the vocal tract and the glottal configurations may be derived from the simultaneous acquisition of X-ray images and audio recordings. Copy synthesis of a few French sentences shows the accuracy of the simulation framework in reproducing acoustic cues of natural phrase-level utterances containing most French natural classes while taking into account the real geometric shape of the speaker.

    Self-Supervised Solution to the Control Problem of Articulatory Synthesis

    Given an articulatory-to-acoustic forward model, it is a priori unknown how its motor control must be operated to achieve a desired acoustic result. This control problem is a fundamental issue of articulatory speech synthesis and the cradle of acoustic-to-articulatory inversion, a discipline that attempts to address the issue by means of various methods. This work presents an end-to-end solution to the articulatory control problem, in which synthetic motor trajectories of Monte-Carlo-generated artificial speech are linked to input modalities (such as natural speech recordings or phoneme sequence input) via speaker-independent latent representations of a vector-quantized variational autoencoder. The proposed method is self-supervised and thus, in principle, independent of the synthesizer and speaker model.
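    At the heart of a vector-quantized variational autoencoder is the quantization step: each encoder output vector is snapped to its nearest codebook entry, yielding the discrete latent units through which modalities can be linked. The codebook size and latent dimension below are illustrative, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((64, 8))   # 64 codes, 8-dim latents

def quantize(z):
    """Map each row of z to the index and vector of its nearest code."""
    # Pairwise squared distances between latents and codebook entries.
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

z = rng.standard_normal((5, 8))           # a batch of encoder outputs
idx, zq = quantize(z)
print(idx.shape, zq.shape)                # (5,) (5, 8)
```

    Because different speakers producing the same content should map to the same code indices, the discrete sequence `idx` is what makes the representation usable in a speaker-independent way.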

    Time-Varying Modeling of Glottal Source and Vocal Tract and Sequential Bayesian Estimation of Model Parameters for Speech Synthesis

    Speech is generated by articulators acting on a phonatory source. Identifying this phonatory source and the articulatory geometry are individually challenging and ill-posed problems, called speech separation and articulatory inversion, respectively. A trade-off exists between the decomposition and the recovered articulatory geometry, because multiple articulatory configurations can map to the same produced speech. Moreover, if measurements are obtained only from a microphone, they lack any invasive insight, adding a further challenge to an already difficult problem. A joint non-invasive estimation strategy that couples articulatory and phonatory knowledge would lead to better articulatory speech synthesis. In this thesis, a joint estimation strategy for speech separation and articulatory geometry recovery is studied. Unlike previous periodic/aperiodic decomposition methods that use stationary speech models within a frame, the proposed model presents a non-stationary speech decomposition method. A parametric glottal source model and an articulatory vocal tract response are represented in a dynamic state-space formulation. The unknown parameters of the speech generation components are estimated using sequential Monte Carlo methods under specific assumptions. The proposed approach is compared with other glottal inverse filtering methods, including iterative adaptive inverse filtering, state-space inverse filtering, and the quasi-closed-phase method.
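    The sequential Monte Carlo machinery used in the thesis can be illustrated with a minimal bootstrap particle filter on a toy state space. The scalar random-walk model and noise levels are illustrative stand-ins for the far richer glottal-source/vocal-tract state space:

```python
import numpy as np

def particle_filter(observations, n_particles=500, q=0.1, r=0.5):
    """Bootstrap particle filter for a scalar random-walk state x_t
    observed as y_t = x_t + noise; q and r are process and
    observation noise standard deviations."""
    rng = np.random.default_rng(0)
    particles = rng.standard_normal(n_particles)
    estimates = []
    for y in observations:
        # Propagate: random-walk state transition.
        particles = particles + q * rng.standard_normal(n_particles)
        # Weight: Gaussian observation likelihood.
        w = np.exp(-0.5 * ((y - particles) / r) ** 2)
        w /= w.sum()
        estimates.append(float(np.sum(w * particles)))
        # Resample (multinomial) to avoid weight degeneracy.
        particles = rng.choice(particles, size=n_particles, p=w)
    return np.array(estimates)

# Track a slowly drifting "parameter" through noisy observations:
true = np.linspace(0.0, 1.0, 50)
rng = np.random.default_rng(1)
obs = true + 0.5 * rng.standard_normal(50)
est = particle_filter(obs)
err_filter = float(np.mean(np.abs(est - true)))
err_raw = float(np.mean(np.abs(obs - true)))
print(err_filter < err_raw)  # filtering reduces the tracking error
```

    In the thesis the state vector would instead carry time-varying glottal source and vocal tract parameters, but the propagate/weight/resample cycle is the same.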

    Artificial Vocal Learning guided by Phoneme Recognition and Visual Information

    This paper introduces a paradigm shift in vocal learning simulations, in which the communicative function of speech acquisition determines the learning process and intelligibility is considered the primary measure of learning success. To this end, a novel approach for artificial vocal learning is presented that uses deep neural network-based phoneme recognition to compute the speech acquisition objective function. This function guides a learning framework built around the state-of-the-art articulatory speech synthesizer VocalTractLab as the motor-to-acoustic forward model. In this way, an extensive set of German phonemes, including most of the consonants and all stressed vowels, was produced successfully, and the synthetic phonemes were rated as highly intelligible by human listeners. Furthermore, it is shown that visual speech information, such as lip and jaw movements, can be extracted from video recordings and incorporated into the learning framework as an additional loss component during optimization. This visual loss did not increase the overall intelligibility of the phonemes; instead, it acted as a regularization mechanism that facilitated finding more biologically plausible solutions in the articulatory domain.
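    The overall learning loop can be sketched as an optimizer minimizing a recognition-based loss plus a weighted visual term. Everything below is a stand-in: the quadratic losses replace the DNN recognizer and video pipeline, the (1+1) evolution-strategy loop replaces the paper's optimizer, and the weight `lam` is an assumed value:

```python
import numpy as np

rng = np.random.default_rng(0)
target_artic = np.array([0.3, -0.2, 0.5])   # hidden "ideal" articulation

def phoneme_loss(params):
    """Stand-in for the phoneme-recognition objective: distance to a
    hidden target articulation."""
    return float(np.sum((params - target_artic) ** 2))

def visual_loss(params):
    """Stand-in for the lip/jaw video term: constrains only the first
    two (visible) articulatory parameters."""
    return float(np.sum((params[:2] - target_artic[:2]) ** 2))

def learn(lam=0.5, steps=200, sigma=0.1):
    """(1+1) evolution-strategy loop minimizing the combined loss."""
    x = np.zeros(3)
    best = phoneme_loss(x) + lam * visual_loss(x)
    for _ in range(steps):
        cand = x + sigma * rng.standard_normal(3)
        score = phoneme_loss(cand) + lam * visual_loss(cand)
        if score < best:
            x, best = cand, score
    return x

x_best = learn()
print(phoneme_loss(x_best))  # small: the loop homes in on the target
```

    Because the visual term only constrains parameters that are visible from outside, it cannot raise intelligibility by itself; it narrows the search toward solutions consistent with the observed face, mirroring the regularization effect reported above.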

    Optimization and automation of relative fundamental frequency for objective assessment of vocal hyperfunction

    The objective of this project is to improve clinical assessment and diagnosis of the voice disorder vocal hyperfunction (VH). VH is a condition characterized by excessive laryngeal and paralaryngeal tension, and is assumed to be the underlying cause of the majority of voice disorders. Current clinical assessment of VH is subjective and demonstrates poor inter-rater reliability. Recent work indicates that a new acoustic measure, relative fundamental frequency (RFF), is sensitive to the maladaptive functional behaviors associated with VH and can potentially be used to characterize VH objectively. Here, we explored and enhanced the potential of RFF as a measure of VH in three ways. First, the current protocol for RFF estimation was optimized to simplify the recording procedure and reduce estimation time. Second, RFF was compared with the current state-of-the-art measures of VH: listener perception of vocal effort and the aerodynamic ratio of sound pressure level to subglottal pressure level. Third, an automated algorithm using the optimized recording protocol was developed and validated against manual estimation methods and listener perception. This work enables large-scale studies on RFF to determine the specific physiological elements that contribute to the measure's ability to capture VH, and may provide a non-invasive and readily implemented solution for this long-standing clinical issue.
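    The core RFF computation is simple once cycle-to-cycle fundamental frequencies are in hand: each vocal cycle's f0 is expressed in semitones relative to a steady-vowel reference. The semitone normalization and the example numbers below are assumptions for illustration; the protocol studied in this work additionally specifies which cycles around voicing offset and onset are measured:

```python
import math

def relative_f0_semitones(cycle_f0, reference_f0):
    """Relative fundamental frequency of one vocal cycle, in semitones
    re. a steady-vowel reference f0."""
    return 12.0 * math.log2(cycle_f0 / reference_f0)

def f0_from_periods(period_durations):
    """Cycle-level f0 estimates from successive period durations (s)."""
    return [1.0 / T for T in period_durations]

# Example: glottal periods shortening near an obstruent raise f0
# relative to a 100 Hz steady vowel:
ref = 100.0
periods = [0.0100, 0.0098, 0.0095]   # seconds per cycle
print([round(relative_f0_semitones(f, ref), 2)
       for f in f0_from_periods(periods)])  # [0.0, 0.35, 0.89]
```

    Automating this reliably means detecting individual glottal cycles near voicing boundaries, which is exactly where the manual protocol is slow and where the optimized algorithm described above is aimed.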