    Intraspeaker Comparisons of Acoustic and Articulatory Variability in American English /r/ Productions

    Full text link
    The purpose of this report is to test the hypothesis that speakers utilize an acoustic, rather than articulatory, planning space for speech production. It has been well-documented that many speakers of American English use different tongue configurations to produce /r/ in different phonetic contexts. The acoustic planning hypothesis suggests that although the /r/ configuration varies widely in different contexts, the primary acoustic cue for /r/, a dip in the F3 trajectory, will be less variable due to tradeoffs in articulatory variability, or trading relations, that help maintain a relatively constant F3 trajectory across phonetic contexts. Acoustic data and EMMA articulatory data from seven speakers producing /r/ in different phonetic contexts were analyzed. Visual inspection of the EMMA data at the point of F3 minimum revealed that each speaker appeared to use at least two of three trading relation strategies that would be expected to reduce F3 variability. Articulatory covariance measures confirmed that all seven speakers utilized a trading relation between tongue back height and tongue back horizontal position, six speakers utilized a trading relation between tongue tip height and tongue back height, and the speaker who did not use this latter strategy instead utilized a trading relation between tongue tip height and tongue back horizontal position. Estimates of F3 variability with and without the articulatory covariances indicated that F3 variability would be much higher for all speakers if the articulatory covariances were not utilized. These conclusions were further supported by a comparison of measured F3 variability to F3 variabilities estimated from the pellet data with and without articulatory covariances. In all subjects, the actual F3 variance was significantly lower than the F3 variance estimated without articulatory covariances, further supporting the conclusion that the articulatory trading relations were being used to reduce F3 variability. Together, these results strongly suggest that the neural control mechanisms underlying speech production make elegant use of trading relations between articulators to maintain a relatively invariant acoustic trace for /r/ across phonetic contexts.
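    The variance comparison at the heart of this argument can be illustrated with a short numerical sketch. Under a locally linear mapping from pellet coordinates to F3, Var(F3) = w^T C w, and zeroing the off-diagonal covariance terms mimics articulators that vary independently, i.e., without trading relations. The sensitivity weights and covariance values below are hypothetical illustrations, not values from the study.

```python
import numpy as np

# Hypothetical sensitivities of F3 (Hz per mm) to three pellet coordinates:
# tongue tip height, tongue back height, tongue back horizontal position.
w = np.array([-40.0, -55.0, 30.0])

# Hypothetical articulatory covariance matrix (mm^2). The off-diagonal terms
# stand for trading relations; their signs are chosen so that, combined with
# the weight signs, the covariances reduce the resulting F3 variance.
C = np.array([
    [4.0, -2.5, 1.0],
    [-2.5, 5.0, 3.0],
    [1.0, 3.0, 6.0],
])

# F3 variance under the linear approximation: Var(F3) = w^T C w.
var_with_cov = w @ C @ w

# Keeping only the diagonal removes the covariances, i.e., no trading relations.
var_without_cov = w @ np.diag(np.diag(C)) @ w

print(f"F3 variance with covariances:    {var_with_cov:.0f} Hz^2")
print(f"F3 variance without covariances: {var_without_cov:.0f} Hz^2")
```

    With these illustrative numbers the estimated F3 variance drops from roughly 27000 Hz^2 to under 4000 Hz^2 once the trading-relation covariances are included, which is the direction of effect the abstract reports.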

    Estimating underlying articulatory targets of Thai vowels by using deep learning based on generating synthetic samples from a 3D vocal tract model and data augmentation

    Get PDF
    Representation learning is one of the fundamental issues in modeling articulatory-based speech synthesis using target-driven models. This paper proposes a computational strategy for learning underlying articulatory targets from a 3D articulatory speech synthesis model using a bi-directional long short-term memory recurrent neural network based on a small set of representative seed samples. From a seeding set, a larger training set was generated that provided richer contextual variations for the model to learn. The deep learning model for acoustic-to-target mapping was then trained to model the inverse relation of the articulation process. This method allows the trained model to map the given acoustic data onto the articulatory target parameters which can then be used to identify the distribution based on linguistic contexts. The model was evaluated based on its effectiveness in mapping acoustics to articulation, and the perceptual accuracy of speech reproduced from the estimated articulation. The results indicate that the model can accurately imitate speech with a high degree of phonemic precision
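    A minimal sketch of the acoustic-to-target mapping described above, assuming MFCC-like input frames and a small vector of articulatory target parameters per utterance; the layer sizes, pooling choice, and toy data are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AcousticToTargetBiLSTM(nn.Module):
    """Bidirectional LSTM that maps a sequence of acoustic frames to a fixed
    vector of articulatory target parameters (dimensions are illustrative)."""

    def __init__(self, n_acoustic=39, n_hidden=128, n_targets=10):
        super().__init__()
        self.blstm = nn.LSTM(n_acoustic, n_hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * n_hidden, n_targets)

    def forward(self, x):                  # x: (batch, frames, n_acoustic)
        h, _ = self.blstm(x)               # h: (batch, frames, 2 * n_hidden)
        pooled = h.mean(dim=1)             # average over time
        return self.head(pooled)           # (batch, n_targets)

# Toy training step with random tensors standing in for synthesized samples.
model = AcousticToTargetBiLSTM()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

acoustics = torch.randn(8, 100, 39)        # 8 utterances, 100 frames each
targets = torch.randn(8, 10)               # articulatory target parameters
loss = loss_fn(model(acoustics), targets)
loss.backward()
optimiser.step()
print(float(loss))
```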

    Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab

    Get PDF
    Articulatory copy synthesis (ACS), a subarea of speech inversion, refers to the reproduction of natural utterances in terms of both the physiological articulatory processes and their corresponding acoustic results. This thesis proposes two novel methods for the ACS of human speech using the articulatory speech synthesizer VocalTractLab (VTL) to address or mitigate the existing problems of speech inversion, such as non-unique mapping, acoustic variation among different speakers, and the time-consuming nature of the process. The first method involved finding appropriate VTL gestural scores for given natural utterances using a genetic algorithm. It consisted of two steps: gestural score initialization and optimization. In the first step, gestural scores were initialized from the given acoustic signals using speech recognition, grapheme-to-phoneme (G2P) conversion, and a VTL rule-based method for converting phoneme sequences to gestural scores. In the second step, the initial gestural scores were optimized by a genetic algorithm via an analysis-by-synthesis (ABS) procedure that sought to minimize the cosine distance between the acoustic features of the synthetic and natural utterances. The articulatory parameters were also regularized during the optimization process to restrict them to reasonable values. The second method was based on long short-term memory (LSTM) and convolutional neural networks, which were responsible for capturing the temporal dependence and the spatial structure of the acoustic features, respectively. Neural network regression models were trained that took acoustic features as inputs and produced articulatory trajectories as outputs. In addition, to cover as much of the articulatory and acoustic space as possible, the training samples were augmented by manipulating the phonation type, speaking effort, and vocal tract length of the synthetic utterances. Furthermore, two regularization methods were proposed: one based on the smoothness loss of articulatory trajectories and another based on the acoustic loss between original and predicted acoustic features. The best-performing genetic algorithm and convolutional LSTM systems (evaluated in terms of the difference between the estimated and reference VTL articulatory parameters) obtained average correlation coefficients of 0.985 and 0.983 for speaker-dependent utterances, respectively, and their reproduced speech achieved recognition accuracies of 86.25% and 64.69% for speaker-independent utterances of German words, respectively. When applied to German sentence utterances, as well as English and Mandarin Chinese word utterances, the neural network based ACS systems achieved recognition accuracies of 73.88%, 52.92%, and 52.41%, respectively. The results showed that both methods reproduced not only the articulatory processes but also the acoustic signals of the reference utterances. Moreover, the regularization methods led to more physiologically plausible articulatory processes and to estimated articulatory trajectories that were more consistent with the articulations preferred by VTL, thus reproducing more natural and intelligible speech. This study also found that the convolutional layers, when used in conjunction with batch normalization layers, automatically learned more distinctive features from log power spectrograms. Furthermore, the neural network based ACS systems trained using German data could be generalized to the utterances of other languages.
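    A toy sketch of the analysis-by-synthesis idea described above, under stated assumptions: the `synthesize` function is a placeholder for a VocalTractLab rendering of a gestural score (not the real VTL API), the mutation-only evolutionary loop is a strong simplification of the thesis's genetic algorithm, and the penalty term only illustrates the notion of regularizing parameters towards plausible values.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_distance(a, b):
    """Frame-averaged cosine distance between two feature matrices (frames x dims)."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-9
    return float(np.mean(1.0 - num / den))

def fitness(params, synthesize, target_feats, reg_weight=0.01):
    """Acoustic mismatch plus a soft penalty keeping parameters inside [0, 1]."""
    penalty = np.sum(np.clip(params - 1.0, 0.0, None) ** 2 +
                     np.clip(-params, 0.0, None) ** 2)
    return cosine_distance(synthesize(params), target_feats) + reg_weight * penalty

def optimise(init_params, synthesize, target_feats,
             pop_size=30, generations=50, sigma=0.05):
    """Toy mutation-only evolutionary loop: perturb the best candidate,
    evaluate the offspring, and keep the best individual found so far."""
    best = init_params.copy()
    best_cost = fitness(best, synthesize, target_feats)
    for _ in range(generations):
        population = best + sigma * rng.standard_normal((pop_size, best.size))
        costs = [fitness(p, synthesize, target_feats) for p in population]
        i = int(np.argmin(costs))
        if costs[i] < best_cost:
            best, best_cost = population[i], costs[i]
    return best, best_cost

# Placeholder "synthesizer": in the thesis this is VocalTractLab rendering a
# gestural score to audio; here a fixed random linear map stands in for it.
n_params, n_feats, n_frames = 12, 20, 50
W = rng.standard_normal((n_params, n_feats))
def synthesize(params):
    return np.tile(params @ W, (n_frames, 1))

target_feats = synthesize(rng.random(n_params))  # stands in for the natural utterance
initial = rng.random(n_params)                   # stands in for the rule-based initialization
estimated, cost = optimise(initial, synthesize, target_feats)
print(f"final cosine-distance cost: {cost:.4f}")
```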

    Speech vocoding for laboratory phonology

    Get PDF
    Using phonological speech vocoding, we propose a platform for exploring relations between phonology and speech processing and, in broader terms, for exploring relations between the abstract and physical structures of a speech signal. Our goal is to take a step towards bridging phonology and speech processing and to contribute to the program of Laboratory Phonology. We show three application examples for laboratory phonology: compositional phonological speech modelling, a comparison of phonological systems, and an experimental phonological parametric text-to-speech (TTS) system. The featural representations of the following three phonological systems are considered in this work: (i) Government Phonology (GP), (ii) the Sound Pattern of English (SPE), and (iii) the extended SPE (eSPE). Comparing GP- and eSPE-based vocoded speech, we conclude that the latter achieves slightly better results than the former. However, GP, the most compact phonological speech representation, performs comparably to the systems with a higher number of phonological features. The parametric TTS based on phonological speech representation, and trained from an unlabelled audiobook in an unsupervised manner, achieves 85% of the intelligibility of state-of-the-art parametric speech synthesis. We envision that the presented approach paves the way for researchers in both fields to form meaningful hypotheses that are explicitly testable using the concepts developed and exemplified in this paper. On the one hand, laboratory phonologists might test the applied concepts of their theoretical models, and on the other hand, the speech processing community may utilize the concepts developed for the theoretical phonological models to improve current state-of-the-art applications.
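    A small sketch of the kind of featural representation such a vocoder can be driven from: each phone is expanded into a binary feature bundle and then into a frame-level feature matrix. The feature inventory and values below are textbook-style approximations, not the exact GP, SPE, or eSPE representations used in the paper.

```python
# Illustrative SPE-style binary feature bundles for a few phones.
FEATURES = ["voiced", "nasal", "high", "low", "back", "round", "continuant"]

PHONE_FEATURES = {
    "i": dict(voiced=1, nasal=0, high=1, low=0, back=0, round=0, continuant=1),
    "a": dict(voiced=1, nasal=0, high=0, low=1, back=1, round=0, continuant=1),
    "m": dict(voiced=1, nasal=1, high=0, low=0, back=0, round=0, continuant=0),
    "s": dict(voiced=0, nasal=0, high=0, low=0, back=0, round=0, continuant=1),
}

def phone_to_vector(phone):
    """Map a phone label to a binary feature vector (one value per feature)."""
    bundle = PHONE_FEATURES[phone]
    return [bundle[f] for f in FEATURES]

def transcription_to_matrix(phones, frames_per_phone=5):
    """Expand a phone sequence into a frame-level feature matrix, the kind of
    abstract representation a phonological vocoder could be driven from."""
    rows = []
    for p in phones:
        rows.extend([phone_to_vector(p)] * frames_per_phone)
    return rows

print(transcription_to_matrix(["m", "a"], frames_per_phone=2))
```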

    Acoustic modeling using the digital waveguide mesh

    Get PDF
    The digital waveguide mesh has been an active area of music acoustics research for over ten years. Although founded on 1-D digital waveguide modeling, the principles on which it is based are not new to researchers grounded in numerical simulation, FDTD methods, electromagnetic simulation, etc. This article has attempted to provide a considerable review of how the DWM has been applied to acoustic modeling and sound synthesis problems, including new 2-D object synthesis and an overview of recent research activities in articulatory vocal tract modeling, RIR synthesis, and reverberation simulation. The extensive, although not by any means exhaustive, list of references indicates that though the DWM may have parallels in other disciplines, it still offers something new in the field of acoustic simulation and sound synthesis.
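    The core update of a 2-D rectilinear digital waveguide mesh can be written compactly in its well-known finite-difference form: each junction's next pressure is half the sum of its four neighbours at the current step minus its own pressure two steps back. The sketch below is a minimal illustration with a simplistic zero boundary, not a production vocal-tract or room model.

```python
import numpy as np

def dwm_step(p_now, p_prev):
    """One update of a 2-D rectilinear digital waveguide mesh (finite-difference
    form). Boundary junctions are clamped to zero here; realistic rigid or
    lossy walls require dedicated boundary handling."""
    p_next = np.zeros_like(p_now)
    p_next[1:-1, 1:-1] = 0.5 * (p_now[2:, 1:-1] + p_now[:-2, 1:-1] +
                                p_now[1:-1, 2:] + p_now[1:-1, :-2]) - p_prev[1:-1, 1:-1]
    return p_next

# Toy simulation: excite the centre of a small mesh and record at another point.
n = 32
p_prev = np.zeros((n, n))
p_now = np.zeros((n, n))
p_now[n // 2, n // 2] = 1.0                  # impulse excitation

receiver = []
for _ in range(100):
    p_next = dwm_step(p_now, p_prev)
    receiver.append(p_next[n // 2, n // 4])  # "microphone" sample
    p_prev, p_now = p_now, p_next

print(max(receiver), min(receiver))
```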

    Phone-based speech synthesis using neural network with articulatory control.

    Get PDF
    by Lo Wai Kit. Thesis (M.Phil.)--Chinese University of Hong Kong, 1996. Includes bibliographical references (leaves 151-160). Contents: Chapter 1, Introduction (applications of speech synthesis, current status, the proposed neural network speech synthesis, thesis outline); Chapter 2, Linguistic Basics for Speech Synthesis (phonology, phonetics, prosody, transcription systems, Cantonese phonology, the vowel quadrilaterals); Chapter 3, Speech Synthesis Technology (human speech production; controllability, naturalness, complexity and information storage; units for synthesis; types of synthesizer: copy concatenation, vocoder, articulatory synthesis); Chapter 4, Neural Network Speech Synthesis with Articulatory Control (network approximation, feedforward backpropagation and radial basis function networks, parallel operating synthesizer networks, implicit template storage, articulatory control parameters); Chapter 5, Prototype Implementation of the Synthesizer Network (network architectures, spectral templates for training, system requirements, subjective listening test); Chapter 6, Simplified Articulatory Control for the Synthesizer Network (coarticulatory effects, control in various synthesis techniques, an articulatory control model based on the vowel quadrilateral, voice correspondence); Chapter 7, Pause Duration Properties in Cantonese Phrases (measurement of inter-syllable pause, pause duration characteristics, application of pause-duration statistics to the synthesis system); Chapter 8, Conclusion and Further Work. Appendices: A, Cantonese Initials and Finals; B, Using Distortion Measure as Error Function in Neural Network (Itakura-Saito and modified Itakura-Saito measures); C, Orthogonal Least Squares Algorithm for RBF Network Training; D, Phrase Lists for the Pause Duration Experiment.
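    Appendix B of the contents above concerns using the Itakura-Saito distortion as a neural-network error function. The generic Itakura-Saito divergence between two power spectra is sketched below as an illustration; the thesis's exact LPC-based formulation and its modified variant may differ.

```python
import numpy as np

def itakura_saito(p_ref, p_est, eps=1e-12):
    """Itakura-Saito divergence between two power spectra (generic form):
    sum over bins of P/P_hat - log(P/P_hat) - 1. It is zero when the spectra
    match and penalises mismatches at spectral peaks differently from MSE,
    which motivates its use as a training criterion for spectral models."""
    ratio = (p_ref + eps) / (p_est + eps)
    return float(np.sum(ratio - np.log(ratio) - 1.0))

# Example: compare a peaky reference spectral envelope with two estimates.
freq_bins = np.linspace(0, np.pi, 64)
reference = 1.0 / (1.0 - 0.9 * np.cos(freq_bins)) ** 2
flat_estimate = np.full_like(reference, reference.mean())
close_estimate = 1.0 / (1.0 - 0.85 * np.cos(freq_bins)) ** 2

print(itakura_saito(reference, flat_estimate))
print(itakura_saito(reference, close_estimate))
```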

    Acoustic characterization of the glides /j/ and /w/ in American English

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student submitted PDF version of thesis. Includes bibliographical references (p. 141-145). Acoustic analyses were conducted to identify the characteristics that differentiate the glides /j,w/ from adjacent vowels. These analyses were performed on a recorded database of intervocalic glides, produced naturally by two male and two female speakers in controlled vocalic and prosodic contexts. Glides were found to differ significantly from adjacent vowels through RMS amplitude reduction, first formant frequency reduction, open quotient increase, harmonics-to-noise ratio reduction, and fundamental frequency reduction. The acoustic data suggest that glides differ from their cognate high vowels /i,u/ in that the glides are produced with a greater degree of constriction in the vocal tract. The narrower constriction causes an increase in oral pressure, which produces aerodynamic effects on the glottal voicing source. This interaction between the vocal tract filter and its excitation source results in skewing of the glottal waveform, increasing its open quotient and decreasing the amplitude of voicing. A listening experiment with synthetic tokens was performed to isolate and compare the perceptual salience of acoustic cues to the glottal source effects of glides and to the vocal tract configuration itself. Voicing amplitude (representing source effects) and first formant frequency (representing filter configuration) were manipulated in cooperating and conflicting patterns to create percepts of /V#V/ or /V#GV/ sequences, where Vs were high vowels and Gs were their cognate glides. In the responses of ten naïve subjects, voicing amplitude had a greater effect on the detection of glides than first formant frequency, suggesting that glottal source effects are more important to the distinction between glides and high vowels. The results of the acoustic and perceptual studies provide evidence for an articulatory-acoustic mapping defining the glide category. It is suggested that glides are differentiated from high vowels and fricatives by articulatory-acoustic boundaries related to the aerodynamic consequences of different degrees of vocal tract constriction. The supraglottal constriction target for glides is sufficiently narrow to produce a non-vocalic oral pressure drop, but not sufficiently narrow to produce a significant frication noise source. This mapping is consistent with the theory that articulator-free features are defined by aero-mechanical interactions. Implications for phonological classification systems and speech technology applications are discussed. by Elisabeth Hon Hunt. Ph.D.
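    The kind of measurement reported above (how much amplitude or F1 dips at the glide relative to the flanking vowels) can be sketched in a few lines. The tracks, segmentation, and numbers below are invented for illustration only; they are not data from the thesis.

```python
import numpy as np

def dip_relative_to_flanks(track, glide_frames, vowel_before, vowel_after):
    """How much a parameter (e.g., RMS amplitude in dB, or F1 in Hz) drops at
    a glide relative to the mean of the flanking vowels. The frame ranges are
    assumed to come from a prior segmentation."""
    glide_min = np.min(track[glide_frames])
    flank_mean = np.mean(np.concatenate([track[vowel_before], track[vowel_after]]))
    return flank_mean - glide_min

# Illustrative tracks for an /aja/-like sequence (values invented).
rms_db = np.array([70, 70, 69, 66, 62, 60, 62, 66, 69, 70, 70], dtype=float)
f1_hz = np.array([700, 690, 600, 450, 330, 300, 330, 450, 600, 690, 700], dtype=float)

v1, glide, v2 = slice(0, 3), slice(3, 8), slice(8, 11)
print("RMS amplitude reduction (dB):", dip_relative_to_flanks(rms_db, glide, v1, v2))
print("F1 reduction (Hz):", dip_relative_to_flanks(f1_hz, glide, v1, v2))
```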

    Adding expressiveness to unit selection speech synthesis and to numerical voice production

    Get PDF
    Speech is one of the most natural and direct forms of communication between human beings, as it codifies both a message and paralinguistic cues about the emotional state of the speaker, its mood, or its intention, thus becoming instrumental in pursuing a more natural Human Computer Interaction (HCI). In this context, the generation of expressive speech for the HCI output channel is a key element in the development of assistive technologies or personal assistants among other applications. Synthetic speech can be generated from recorded speech using corpus-based methods such as Unit-Selection (US), which can achieve high quality results but whose expressiveness is restricted to that available in the speech corpus. In order to improve the quality of the synthesis output, the current trend is to build ever larger speech databases, especially following the so-called End-to-End synthesis approach based on deep learning techniques. However, recording ad-hoc corpora for each and every desired expressive style can be extremely costly, or even unfeasible if the speaker is unable to properly perform the styles required for a given application (e.g., singing in the storytelling domain). Alternatively, new methods based on the physics of voice production have been developed in the last decade thanks to the increase in computing power. For instance, vowels or diphthongs can be obtained using the Finite Element Method (FEM) to simulate the propagation of acoustic waves through a 3D realistic vocal tract geometry obtained from Magnetic Resonance Imaging (MRI). However, since the main efforts in these numerical voice production methods have been focused on improving the modelling of the voice generation process, little attention has been paid to its expressiveness up to now. Furthermore, the collection of data for such simulations is very costly, besides requiring manual time-consuming postprocessing like that needed to extract 3D vocal tract geometries from MRI. The aim of the thesis is to add expressiveness into a system that generates neutral voice, without having to acquire expressive data from the original speaker. On the one hand, expressive capabilities are added to a Unit-Selection Text-to-Speech (US-TTS) system fed with a neutral speech corpus, to address specific and timely needs in the storytelling domain, such as for singing or in suspenseful situations. To this end, speech is parameterised using a harmonic-based model and subsequently transformed to the target expressive style according to an expert system. A first approach dealing with the synthesis of storytelling increasing suspense shows the viability of the proposal in terms of naturalness and storytelling quality. Singing capabilities are also added to the US-TTS system through the integration of Speech-to-Singing (STS) transformation modules into the TTS pipeline, and by incorporating an expressive prosody generation module that allows the US to select units closer to the target singing prosody obtained from the input score.
    This results in a Unit Selection based Text-to-Speech-and-Singing (US-TTS&S) synthesis framework that can generate both speech and singing from the same small corpus of neutral speech (~2.6 h). According to the objective results, the score-driven US strategy can reduce the pitch scaling factors required to produce singing from the selected spoken units, but its effectiveness is limited regarding the time-scale requirements due to the short duration of the spoken vowels. Results from the perceptual tests show that although the obtained naturalness is obviously far from that given by a professional singing synthesiser, the framework can address occasional singing needs for synthetic storytelling with a reasonable quality. The incorporation of expressiveness is also investigated in the 3D FEM-based numerical simulation of vowels through modifications of the glottal flow signals following a source-filter approach to voice production. These signals are generated using a Liljencrants-Fant (LF) model controlled with the glottal shape parameter Rd, which allows exploring the tense-lax continuum of phonation besides the spoken vocal range of fundamental frequency values, F0. The contribution of the glottal source to higher order modes in the FEM synthesis of cardinal vowels [a], [i] and [u] is analysed through the comparison of the High Frequency Energy (HFE) values obtained with realistic and simplified 3D geometries of the vocal tract. The simulations indicate that higher order modes are expected to be perceptually relevant according to reference values stated in the literature, particularly for tense phonations and/or high F0s. Conversely, vowels with a lax phonation and/or low F0s can result in inaudible HFE levels, especially if aspiration noise is not present in the glottal source. After this preliminary study, the excitation characteristics of happy and aggressive vowels from a Spanish parallel speech corpus are analysed with the aim of incorporating these tense-voice expressive styles into the numerical production of voice. To that effect, the GlottDNN vocoder is used to analyse F0 and spectral tilt variations associated with the glottal excitation on vowels [a]. These variations are mapped through the comparison with synthetic vowels into F0 and Rd values to simulate vowels resembling happy and aggressive styles. Results show that it is necessary to increase F0 and decrease Rd with respect to neutral speech, with larger variations for the happy than for the aggressive style, especially for the stressed [a] vowels. The results achieved in the conducted investigations validate the possibility of adding expressiveness to both corpus-based US-TTS synthesis and FEM-based numerical simulation of voice. Nevertheless, there is still room for improvement. For instance, the strategy applied to the numerical voice production could be improved by studying and developing inverse filtering approaches as well as incorporating modifications of the vocal tract, whereas the developed US-TTS&S framework could benefit from advances in voice transformation techniques including voice quality modifications, taking advantage of the experience gained in the numerical simulation of expressive vowels.
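    The direction of the reported F0/Rd adjustments can be summarised in a tiny sketch: raise F0 and lower Rd relative to neutral, with a larger shift for the happy style than for the aggressive one. The scaling factors below are illustrative placeholders, not the values estimated in the thesis.

```python
def expressive_glottal_settings(f0_neutral_hz, rd_neutral, style):
    """Map neutral glottal-source settings (F0, Rd) to an expressive style by
    raising F0 and lowering Rd (tenser phonation). Factors are hypothetical."""
    factors = {
        "happy":      {"f0": 1.5, "rd": 0.6},   # larger shift than aggressive
        "aggressive": {"f0": 1.2, "rd": 0.8},
    }
    f = factors[style]
    return f0_neutral_hz * f["f0"], rd_neutral * f["rd"]

print(expressive_glottal_settings(120.0, 1.0, "happy"))
print(expressive_glottal_settings(120.0, 1.0, "aggressive"))
```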

    On the quality of synthetic speech : evaluation and improvements

    Get PDF