
    The Effect of Timbre and Vibrato on Vocal Pitch Matching Accuracy

    Research has shown that singers are better able to match pitch when the target stimulus has a timbre close to that of their own voice. This study seeks to answer the following questions: 1. Do classically trained female singers more accurately match pitch when the target stimulus is more similar to their own timbre? 2. Does the ability to match pitch vary with increasing pitch? 3. Does the ability to match pitch differ depending on whether the target stimulus is produced with or without vibrato? 4. Are mezzo-sopranos less accurate than sopranos? Stimuli: Source signals were synthesized with a source spectral slope of -12 dB/octave, with and without vibrato, at each of the pitches C4, B4, and F5. These source signals were filtered using 5 formant patterns (A-E) of the vowel /a/, yielding a total of 30 stimuli (5 formant patterns × 3 pitches × 2 vibrato conditions). Procedure: Ten sopranos and 10 mezzo-sopranos with at least 3 years of individual voice training were recruited from the University of Tennessee School of Music and the Knoxville Opera Company. Each singer attempted to vocally match the pitch of all 30 stimuli, presented twice in random order. Analysis and results: Pitch matching accuracy was measured as the difference in cents between the target and the experimental productions at two locations: (1) the pre-phonatory set and (2) the mid-point of the vowel. Accuracy of pitch matching was compared across vibrato and non-vibrato conditions. Results indicated no significant effect of formant pattern on pitch matching accuracy. With increasing pitch from C4 to F5, pitch matching accuracy increased at the mid-point of the vowel but not at the pre-phonatory set. Mezzo-sopranos moved towards being in tune from the pre-phonatory set to the mid-point of the vowel. Sopranos at C4, however, sang closer to being in tune at the pre-phonatory set but lowered the pitch at the mid-point of the vowel. The presence or absence of vibrato did not affect pitch matching accuracy; an interesting finding of the study, however, was that singers attempted to match the timbre of stimuli with vibrato. Results are discussed in terms of interactions between pitch and timbre from auditory-perceptual and physiological points of view, and in terms of how current theories of pitch perception relate to this phenomenon. Neither physiological nor auditory-perceptual mechanisms provide a complete explanation for the results obtained in the study. From a perceptual point of view, the interaction between pitch and timbre appears to be complex, as both spectral and temporal theories are limited in explaining it. Possible explanations for the phenomenon of timbre matching are also provided.
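
    The cents measure itself is a simple log-ratio of frequencies. As a quick illustration, here is a minimal Python sketch (the function name and example values are illustrative, not taken from the study):

```python
import numpy as np

def cents_deviation(f_produced: float, f_target: float) -> float:
    """Signed deviation of a produced pitch from a target, in cents.

    100 cents = one equal-tempered semitone; 1200 cents = one octave.
    Positive values mean the production is sharp, negative means flat.
    """
    return 1200.0 * np.log2(f_produced / f_target)

# Example: producing 452 Hz against a 440 Hz (A4) target
# is about +46.6 cents sharp, i.e., noticeably out of tune.
print(cents_deviation(452.0, 440.0))
```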

    Effects of increased levels of androgens on voice and vocal folds in women with congenital adrenal hyperplasia and female-to-male transsexual persons

    Voice virilization in women may occur due to increased levels of androgens. Women with congenital adrenal hyperplasia (CAH) are at risk for voice virilization due to an enzyme deficiency that causes increased production of androgens and a lack of cortisol. Female-to-male transsexual persons, trans men, are treated with testosterone, with virilization of the voice as a desired outcome. The overall aim of the project was to provide new knowledge on how the female voice and vocal folds are affected by endogenous and exogenous androgen exposure, and on the consequences that virilization of the voice may have in a patient's life. Study I: Thirty-eight women with CAH and 24 age-matched controls participated. Their voices were recorded and acoustically and perceptually analyzed. They answered questions about subjective voice problems. Endocrine data were obtained from medical records. The results showed that women with CAH spoke with a significantly lower mean fundamental frequency (F0), had darker voice quality, and rated higher on the statement "my voice is a problem in my daily life" than the controls. Voice virilization was associated with late diagnosis or problems with glucocorticoid medication, but not with severity of mutation. Proper treatment with glucocorticoids, avoiding long periods of increased androgen levels, is important to prevent irreversible voice virilization. Study II: Forty-two women with CAH and 43 age-matched controls filled out the Voice Handicap Index (VHI) and answered questions about voice function related to virilization. Endocrine data were obtained from medical records. Women with CAH scored significantly higher than the controls on the VHI when the results were divided into groups by voice handicap: none/mild, moderate, and severe. A virilized voice in women with CAH correlated with less voice satisfaction. Seven percent of the women with CAH had voice problems related to voice virilization. Voice virilization was associated with long periods of under-treatment with glucocorticoids and higher bone mineral density, confirming the results and conclusions from Study I. It is recommended that women with CAH who experience voice problems be referred for voice assessment. Study III: Four women with CAH and virilized voices, and 5 female and 4 male controls, participated. A procedure for magnetic resonance imaging of the vocal folds was developed. The results showed that the cross-sectional area of the thyroarytenoid (TA) muscle was larger in women with virilized voices than in female controls, and smaller than in males. The larger TA area correlated with lower F0 values obtained from acoustic analysis of habitual speech range profiles. Thus, the anatomical explanation for voice virilization may be a larger cross-sectional area of the TA muscle, suggesting androgen receptors in the vocal folds. These findings need to be confirmed in a larger study. Study IV: Fifty trans men participated in a longitudinal study. Voice assessments, performed before testosterone treatment started and regularly up to 24 months, included audio recordings of speech and voice range profiles and self-ratings of voice function. A significant lowering of mean F0 was found after 3 months and after 6 months, and up to 12 months, when group data were congruent with reference data for males. No correlations were found between levels of testosterone, EVF, Hb, SHBG, or LH and F0 values. Lower F0 values correlated with greater satisfaction with the voice. A quarter of the participants had received voice therapy for problems associated with virilization, such as vocal fatigue or an unstable voice. Voice assessment during testosterone treatment is important to detect the potentially large subgroup of trans men who need voice therapy.

    A review of differentiable digital signal processing for music and speech synthesis

    The term “differentiable digital signal processing” describes a family of techniques in which loss function gradients are backpropagated through digital signal processors, facilitating their integration into neural networks. This article surveys the literature on differentiable audio signal processing, focusing on its use in music and speech synthesis. We catalogue applications to tasks including music performance rendering, sound matching, and voice transformation, discussing the motivations for and implications of the use of this methodology. This is accompanied by an overview of digital signal processing operations that have been implemented differentiably, which is further supported by a web book containing practical advice on differentiable synthesiser programming (https://intro2ddsp.github.io/). Finally, we highlight open challenges, including optimisation pathologies, robustness to real-world conditions, and design trade-offs, and discuss directions for future research
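
    As a minimal illustration of the core idea, the sketch below (PyTorch; a toy example, not the API of any surveyed library) fits an oscillator gain by backpropagating a waveform loss through the DSP operation itself:

```python
import torch

# Toy sketch of "differentiable DSP": a signal processor (a sinusoidal
# oscillator) implemented in an autodiff framework, so a loss gradient
# can be backpropagated through it to fit its parameters.

sr = 16000
t = torch.arange(sr, dtype=torch.float32) / sr

def oscillator(freq, amp):
    return amp * torch.sin(2.0 * torch.pi * freq * t)

target = oscillator(torch.tensor(440.0), torch.tensor(0.5))

# Fit the oscillator gain by gradient descent through the oscillator.
# (Fitting frequency this way runs into the optimisation pathologies
# the article mentions: a waveform MSE loss is highly non-convex in
# frequency, which is why spectral losses are used in practice.)
amp = torch.tensor(0.05, requires_grad=True)
opt = torch.optim.Adam([amp], lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss = torch.mean((oscillator(torch.tensor(440.0), amp) - target) ** 2)
    loss.backward()          # gradient flows *through* the DSP operation
    opt.step()

print(round(amp.item(), 3))  # approaches 0.5
```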

    Singing information processing: techniques and applications

    The singing voice is an essential component of music in all cultures of the world, as it is an incredibly natural form of musical expression. Consequently, automatic singing voice processing has a great impact from industrial, cultural, and scientific perspectives. In this context, this thesis contributes a varied set of techniques and applications related to singing voice processing, together with a review of the associated state of the art in each case. First, several of the best-known pitch estimators were compared for the query-by-humming use case. The results show that Boersma (1993) (with a non-obvious parameter setting) and Mauch (2014) perform very well in that use case, given the smoothness of the extracted pitch contours. In addition, a novel singing transcription system based on a hysteresis process defined in time and frequency is proposed, as well as a Matlab tool for singing voice evaluation. The interest of the proposed method is that it achieves error rates close to the state of the art with a very simple approach. The proposed evaluation tool, in turn, is a useful resource for better defining the problem and for better evaluating the solutions proposed by future researchers. This thesis also presents a method for automatic assessment of vocal performances. It uses dynamic time warping to align the user's performance with a reference, thereby providing scores for intonation and rhythm accuracy. The system evaluation shows a high correlation between the scores given by the system and those annotated by a group of expert musicians. Furthermore, a method for realistic loudness modification of the singing voice is presented. This transformation is based on a parametric model of the spectral envelope, and substantially improves perceived realism compared with commercial software such as Melodyne or Vocaloid. The drawback of the proposed approach is that it requires manual intervention, but the results obtained yield important conclusions towards automatic loudness modification with realistic results. Finally, a method for correcting dissonances in isolated chords is proposed. It is based on multiple-F0 analysis and a frequency shift of the sinusoidal components. The evaluation, carried out by a group of trained musicians, shows a clear increase in perceived consonance after the proposed transformation.
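
    The alignment at the heart of the performance-assessment method can be illustrated with the textbook dynamic time warping recurrence; the following Python sketch (illustrative, not the thesis code) aligns two pitch contours given as MIDI note numbers:

```python
import numpy as np

# Minimal dynamic time warping between a reference pitch contour and a
# performed one, as used to align a sung performance with its reference
# before scoring intonation and rhythm.

def dtw(ref: np.ndarray, perf: np.ndarray) -> float:
    n, m = len(ref), len(perf)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(ref[i - 1] - perf[j - 1])  # semitone distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]  # total alignment cost; lower = closer performance

ref = np.array([60.0, 60.0, 62.0, 64.0])        # reference MIDI notes
perf = np.array([60.3, 60.2, 61.8, 61.9, 64.1])  # slightly off, slower
print(dtw(ref, perf))
```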

    The effects of three singer gestures on acoustic and perceptual measures of singing in solo and choral contexts

    The purpose of this two-part investigation was to assess the potential effects of three singer gestures (a low, circular arm gesture; an arched hand gesture; and a pointing gesture) on performances of choral singers (N = 31; Experiment 1) and solo singers (N = 35; Experiment 2). Participants sang the melody of three familiar songs from memory on the neutral syllable "m/i/." Songs were chosen for similarities of range, tessitura, and ascending intervallic leaps. Each song was sung seven times: a baseline (without singer gesture), five iterations paired with a singer gesture, and a posttest (without singer gesture). Experiment 1 measured acoustic (long-term average spectra) and perceptual (pitch analysis, expert panel ratings, and a participant perceptual questionnaire) differences in choral sound across conditions. Results indicated a significant increase in mean signal amplitude in sung gestural iterations with the low, circular gesture and the pointing gesture. Intonation differences were significant between baseline and the low, circular gesture; between baseline and posttest for the pointing gesture; and between the arched hand gesture and posttest. Expert panel ratings were highest during gestural conditions across song selections, and the majority of participants gave positive comments regarding the use of gesture during choral singing. Experiment 2 measured acoustic (F0, amplitude, formant frequency) and perceptual (expert panel ratings and a participant perceptual questionnaire) differences in solo singers. Major findings indicated acoustic changes in intonation, timbre, and relative amplitude. Solo singers were more in tune when singing with gestures. Both the low, circular and the arched hand gestures changed singer timbre, as indicated by lowered formant frequencies for the majority of participants. When performing with the low, circular and the pointing gestures, singers sang with increased amplitude, whereas the arched hand gesture led to decreased amplitude. Expert ratings were highest for the posttests of the low, circular and arched hand gestures, and for the gestural iterations of the pointing gesture. The majority of participant comments related to intonation and timbre when using gestures. Video recording analyses from both performance contexts indicated that participants mastered the gestures within the first three iterations. Results are discussed in terms of singing pedagogy, limitations of the study, and suggestions for further research.

    Vocal emotions on the brain: the role of acoustic parameters and musicality

    The human voice is a powerful transmitter of emotions. This dissertation addresses three main gaps in the field of vocal emotion perception. The first is the quantification of the relative contributions of fundamental frequency (F0) and timbre cues to the perception of different emotions and their associated electrophysiological correlates. Using parameter-specific voice morphing, the results show that both F0 and timbre carry unique information that allows emotional inferences, although F0 seems to be relatively more important overall. The electrophysiological data revealed F0- and timbre-specific modulations in several ERP components, such as the P200 and the N400. Second, it was explored how musicality affects the processing of emotional voice cues, by providing a review of the literature linking musicality to emotion perception and subsequently showing that musicians have an advantage in vocal emotion perception compared to non-musicians. The present data offer original insight into the special role of pitch cues: musicians outperformed non-musicians when emotions were expressed by the pitch contour only, but not when they were expressed by vocal timbre. Although the electrophysiological patterns were less conclusive, they imply that musicality may modulate brain responses to vocal emotions. Third, this work provides a critical reflection on parameter-specific voice morphing and its suitability for studying the processing of vocal emotions. Distortions in voice naturalness resulting from extreme acoustic manipulations were identified as one of the major threats to the ecological validity of the stimulus material produced with this technique. However, the results suggest that while voice morphing does affect the perceived naturalness of stimuli, behavioral measures of emotion perception are remarkably robust against these distortions. Thus, the present data advocate parameter-specific voice morphing as a valid tool for vocal emotion research.

    Long-term average spectral characteristics of different Cantonese opera singing styles

    Thesis (B.Sc.), University of Hong Kong, 2010. Cantonese opera is a valuable cultural heritage popular in China. Its basic singing styles are zi hou, ping hou, and da hou. However, objective parameters for measuring voice quality in Cantonese opera singing are lacking. The current study examined the sound quality associated with the zi hou, ping hou, and da hou singing styles in comparison to conversational voice by means of long-term average spectra (LTAS). Continuous singing and speech samples were obtained from professional Cantonese opera singers and naïve speakers of Cantonese. All singing and speech samples were digitized at 44 kHz and 16 bits/sample. Parameters including the first spectral peak (FSP), mean spectral energy (MSE), spectral tilt (ST), and high-frequency energy (HFE) were derived from the LTAS contours using Praat. Different singing styles exhibited different LTAS contours and were associated with a significantly higher ST value than conversational voice, implying a difference in resonance. Further investigation of the phonatory mechanism is indicated.
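
    An LTAS and a simple spectral tilt measure of the kind derived here with Praat can be approximated in a few lines of Python; in this sketch the band edges and FFT size are illustrative assumptions, not the values used in the study:

```python
import numpy as np
from scipy.signal import welch

# Long-term average spectrum (LTAS) via Welch averaging, plus a simple
# spectral tilt estimate as the dB difference between a low and a high
# frequency band.

def ltas_db(x, sr, nfft=4096):
    f, pxx = welch(x, fs=sr, nperseg=nfft)
    return f, 10.0 * np.log10(pxx + 1e-12)

def spectral_tilt(f, ltas, lo=(0, 1000), hi=(1000, 5000)):
    low = ltas[(f >= lo[0]) & (f < lo[1])].mean()
    high = ltas[(f >= hi[0]) & (f < hi[1])].mean()
    return low - high   # dB; larger value = steeper downward tilt

sr = 44000                    # samples were digitized at 44 kHz
x = np.random.randn(sr * 5)   # stand-in for a 5 s singing sample
f, L = ltas_db(x, sr)
print(spectral_tilt(f, L))
```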

    Adding expressiveness to unit selection speech synthesis and to numerical voice production

    Speech is one of the most natural and direct forms of communication between human beings, as it codifies both a message and paralinguistic cues about the speaker's emotional state, mood, and intention, thus becoming instrumental in pursuing a more natural Human-Computer Interaction (HCI). In this context, the generation of expressive speech for the HCI output channel is a key element in the development of assistive technologies and personal assistants, among other applications. Synthetic speech can be generated from recorded speech using corpus-based methods such as Unit Selection (US), which can achieve high-quality results but whose expressiveness is restricted to that available in the speech corpus. In order to improve the quality of the synthesis output, the current trend is to build ever larger speech databases, especially following the so-called End-to-End synthesis approach based on deep learning techniques. However, recording ad-hoc corpora for each and every desired expressive style can be extremely costly, or even unfeasible if the speaker is unable to properly perform the styles required for a given application (e.g., singing in the storytelling domain). Alternatively, new methods based on the physics of voice production have been developed in the last decade thanks to the increase in computing power. For instance, vowels or diphthongs can be obtained using the Finite Element Method (FEM) to simulate the propagation of acoustic waves through a realistic 3D vocal tract geometry obtained from Magnetic Resonance Imaging (MRI). However, since the main efforts in these numerical voice production methods have been focused on improving the modelling of the voice generation process, little attention has been paid to expressiveness up to now. Furthermore, the collection of data for such simulations is very costly, besides requiring time-consuming manual postprocessing such as that needed to extract 3D vocal tract geometries from MRI. The aim of the thesis is to add expressiveness to a system that generates neutral voice, without having to acquire expressive data from the original speaker. On the one hand, expressive capabilities are added to a Unit-Selection Text-to-Speech (US-TTS) system fed with a neutral speech corpus, to address specific and timely needs in the storytelling domain, such as singing or suspenseful situations. To this end, speech is parameterised using a harmonic-based model and subsequently transformed to the target expressive style according to an expert system. A first approach dealing with the synthesis of increasing suspense in storytelling shows the viability of the proposal in terms of naturalness and storytelling quality. Singing capabilities are also added to the US-TTS system through the integration of Speech-to-Singing (STS) transformation modules into the TTS pipeline, and by incorporating an expressive prosody generation module that allows the US module to select units closer to the target singing prosody obtained from the input score. This results in a Unit-Selection-based Text-to-Speech-and-Singing (US-TTS&S) synthesis framework that can generate both speech and singing from the same small corpus of neutral speech (~2.6 h). According to the objective results, the score-driven US strategy can reduce the pitch-scaling factors required to produce singing from the selected spoken units, but its effectiveness is limited regarding the time-scale requirements due to the short duration of the spoken vowels. Results from the perceptual tests show that although the obtained naturalness is obviously far from that given by a professional singing synthesiser, the framework can address occasional singing needs for synthetic storytelling with a reasonable quality. The incorporation of expressiveness is also investigated in the 3D FEM-based numerical simulation of vowels, through modifications of the glottal flow signals following a source-filter approach to voice production. These signals are generated using a Liljencrants-Fant (LF) model controlled with the glottal shape parameter Rd, which allows exploring the tense-lax continuum of phonation as well as the spoken range of fundamental frequency (F0) values. The contribution of the glottal source to higher-order modes in the FEM synthesis of the cardinal vowels [a], [i], and [u] is analysed through the comparison of the High Frequency Energy (HFE) values obtained with realistic and simplified 3D geometries of the vocal tract. The simulations indicate that higher-order modes are expected to be perceptually relevant according to reference values stated in the literature, particularly for tense phonations and/or high F0s. Conversely, vowels with a lax phonation and/or low F0s can result in inaudible HFE levels, especially if aspiration noise is not present in the glottal source. After this preliminary study, the excitation characteristics of happy and aggressive vowels from a Spanish parallel speech corpus are analysed with the aim of incorporating these tense-voice expressive styles into the numerical production of voice. To that effect, the GlottDNN vocoder is used to analyse F0 and spectral tilt variations associated with the glottal excitation in [a] vowels. By comparison with synthetic vowels, these variations are mapped onto F0 and Rd values in order to simulate vowels resembling the happy and aggressive styles. Results show that it is necessary to increase F0 and decrease Rd with respect to neutral speech, with larger variations for the happy than for the aggressive style, especially for the stressed [a] vowels. The results achieved in the conducted investigations validate the possibility of adding expressiveness to both corpus-based US-TTS synthesis and FEM-based numerical simulation of voice. Nevertheless, there is still room for improvement. For instance, the strategy applied to numerical voice production could be improved by studying and developing inverse filtering approaches as well as by incorporating modifications of the vocal tract, whereas the developed US-TTS&S framework could benefit from advances in voice transformation techniques, including voice quality modifications, taking advantage of the experience gained in the numerical simulation of expressive vowels.
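
    The single-parameter Rd control can be illustrated with Fant's (1995) regression from Rd to the LF timing ratios (Ra, Rk, Rg); the Python sketch below shows only this mapping (the full LF waveform synthesis, which solves for the exponential growth and return-phase constants, is omitted):

```python
import numpy as np

# Fant's (1995) regression from the waveshape parameter Rd to the
# LF-model timing ratios, from which the pulse instants tp, te, ta
# follow. Low Rd ~ tense phonation, high Rd ~ lax phonation.

def lf_params_from_rd(rd: float, f0: float):
    ra = (-1.0 + 4.8 * rd) / 100.0
    rk = (22.4 + 11.8 * rd) / 100.0
    rg = rk * (0.5 + 1.2 * rk) / (4.0 * (0.11 * rd - ra * (0.5 + 1.2 * rk)))
    t0 = 1.0 / f0                  # fundamental period in seconds
    tp = t0 / (2.0 * rg)           # instant of maximum glottal flow
    te = tp * (1.0 + rk)           # instant of main excitation
    ta = ra * t0                   # effective return-phase duration
    return tp, te, ta

# Sweep the tense-lax continuum at a spoken F0 of 120 Hz: note the
# much longer return phase (ta) for lax phonation.
for rd, label in [(0.3, "tense"), (1.0, "modal"), (2.5, "lax")]:
    tp, te, ta = lf_params_from_rd(rd, f0=120.0)
    print(f"{label}: tp={tp*1e3:.2f} ms  te={te*1e3:.2f} ms  ta={ta*1e3:.3f} ms")
```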

    Singing voice resynthesis using concatenative-based techniques

    Doctoral thesis. Informatics Engineering. Faculty of Engineering, Universidade do Porto. 201