12,070 research outputs found

Controllable Singing Voice Synthesis Using Conditional Autoregressive Neural Networks

Author: 이주헌
Publication venue: Seoul National University Graduate School
Publication date: 01/08/2022

Doctoral thesis -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Intelligent Information Convergence, August 2022. 이교구.
Singing voice synthesis aims to synthesize a natural singing voice from given input information. A successful singing synthesis system is important not only because it can significantly reduce the cost of the music production process, but also because it helps creators reflect their intentions more easily and conveniently. However, designing such a system poses three challenges: 1) it should be possible to independently control the various elements that make up the singing; 2) it must be possible to generate high-quality sound sources; and 3) it is difficult to secure sufficient training data. To deal with these problems, we first turned to source-filter theory, a representative speech production modeling technique. We sought to secure training data efficiency and controllability at the same time by modeling a singing voice as a convolution of the source, which carries the pitch information, and the filter, which carries the pronunciation information, and by designing a structure that can model each independently. In addition, we used a conditional autoregressive deep neural network to effectively model sequential data when conditional inputs such as pronunciation, pitch, and speaker are given. So that the entire framework generates a high-quality sound source with a distribution closer to that of a real singing voice, adversarial training was applied to the training process. Finally, we applied a self-supervised style modeling technique to model detailed, unlabeled musical expressions. We confirmed that the proposed model can flexibly control various elements such as pronunciation, pitch, timbre, singing style, and musical expression, while synthesizing high-quality singing that is difficult to distinguish from ground-truth singing. Furthermore, we proposed a generation and modification framework designed for the actual music production process, and confirmed that it can be applied to expand the limits of the creator's imagination, for example through new voice design and cross-generation.
Contents:
1 Introduction: 1.1 Motivation; 1.2 Problems in singing voice synthesis; 1.3 Task of interest (1.3.1 Single-singer SVS; 1.3.2 Multi-singer SVS; 1.3.3 Expressive SVS); 1.4 Contribution
2 Background: 2.1 Singing voice; 2.2 Source-filter theory; 2.3 Autoregressive model; 2.4 Related works (2.4.1 Speech synthesis; 2.4.2 Singing voice synthesis)
3 Adversarially Trained End-to-end Korean Singing Voice Synthesis System: 3.1 Introduction; 3.2 Related work; 3.3 Proposed method (3.3.1 Input representation; 3.3.2 Mel-synthesis network; 3.3.3 Super-resolution network); 3.4 Experiments (3.4.1 Dataset; 3.4.2 Training; 3.4.3 Evaluation; 3.4.4 Analysis on generated spectrogram); 3.5 Discussion (3.5.1 Limitations of input representation; 3.5.2 Advantages of using super-resolution network); 3.6 Conclusion
4 Disentangling Timbre and Singing Style with multi-singer Singing Synthesis System: 4.1 Introduction; 4.2 Related works (4.2.1 Multi-singer SVS system); 4.3 Proposed Method (4.3.1 Singer identity encoder; 4.3.2 Disentangling timbre & singing style); 4.4 Experiment (4.4.1 Dataset and preprocessing; 4.4.2 Training & inference; 4.4.3 Analysis on generated spectrogram; 4.4.4 Listening test; 4.4.5 Timbre & style classification test); 4.5 Discussion (4.5.1 Query audio selection strategy for singer identity encoder; 4.5.2 Few-shot adaptation); 4.6 Conclusion
5 Expressive Singing Synthesis Using Local Style Token and Dual-path Pitch Encoder: 5.1 Introduction; 5.2 Related work; 5.3 Proposed method (5.3.1 Local style token module; 5.3.2 Dual-path pitch encoder; 5.3.3 Bandwidth extension vocoder); 5.4 Experiment (5.4.1 Dataset; 5.4.2 Training; 5.4.3 Qualitative evaluation; 5.4.4 Dual-path reconstruction analysis; 5.4.5 Qualitative analysis); 5.5 Discussion (5.5.1 Difference between midi pitch and f0; 5.5.2 Considerations for use in the actual music production process); 5.6 Conclusion
6 Conclusion: 6.1 Thesis summary; 6.2 Limitations and future work (6.2.1 Improvements to a faster and robust system; 6.2.2 Explainable and intuitive controllability; 6.2.3 Extensions to common speech synthesis tools; 6.2.4 Towards a collaborative and creative tool)
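
The source-filter idea in the thesis abstract above (a voice modeled as a source carrying pitch, convolved with a filter carrying pronunciation) can be sketched in a few lines. This is only a classical signal-processing illustration, not the thesis's neural model: the function names, the single decaying-sinusoid "formant" filter, and all parameter values are assumptions chosen for the example.

```python
import numpy as np


def pulse_train(f0_hz: float, duration_s: float, sample_rate: int) -> np.ndarray:
    """Source: one unit impulse per pitch period (carries only pitch)."""
    n_samples = int(duration_s * sample_rate)
    period = int(round(sample_rate / f0_hz))  # pitch period in samples
    source = np.zeros(n_samples)
    source[::period] = 1.0
    return source


def resonator_impulse_response(formant_hz: float, bandwidth_hz: float,
                               sample_rate: int, length: int) -> np.ndarray:
    """Filter: a decaying sinusoid standing in for one vocal-tract resonance."""
    t = np.arange(length) / sample_rate
    return np.exp(-np.pi * bandwidth_hz * t) * np.sin(2 * np.pi * formant_hz * t)


def synthesize(f0_hz: float, formant_hz: float, duration_s: float = 0.5,
               sample_rate: int = 16000) -> np.ndarray:
    """Voice = source convolved with filter, trimmed to the source length."""
    source = pulse_train(f0_hz, duration_s, sample_rate)
    filt = resonator_impulse_response(formant_hz, 100.0, sample_rate, 512)
    return np.convolve(source, filt)[: len(source)]


if __name__ == "__main__":
    # Same filter (pronunciation), two different sources (pitches).
    low = synthesize(f0_hz=200.0, formant_hz=700.0)
    high = synthesize(f0_hz=400.0, formant_hz=700.0)
    print(low.shape, high.shape)
```

Because the two factors are separate arguments, pitch can be changed without touching the filter and vice versa, which is the kind of independent control the abstract describes.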

SNU Open Repository and Archive

Mandarin Singing Voice Synthesis Based on Harmonic Plus Noise Model and Singing Expression Analysis

Author: Gu Hung-Yan
Wang Hsin-Min
Wang Ju-Chiang
Publication date: 15/02/2015

The purpose of this study is to investigate how humans interpret musical scores expressively, and then to design machines that sing like humans. We consider six factors that have a strong influence on the expression of human singing. The factors are related to the acoustic, phonetic, and musical features of a real singing signal. Given real singing voices recorded following the MIDI scores and lyrics, our analysis module can extract the expression parameters from the real singing signals semi-automatically. The expression parameters are used to control the singing voice synthesis (SVS) system for Mandarin Chinese, which is based on the harmonic plus noise model (HNM). The results of perceptual experiments show that integrating the expression factors into the SVS system yields a notable improvement in perceptual naturalness, clearness, and expressiveness. By one-to-one mapping of the real singing signal and expression controls to the synthesizer, our SVS system can simulate the interpretation of a real singer with the timbre of a speaker. Comment: 8 pages, technical report

arXiv.org e-Print Archive

CiteSeerX

Auditory-Motor Adaptation to Frequency-Altered Auditory Feedback Occurs When Participants Ignore Feedback

Author: Hawco Colin
Jones Jeffery A.
Keough Dwayne Nicholas
Publication venue: Scholars Commons @ Laurier
Publication date: 01/03/2013

Background: Auditory feedback is important for accurate control of voice fundamental frequency (F0). The purpose of this study was to address whether task instructions could influence the compensatory responding and sensorimotor adaptation that have previously been found when participants are presented with a series of frequency-altered feedback (FAF) trials. Trained singers and musically untrained participants (nonsingers) were informed that their auditory feedback would be manipulated in pitch while they sang the target vowel /ɑ/. Participants were instructed to either ‘compensate’ for or ‘ignore’ the changes in auditory feedback. Whole-utterance auditory feedback manipulations were either presented gradually (‘ramp’) in -2 cent increments down to -100 cents (1 semitone) or shifted down suddenly (‘constant’) by 1 semitone. Results: Singers and nonsingers could not suppress their compensatory responses to FAF, nor could they reduce the sensorimotor adaptation observed during both the ramp and constant FAF trials. Conclusions: Compared to previous research, these data suggest that musical training is effective in suppressing compensatory responses only when FAF occurs after vocal onset (500-2500 ms). Moreover, our data suggest that compensation and adaptation are automatic and are influenced little by conscious control.
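
The pitch shifts in this abstract are given in cents, where 100 cents equal one equal-tempered semitone and 1200 cents equal one octave. As a rough worked example of that arithmetic (the function names and the 220 Hz reference pitch are assumptions for illustration, not details from the study):

```python
def cents_to_ratio(cents: float) -> float:
    """Frequency ratio for a shift in cents (1200 cents = one octave)."""
    return 2.0 ** (cents / 1200.0)


def ramp_schedule(step_cents: int = -2, final_cents: int = -100) -> list[int]:
    """Cumulative shift per trial for a ramp in fixed steps down to final_cents."""
    return list(range(step_cents, final_cents + step_cents, step_cents))


if __name__ == "__main__":
    ramp = ramp_schedule()
    # A -2 cent step repeated until -100 cents takes 50 steps.
    print(len(ramp), ramp[0], ramp[-1])
    # Feedback frequency heard for a hypothetical 220 Hz sung note at full shift.
    print(220.0 * cents_to_ratio(-100))
```

A shift of -100 cents multiplies the frequency by 2^(-1/12), roughly 0.944, so each -2 cent step changes the heard pitch by only about 0.12 percent.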

Wilfrid Laurier University

A modulation property of time-frequency derivatives of filtered phase and its application to aperiodicity and fo estimation

Author: Banno Hideki
Kawahara Hideki
Morise Masanori
Sakakibara Ken-Ichi
Toda Tomoki
Publication venue: 'International Speech Communication Association'
Publication date: 09/06/2017

We introduce a simple and linear SNR (strictly speaking, periodic-to-random power ratio) estimator (0 dB to 80 dB without additional calibration or linearization) for providing reliable descriptions of aperiodicity in speech corpora. The main idea of this method is to estimate the background random noise level without directly extracting the background noise. The proposed method is applicable to a wide variety of time windowing functions with very low sidelobe levels. The estimate combines the frequency derivative and the time-frequency derivative of the mapping from filter center frequency to the output instantaneous frequency. This procedure can replace the periodicity detection and aperiodicity estimation subsystems of the recently introduced open-source YANG vocoder. Source code of a MATLAB implementation of this method will also be open sourced. Comment: 8 pages, 9 figures; submitted and accepted in Interspeech201
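
To make the quantity in this abstract concrete, the sketch below computes a periodic-to-random power ratio in dB for a synthetic signal whose two components are known. It is not the paper's estimator, which works without separating the noise; it only shows what value such an estimator is meant to recover. All names and parameter values are assumptions for the example.

```python
import numpy as np


def power_ratio_db(periodic: np.ndarray, random_part: np.ndarray) -> float:
    """Periodic-to-random power ratio in dB, given both components."""
    periodic_power = np.mean(periodic ** 2)
    random_power = np.mean(random_part ** 2)
    return 10.0 * np.log10(periodic_power / random_power)


def scale_noise_to_ratio(periodic: np.ndarray, noise: np.ndarray,
                         target_db: float) -> np.ndarray:
    """Rescale noise so that the periodic-to-random ratio equals target_db."""
    periodic_power = np.mean(periodic ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(periodic_power / (noise_power * 10.0 ** (target_db / 10.0)))
    return gain * noise


if __name__ == "__main__":
    sample_rate = 16000
    t = np.arange(sample_rate) / sample_rate
    periodic = np.sin(2 * np.pi * 120.0 * t)        # hypothetical 120 Hz tone
    rng = np.random.default_rng(0)
    noise = scale_noise_to_ratio(periodic, rng.standard_normal(t.size), 40.0)
    mixture = periodic + noise                       # what an estimator would see
    print(power_ratio_db(periodic, noise))
```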

arXiv.org e-Print Archive

Crossref

PYIN: A FUNDAMENTAL FREQUENCY ESTIMATOR USING PROBABILISTIC THRESHOLD DISTRIBUTIONS

Author: Dixon S
IEEE
Mauch M
Publication date: 01/01/2014

© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Queen Mary Research Online

Pitch modification techniques for sampled voice

Author: Brooks Michael
Publication date: 27/06/2018

The Australian National University

Adding expressiveness to unit selection speech synthesis and to numerical voice production

Author: Freixes Guerreiro Marc
Publication venue: Blanquerna - Universitat Ramon Llull
Publication date: 18/06/2021

Speech is one of the most natural and direct forms of communication between human beings, as it encodes both a message and paralinguistic cues about the speaker's emotional state, mood, or intention, and it is therefore instrumental in pursuing a more natural Human Computer Interaction (HCI). In this context, the generation of expressive speech for the HCI output channel is a key element in the development of assistive technologies or personal assistants, among other applications. Synthetic speech can be generated from recorded speech using corpus-based methods such as Unit-Selection (US), which can achieve high-quality results but whose expressiveness is restricted to that available in the speech corpus. To improve the quality of the synthesis output, the current trend is to build ever larger speech databases, especially following the so-called End-to-End synthesis approach based on deep learning techniques. However, recording ad-hoc corpora for each desired expressive style can be extremely costly, or even unfeasible if the speaker is unable to properly perform the styles required for a given application (e.g., singing in the storytelling domain). Alternatively, new methods based on the physics of voice production have been developed in the last decade thanks to the increase in computing power. For instance, vowels or diphthongs can be obtained using the Finite Element Method (FEM) to simulate the propagation of acoustic waves through a realistic 3D vocal tract geometry obtained from Magnetic Resonance Imaging (MRI). However, since the main efforts in these numerical voice production methods have focused on improving the modelling of the voice generation process, little attention has been paid to their expressiveness so far. Furthermore, collecting data for such simulations is very costly and requires time-consuming manual postprocessing, such as that needed to extract 3D vocal tract geometries from MRI.
The aim of the thesis is to add expressiveness to a system that generates neutral voice, without having to acquire expressive data from the original speaker. On the one hand, expressive capabilities are added to a Unit-Selection Text-to-Speech (US-TTS) system fed with a neutral speech corpus, to address specific and timely needs in the storytelling domain, such as singing or suspenseful situations. To this end, speech is parameterised using a harmonic-based model and subsequently transformed to the target expressive style according to an expert system. A first approach, dealing with the synthesis of increasing suspense in storytelling, shows the viability of the proposal in terms of naturalness and storytelling quality. Singing capabilities are also added to the US-TTS system by integrating Speech-to-Singing (STS) transformation modules into the TTS pipeline, and by incorporating an expressive prosody generation module that allows the US module to select units closer to the target singing prosody obtained from the input score. This results in a Unit Selection based Text-to-Speech-and-Singing (US-TTS&S) synthesis framework that can generate both speech and singing from the same small neutral speech corpus (~2.6 h). According to the objective results, the score-driven US strategy can reduce the pitch scaling factors required to produce singing from the selected spoken units, but its effectiveness is limited regarding the time-scale requirements because of the short duration of the spoken vowels. Results from the perceptual tests show that, although the naturalness obtained is obviously far from that of a professional singing synthesiser, the framework can address occasional singing needs for synthetic storytelling with reasonable quality.
On the other hand, the incorporation of expressiveness is also investigated in the 3D FEM-based numerical simulation of vowels, through modifications of the glottal flow signals following a source-filter approach to voice production. These signals are generated using a Liljencrants-Fant (LF) model controlled with the glottal shape parameter Rd, which allows exploring the tense-lax continuum of phonation in addition to the spoken vocal range of fundamental frequency values, F0. The contribution of the glottal source to higher order modes in the FEM synthesis of the cardinal vowels [a], [i] and [u] is analysed by comparing the High Frequency Energy (HFE) values obtained with realistic and simplified 3D geometries of the vocal tract. The simulations indicate that higher order modes are expected to be perceptually relevant according to reference values stated in the literature, particularly for tense phonations and/or high F0s. Conversely, vowels with a lax phonation and/or low F0s can result in inaudible HFE levels, especially if aspiration noise is not present in the glottal source. After this preliminary study, the excitation characteristics of happy and aggressive vowels from a Spanish parallel speech corpus are analysed with the aim of incorporating these tense-voice expressive styles into the numerical production of voice. To that effect, the GlottDNN vocoder is used to analyse F0 and spectral tilt variations associated with the glottal excitation on vowels [a]. These variations are mapped, through comparison with synthetic vowels, into F0 and Rd values to simulate vowels resembling happy and aggressive styles. Results show that it is necessary to increase F0 and decrease Rd with respect to neutral speech, with larger variations for the happy than for the aggressive style, especially for the stressed [a] vowels.
The results achieved in the conducted investigations validate the possibility of adding expressiveness to both corpus-based US-TTS synthesis and FEM-based numerical simulation of voice. Nevertheless, there is still room for improvement. For instance, the strategy applied to the numerical voice production could be improved by studying and developing inverse filtering approaches as well as by incorporating modifications of the vocal tract, whereas the developed US-TTS&S framework could benefit from advances in voice transformation techniques, including voice quality modifications, taking advantage of the experience gained in the numerical simulation of expressive vowels.
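
The pitch scaling and time-scale factors mentioned in this abstract can be illustrated with a small calculation. The sketch assumes each factor is a plain ratio between the score target and the selected spoken unit, and uses the common convention that MIDI note 69 is A4 at 440 Hz; the thesis may define its factors differently, and the function names and example values are hypothetical.

```python
def midi_to_hz(midi_note: float) -> float:
    """Equal-tempered frequency of a MIDI note (note 69 = A4 = 440 Hz)."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)


def pitch_scale_factor(spoken_f0_hz: float, target_midi_note: float) -> float:
    """Ratio by which a spoken unit's F0 must be scaled to reach the score note."""
    return midi_to_hz(target_midi_note) / spoken_f0_hz


def time_scale_factor(spoken_vowel_s: float, target_note_s: float) -> float:
    """Ratio by which a spoken vowel must be stretched to fill the note."""
    return target_note_s / spoken_vowel_s


if __name__ == "__main__":
    # Hypothetical unit: a 220 Hz, 0.1 s spoken vowel sung as a 0.8 s A4.
    print(pitch_scale_factor(220.0, 69))
    print(time_scale_factor(0.1, 0.8))
```

Under this reading, selecting spoken units whose F0 is already close to the score note keeps the pitch factor near 1, while short spoken vowels still need large time-scale factors for long notes, which matches the limitation the abstract reports.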

Tesis Doctorals en Xarxa

Investigating the utility of ultrasound visual biofeedback in voice instruction for two different singing styles

Author: Smith Kristen
Publication venue: Scholarship & Creative Works @ Digital UNC
Publication date: 04/08/2021

Smith, Kristen J. Investigating the Utility of Ultrasound Visual Biofeedback in Voice Instruction for Two Different Singing Styles. Unpublished Master of Arts thesis, University of Northern Colorado, 2021. Purpose: The purpose of this study was to investigate the potential utility of incorporating real-time visual biofeedback using ultrasonography to teach important concepts of vocal pedagogy to voice students. Exploration of innovative teaching tools, such as ultrasound visual biofeedback (U-VBF) in singing instruction, may contribute to bridging the gap between voice science and pedagogy by providing alternative ways to improve students’ kinesthetic awareness, clarify complex topics in voice physiology and acoustics, and create a common dialogue between different professionals specializing in voice. The primary research questions addressed in this study were: (a) To determine the current knowledge and attitude among voice teachers regarding use of visual biofeedback in singing instruction; (b) To determine voice teachers’ interest in learning about technology, specifically U-VBF; (c) To identify external variables that influence voice teachers’ perceptions of the usefulness and ease of use of U-VBF; and (d) To determine voice teachers’ attitudes of using U-VBF in teaching after viewing an instructional video. Methods: A pre-post survey design was adopted to assess perceptions, attitude, and interest of professional voice teachers regarding use of U-VBF before and after viewing of an instructional video on the use of ultrasound to teach concepts, such as vocal timbre, for two different singing styles: musical theater and opera. Multi-sampling methods were used to recruit professional voice teachers across the U.S. and abroad. Survey data were collected between February and April 2021. 
Following assumptions made by the Technology Acceptance Model (TAM) regarding user technology acceptance and behavior, data based on a final sample size of 56 participants were analyzed via descriptive statistics and thematic analysis of qualitative data. Results: Despite being largely unfamiliar with U-VBF, most participants initially expressed high expectations, believing it to be helpful in singing instruction but difficult to use. Those who expressed more positive opinions regarding use of U-VBF in singing instruction also expressed higher levels of interest in using it in their teaching. Perceived usefulness, ease of use, and interest in U-VBF were not found to be prominently related to select external variables. While perceived usefulness of U-VBF slightly declined after viewing the instructional video, perceived ease of use and participants’ opinions of effective use increased. Interest in the use of U-VBF as well as likelihood to use U-VBF marginally increased after viewing the video. Conclusions: These findings agree with the assumptions made by the TAM regarding associations between familiarity, perceived usefulness, perceived ease of use, and interest. Comparison between the rankings for perceived usefulness of U-VBF before and after viewing the instructional video suggests a general sense of uncertainty among voice teachers regarding use of U-VBF in singing instruction. While teachers conveyed high levels of interest, opinions of U-VBF as a way to teach vocal pedagogy concepts slightly declined after viewing the instructional video, suggesting a lowering of expectations. However, increased perceptions of ease of use indicated high levels of believed self-efficacy in using U-VBF. Understanding the relationships between perceived usefulness, ease of use, and interest can shed light on whether voice teachers would adopt U-VBF as a supplementary tool in singing instruction.

University of Northern Colorado