9 research outputs found

Idiomatic Expression in Song Lyrics from Arianna Grande Album Positions

Author: Nugrahani Dyah
Ristanti Putri
Sukmaningrum Rahmawati
Publication venue: 'Universitas PGRI Semarang'
Publication date: 27/01/2023

A song is a literary work that can describe a person's feelings, and many songwriters write lyrics according to what they felt at the time. This research aims to (1) classify the types of idiomatic expressions found in the lyrics of the album Positions by Ariana Grande, (2) determine the dominant type of idiomatic expression in those lyrics, and (3) identify the meaning of the idiomatic expressions found. The study is qualitative and descriptive. The researchers followed these steps: downloading all songs of the album Positions from a website, searching for the lyric transcripts, listening carefully to the songs as research data, finding the idiomatic expressions in the album, classifying them by type, and identifying their idiomatic meanings. The analysis found 16 songs that used idiomatic expressions, with 22 occurrences of the phrasal verb type, 8 of the prepositional verb type, and 9 of the partial idiom type. This corresponds to phrasal verbs (56.40%), prepositional verbs (20.50%), and partial idioms (23.10%). The dominant type of idiomatic expression in the Positions album was the phrasal verb, with the highest share at 56.40%.
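
The percentages in this abstract can be reproduced from the three counts it reports. The sketch below assumes that 22, 8, and 9 are occurrence counts that together form the whole sample (39 items); the abstract does not state the total, so that reading is an assumption.

```python
# Counts as reported in the abstract; treating them as parts of one
# 39-item total is an assumption, not something the abstract states.
counts = {"phrasal verb": 22, "prepositional verb": 8, "partial idiom": 9}

total = sum(counts.values())

# Share of each idiom type, as a percentage rounded to one decimal place.
shares = {kind: round(100 * n / total, 1) for kind, n in counts.items()}

# The type with the largest share is the dominant one.
dominant = max(shares, key=shares.get)

for kind, pct in shares.items():
    print(f"{kind}: {pct}%")
print("dominant:", dominant)
```

Under that assumption the shares agree with the reported 56.40%, 20.50%, and 23.10% to one decimal place.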

Journal Universitas PGRI Semarang

Multimodal Lyrics-Rhythm Matching

Author: Guessford Jesse
Liao Callie C.
Liao Duoduo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 14/03/2023

Despite the recent increase in research on artificial intelligence for music, prominent correlations between key components of lyrics and rhythm, such as keywords, stressed syllables, and strong beats, are not frequently studied. This is likely due to challenges such as audio misalignment, inaccuracies in syllabic identification, and most importantly, the need for cross-disciplinary knowledge. To address this lack of research, we propose a novel multimodal lyrics-rhythm matching approach in this paper that specifically matches key components of lyrics and music with each other without any language limitations. We use audio instead of sheet music with readily available metadata, which creates more challenges yet increases the application flexibility of our method. Furthermore, our approach generates several patterns involving various modalities, including strong musical beats, lyrical syllables, auditory changes in a singer's pronunciation, and especially lyrical keywords, which are used for matching key lyrical elements with key rhythmic elements. This approach not only provides a unique way to study auditory lyrics-rhythm correlations, including efficient rhythm-based audio alignment algorithms, but also bridges computational linguistics with music as well as music cognition. Our experimental results reveal a 0.81 probability of matching on average, and around 30% of the songs have a probability of 0.9 or higher of keywords landing on strong beats, including 12% of the songs with a perfect landing. Similarity metrics are also used to evaluate the correlation between lyrics and rhythm; they show that nearly 50% of the songs have a similarity of 0.70 or higher. In conclusion, our approach contributes significantly to the study of the lyrics-rhythm relationship by computationally unveiling insightful correlations. Comment: Accepted by 2022 IEEE International Conference on Big Data (IEEE Big Data 2022)
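
The "probability of keywords landing on strong beats" can be illustrated with a toy calculation. This is only a sketch under stated assumptions, not the paper's method: the function name `landing_probability`, the time tolerance, and the example timestamps are all hypothetical, and the keyword onsets and strong-beat times are assumed to have been extracted from the audio already.

```python
from bisect import bisect_left


def landing_probability(keyword_onsets, strong_beats, tolerance=0.05):
    """Fraction of keyword onsets within `tolerance` seconds of a strong beat.

    Hypothetical helper: both inputs are lists of times in seconds.
    """
    beats = sorted(strong_beats)
    if not keyword_onsets or not beats:
        return 0.0
    hits = 0
    for t in keyword_onsets:
        # Only the beats just before and just after t can be the nearest.
        i = bisect_left(beats, t)
        neighbours = beats[max(i - 1, 0): i + 1]
        if any(abs(t - b) <= tolerance for b in neighbours):
            hits += 1
    return hits / len(keyword_onsets)


# Made-up example: two of the four keywords fall on a strong beat.
strong_beats = [0.0, 2.0, 4.0]
keyword_onsets = [0.02, 1.0, 2.04, 3.9]
probability = landing_probability(keyword_onsets, strong_beats)
print(probability)
```

A per-song value of this kind could then be averaged over a collection, which is one way to read the 0.81 average quoted above.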

arXiv.org e-Print Archive

Procedural Generation of Musical Metrics Based on Lyrics Analysis

Author: Urbano Jorge Bragança da Silva Ferreira
Publication date: 26/07/2016

Beyond their semantic and discursive content, song lyrics usually carry another kind of information, one that has less to do with the act of writing than with the act of pronunciation. Since lyrics are written to be performed aloud, care is taken that the performance also conveys something, quite different from what the lyrics convey on paper. The synchrony of the phonetic and lexical stresses of the lyrics with the musical components in which they are set is the clearest example of this.
In this project, the proposal is to create a system able to return musical information for a given set of lyrics, specifically information on the meter. To this end, I will use CMUdict, a phonetic dictionary for the English language that contains, for each word, its division into phonemes with markers for their stress. The whole system is based on the Python programming language, with all the code developed by me specifically for the project. For each set of lyrics entered, the system analyzes it verse by verse and turns each verse into a metric template. All the verses of the lyrics are then fitted to each template and scored, to determine which template best fits the lyrics as a whole. The template with the highest score is chosen as the final metric structure.
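
The verse-to-template idea described in this abstract can be sketched in a few lines of Python. This is an illustration, not the author's code: the `PRON` table is a tiny hand-written sample in CMUdict-style notation (stress digits 0, 1, 2 on vowel phonemes) that should be checked against the real dictionary, and the position-by-position scoring rule is an assumption, since the abstract does not say how templates are scored.

```python
# Hand-written sample in CMUdict-style notation (assumed entries,
# not loaded from the real dictionary).
PRON = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "music": ["M", "Y", "UW1", "Z", "IH0", "K"],
    "water": ["W", "AO1", "T", "ER0"],
    "again": ["AH0", "G", "EH1", "N"],
    "today": ["T", "AH0", "D", "EY1"],
}


def stress_pattern(verse):
    """Turn a verse into a string of 1 (stressed) and 0 (unstressed) syllables."""
    digits = []
    for word in verse.lower().split():
        for phoneme in PRON.get(word, []):  # unknown words are skipped
            if phoneme[-1].isdigit():
                # Assumption: secondary stress (2) counts as stressed.
                digits.append("1" if phoneme[-1] in "12" else "0")
    return "".join(digits)


def fit(template, pattern):
    """Share of syllable positions where template and pattern agree."""
    matches = sum(a == b for a, b in zip(template, pattern))
    return matches / max(len(template), len(pattern), 1)


def best_template(verses):
    """Use each verse's pattern as a template and keep the best-scoring one."""
    patterns = [stress_pattern(v) for v in verses]
    return max(patterns, key=lambda t: sum(fit(t, p) for p in patterns))


verses = ["hello again", "today hello", "music water"]
print(best_template(verses))
```

In this toy input two of the three verses share an unstressed-stressed alternation, so that pattern scores highest across the lyrics.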

Repositório Aberto da Universidade do Porto

A influência do conteúdo e andamento musical e do perfil dos consumidores sobre o valor da marca de artista

Author: Nasr Georgia Coelho
Publication date: 28/06/2019

Undergraduate final project (Trabalho de Conclusão de Curso), Universidade de Brasília, Faculdade de Economia, Administração, Contabilidade e Gestão de Políticas Públicas, Departamento de Administração, 2019. In a market made up of a vast number of musical styles and references, it is crucial to study the aspects that drive listener behavior and that can consequently build an artist's brand. In this context, analyzing the influence of strategic choices in a musical composition and of listener characteristics is useful for artists and managers who intend to plan a career and reach goals related to brand value. This research investigates the influence of musical tempo and content and of consumer profile on artist brand value. An experimental method with a 2x2 design was applied cross-sectionally among consumers over one month, using primary data from a questionnaire answered by 494 respondents. The results indicate that musical tempo and musical content influence the dependent variable, brand value, but that this influence depends on the consumer profile. The study thus presents new evidence for improving artist career management strategies.

Biblioteca Digital de Monografias

Samba and bossa nova : some aspects of the music-text relationship

Author: Ricci Gabriela, 1988-
Publication venue: [s.n.]
Publication date: 13/07/2020

Advisor: Eleonora Cavalcante Albano. Dissertation (master's), Universidade Estadual de Campinas, Instituto de Estudos da Linguagem.
Abstract: Samba and bossa nova are genres of Brazilian popular music that are known all around the world and draw attention, among other things, because their singing style sounds close to Brazilian Portuguese (BP) speech. This research investigates the relationship between the prosodic patterns of lyrics and music in the two genres to corroborate the hypothesis that both are based on BP prosody. The main hypothesis was investigated through three auxiliary hypotheses: i) the incidence of musical accents is higher on stressed syllables than on unstressed ones; ii) the musical accentuation of the melody is reinforced by the instrumental accompaniment; and iii) the relationship between musical and linguistic accent is similar in samba and bossa nova. To this end, we analyzed the following aspects of three representative songs of each genre: the pauses made by the performers and their relationship with linguistic constituents of BP prosodic structure; the accumulation of musical accents in both the score and the performance and its relationship with the lexical stresses of the lyrics; the changes made by the singers relative to the score; and the contribution of the rhythm of the instrumental accompaniment to reinforcing the vocal interpretation. Finally, we compared the two genres. The results show that segmentation and accentuation in the singing of samba and bossa nova take BP speech as their reference. The kinship between the genres is reinforced by the large number of similarities found throughout the analyses, while the differences found can be understood as distinguishing characteristics of each genre.
Degree: Master's (Mestra em Linguística), Linguistics. 2013/15872-3, FAPES

Repositorio da Producao Cientifica e Intelectual da Unicamp

Information technological aspects in the field of music. Overview

Author: Kruus Kaarel
Publication venue: Tartu Ülikool
Publication date: 01/01/2012

The main purpose of this thesis is to give an overview of the existing software packages and tools aimed at simplifying musicians' everyday work. Since the field is quite extensive, only a subset of the available software has been taken into account, mainly programs designed to support preparing and interpreting sheet music. The thesis is divided into two major components: a database (appended on a CD), which contains all the information about the collected data (software, hardware, related bibliography, etc.), and the document itself, where the criteria for comparing the software packages are listed and explained together with some illustrative examples. The first two chapters of the document are dedicated to the ways of generating sheet music, describing and comparing the different software tools for displaying and editing sheet music using note graphics software. An overview of intelligent music stands, which is still an underdeveloped branch of this field, is also given; this area has the fewest tools, with only 4 applications, of which only one is usable in practice. The third chapter describes aspects of using music software as a learning aid, complemented with some examples from a freeware program. A brief overview of digital sheet music archives, together with some interesting examples, is given in the fourth chapter. The field-specific bibliography (covering the years 1989-2012) is presented in the fifth chapter, along with one of the best-known conference series (ISMIR), within which many of the publications were produced. Because almost all user interfaces of the software packages are in English, an illustrated English-Estonian dictionary of relevant terms is appended.
The database contains 184 entries of topic-related software packages – 4 intelligent music stand applications, 13 digital sheet music converter applications, 98 score editors, 14 study assistant applications and 55 miscellaneous applications; 13 digital note archives and 113 publications

DSpace at Tartu University Library

Adding expressiveness to unit selection speech synthesis and to numerical voice production

Author: Freixes Guerreiro Marc
Publication venue: Blanquerna - Universitat Ramon Llull
Publication date: 18/06/2021

Speech is one of the most natural and direct forms of communication between human beings, as it encodes both a message and paralinguistic cues about the emotional state of the speaker, their mood, or their intention, thus becoming instrumental in pursuing a more natural Human Computer Interaction (HCI). In this context, the generation of expressive speech for the HCI output channel is a key element in the development of assistive technologies or personal assistants, among other applications. Synthetic speech can be generated from recorded speech using corpus-based methods such as Unit-Selection (US), which can achieve high quality results but whose expressiveness is restricted to that available in the speech corpus. In order to improve the quality of the synthesis output, the current trend is to build ever larger speech databases, especially following the so-called End-to-End synthesis approach based on deep learning techniques. However, recording ad-hoc corpora for each and every desired expressive style can be extremely costly, or even unfeasible if the speaker is unable to properly perform the styles required for a given application (e.g., singing in the storytelling domain). Alternatively, new methods based on the physics of voice production have been developed in the last decade thanks to the increase in computing power.
For instance, vowels or diphthongs can be obtained using the Finite Element Method (FEM) to simulate the propagation of acoustic waves through a 3D realistic vocal tract geometry obtained from Magnetic Resonance Imaging (MRI). However, since the main efforts in these numerical voice production methods have been focused on improving the modelling of the voice generation process, little attention has been paid to its expressiveness up to now. Furthermore, the collection of data for such simulations is very costly, besides requiring manual time-consuming postprocessing like that needed to extract 3D vocal tract geometries from MRI. The aim of the thesis is to add expressiveness into a system that generates neutral voice, without having to acquire expressive data from the original speaker. One the one hand, expressive capabilities are added to a Unit-Selection Text-to-Speech (US-TTS) system fed with a neutral speech corpus, to address specific and timely needs in the storytelling domain, such as for singing or in suspenseful situations. To this end, speech is parameterised using a harmonic-based model and subsequently transformed to the target expressive style according to an expert system. A first approach dealing with the synthesis of storytelling increasing suspense shows the viability of the proposal in terms of naturalness and storytelling quality. Singing capabilities are also added to the US-TTS system through the integration of Speech-to-Singing (STS) transformation modules into the TTS pipeline, and by incorporating an expressive prosody generation module that allows the US to select units closer to the target singing prosody obtained from the input score. This results in a Unit Selection based Text-to-Speech-and-Singing (US-TTS&S) synthesis framework that can generate both speech and singing from the same neutral speech small corpus (~2.6 h). 
According to the objective results, the score-driven US strategy can reduce the pitch scaling factors required to produce singing from the selected spoken units, but its effectiveness is limited regarding the time-scale requirements due to the short duration of the spoken vowels. Results from the perceptual tests show that although the obtained naturalness is clearly far from that given by a professional singing synthesiser, the framework can address occasional singing needs for synthetic storytelling with a reasonable quality. The incorporation of expressiveness is also investigated in the 3D FEM-based numerical simulation of vowels through modifications of the glottal flow signals following a source-filter approach of voice production. These signals are generated using a Liljencrants-Fant (LF) model controlled with the glottal shape parameter Rd, which allows exploring the tense-lax continuum of phonation besides the spoken vocal range of fundamental frequency values, F0. The contribution of the glottal source to higher order modes in the FEM synthesis of cardinal vowels [a], [i] and [u] is analysed through the comparison of the High Frequency Energy (HFE) values obtained with realistic and simplified 3D geometries of the vocal tract. The simulations indicate that higher order modes are expected to be perceptually relevant according to reference values stated in the literature, particularly for tense phonations and/or high F0s. Conversely, vowels with a lax phonation and/or low F0s can result in inaudible HFE levels, especially if aspiration noise is not present in the glottal source. After this preliminary study, the excitation characteristics of happy and aggressive vowels from a Spanish parallel speech corpus are analysed with the aim of incorporating these tense-voice expressive styles into the numerical production of voice.
To that effect, the GlottDNN vocoder is used to analyse F0 and spectral tilt variations associated with the glottal excitation on [a] vowels. These variations are mapped through the comparison with synthetic vowels into F0 and Rd values to simulate vowels resembling happy and aggressive styles. Results show that it is necessary to increase F0 and decrease Rd with respect to neutral speech, with larger variations for the happy than for the aggressive style, especially for the stressed [a] vowels. The results achieved in the conducted investigations validate the possibility of adding expressiveness to both corpus-based US-TTS synthesis and FEM-based numerical simulation of voice. Nevertheless, there is still room for improvement. For instance, the strategy applied to the numerical voice production could be improved by studying and developing inverse filtering approaches as well as incorporating modifications of the vocal tract, whereas the developed US-TTS&S framework could benefit from advances in voice transformation techniques including voice quality modifications, taking advantage of the experience gained in the numerical simulation of expressive vowels.
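
The abstract mentions pitch scaling factors needed to turn spoken units into sung notes, and a score-driven selection that favours units closer to the target pitch. The following is a minimal sketch of that arithmetic only, not the thesis's implementation: it assumes equal-tempered tuning with A4 = 440 Hz, and the function names and the nearest-pitch selection rule are hypothetical stand-ins for the actual unit-selection cost, which the abstract does not specify.

```python
import math

A4_HZ = 440.0   # assumed reference tuning (not stated in the abstract)
A4_MIDI = 69    # MIDI note number of A4


def midi_to_hz(note: int) -> float:
    """Equal-tempered frequency of a MIDI note number."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)


def pitch_scaling_factor(spoken_f0_hz: float, target_midi_note: int) -> float:
    """Ratio by which a spoken unit's F0 must be scaled to reach the score note."""
    if spoken_f0_hz <= 0:
        raise ValueError("spoken F0 must be positive")
    return midi_to_hz(target_midi_note) / spoken_f0_hz


def factor_in_semitones(factor: float) -> float:
    """Express a pitch scaling factor as a signed distance in semitones."""
    return 12.0 * math.log2(factor)


def pick_closest_unit(candidate_f0s_hz, target_midi_note: int) -> float:
    """Hypothetical selection rule: choose the candidate needing the least pitch scaling."""
    return min(
        candidate_f0s_hz,
        key=lambda f0: abs(factor_in_semitones(pitch_scaling_factor(f0, target_midi_note))),
    )
```

For example, a spoken unit at 220 Hz needs a factor of 2 (12 semitones) to reach A4, whereas a unit at 400 Hz needs only a factor of 1.1, so a score-aware selection would prefer the latter and leave less work for the transformation stage.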

Tesis Doctorals en Xarxa