Search CORE

84 research outputs found

Comparison between rule-based and data-driven natural language processing algorithms for Brazilian Portuguese speech synthesis

Author: Vecchietti Luiz Felipe Santos
Publication venue: 'Programa de Pos-graduacao em Ciencias Contabeis da UFRJ'
Publication date: 01/04/2017

Due to the exponential growth in the use of computers, personal digital assistants and smartphones, Text-to-Speech (TTS) systems have been in high demand in recent years. An important part of these systems is the Text Analysis block, which converts the input text into the linguistic specifications used to generate the final speech waveform. The Natural Language Processing (NLP) algorithms in this block are crucial to the quality of the speech generated by synthesizers. They are responsible for tasks such as Grapheme-to-Phoneme Conversion, Syllabification and Stress Determination. For Brazilian Portuguese (BP), solutions for the algorithms in the Text Analysis block have focused on rule-based approaches. These algorithms perform well for BP but have many disadvantages. On the other hand, no research has yet evaluated, for BP, the data-driven approaches that reach state-of-the-art results for complex languages such as English. In this work, we therefore compare different data-driven and rule-based approaches for the NLP algorithms in a TTS system. Moreover, we propose, as a novel application, the use of Sequence-to-Sequence models as a solution for the Syllabification and Stress Determination problems. In summary, we show that data-driven algorithms can achieve state-of-the-art performance for the NLP algorithms in the Text Analysis block of a BP TTS system.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Pantheon

Liaison and pronunciation learning in end-to-end text-to-speech in French

Author: Maguer Sébastien Le
Richmond Korin
Taylor Jason
Publication venue: 'International Speech Communication Association'
Publication date: 28/08/2021

Edinburgh Research Explorer

Neural Network vs. Rule-Based G2P: A Hybrid Approach to Stress Prediction and Related Vowel Reduction in Bulgarian

Author: Karamihaylova Maria
Publication venue: CUNY Academic Works
Publication date: 01/06/2023

An effective grapheme-to-phoneme (G2P) conversion system is a critical element of speech synthesis. Rule-based systems were an early method for G2P conversion. In recent years, machine learning tools have been shown to outperform rule-based approaches in G2P tasks. We investigate neural network sequence-to-sequence modeling for the prediction of syllable stress and the resulting vowel reductions in Bulgarian. We then develop a hybrid G2P approach that combines manually written grapheme-to-phoneme mapping rules with neural network syllable stress predictions: stress markers are inserted at the predicted stress position of the transcription produced by the rule-based finite-state transducer. Finally, we apply vowel reduction rules relative to the position of the stress marker to yield the predicted phonetic transcription of the source Bulgarian word written in Cyrillic graphemes. We compare the word error rates of the neural network sequence-to-sequence approach and the hybrid approach and find no significant difference between the two. We conclude that our hybrid approach to syllable stress, vowel reduction, and transcription performs as well as the exclusively machine-learning-powered approach.
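The hybrid pipeline in this abstract (rule-based transcription, then a stress marker at the predicted position, then vowel reduction relative to that marker) can be illustrated with a small Python sketch. Everything specific in it is an assumption made for illustration: the vowel inventory, the two toy reduction rules, the Latin-letter input, and the function name `hybrid_transcribe` are hypothetical and are not the thesis's actual rules or transducer.

```python
# Toy sketch of a hybrid G2P post-processing step (illustrative assumptions only).
VOWELS = set("aeiouɤ")            # assumed vowel inventory for the toy example
REDUCTION = {"a": "ɐ", "o": "u"}  # hypothetical, simplified reduction rules

def hybrid_transcribe(rule_based_phones, stressed_vowel_index):
    """Insert a stress mark on the predicted vowel and reduce the other vowels.

    rule_based_phones: phones produced by a rule-based transcriber (stand-in
        for the finite-state transducer mentioned in the abstract).
    stressed_vowel_index: 0-based index of the stressed vowel, as a neural
        stress predictor might return it.
    """
    output = []
    vowel_count = 0
    for phone in rule_based_phones:
        if phone in VOWELS:
            if vowel_count == stressed_vowel_index:
                output.append("ˈ" + phone)                  # mark the stressed vowel
            else:
                output.append(REDUCTION.get(phone, phone))  # reduce unstressed vowel
            vowel_count += 1
        else:
            output.append(phone)                            # consonants pass through
    return output

# Usage: a romanized stand-in input with stress predicted on the second vowel.
result = hybrid_transcribe(list("voda"), 1)
```

The point of the design is the separation of concerns: the rules never need to know where stress falls, and the stress predictor never needs to emit phones.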

City University of New York

Results of the Second SIGMORPHON Shared Task on Multilingual Grapheme-to-Phoneme Conversion

Author: Ashby Lucas F.E.
Bartley Travis M.
Clematide Simon
Del Signore Luca
Gibson Cameron
Gorman Kyle
Lee-Sikka Yeonju
Makarov Peter
Malanoski Aidan
Miller Sean
Ortiz Omar
Raff Reuben
Sengupta Arundhati
Seo Bora
Spektor Yulia
Yan Winnie
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 05/08/2021

Grapheme-to-phoneme conversion is an important component in many speech technologies, but until recently there were no multilingual benchmarks for this task. The second iteration of the SIGMORPHON shared task on multilingual grapheme-to-phoneme conversion features many improvements from the previous year's task (Gorman et al. 2020), including additional languages, a stronger baseline, three subtasks varying the amount of available resources, extensive quality assurance procedures, and automated error analyses. Four teams submitted a total of thirteen systems, at best achieving relative reductions of word error rate of 11% in the high-resource subtask and 4% in the low-resource subtask
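For readers unfamiliar with the metric, the following Python sketch shows how a word error rate for G2P and a relative reduction over a baseline are typically computed. The function names and the example pronunciations are hypothetical; the shared task's own evaluation script is not reproduced here.

```python
def word_error_rate(predictions, references):
    """Percentage of words whose predicted pronunciation differs from the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one word is required")
    errors = sum(p != r for p, r in zip(predictions, references))
    return 100.0 * errors / len(references)

def relative_reduction(baseline_wer, system_wer):
    """Relative WER reduction of a system over a baseline, in percent."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

# Usage with made-up pronunciations: one of two words is wrong.
wer = word_error_rate(["k æ t", "d ɔ ɡ"], ["k æ t", "d ɑ ɡ"])
# Usage with made-up rates: a drop from 20.0 to 17.8 WER.
gain = relative_reduction(20.0, 17.8)
```

A relative reduction is measured against the baseline's error rate, not in absolute percentage points, which is why a small absolute drop can correspond to a double-digit relative figure.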

ZORA

Deep Learning for Automatic Assessment and Feedback of Spoken English

Author: Kyriakopoulos Konstantinos
Publication venue: University of Cambridge
Publication date: 12/03/2022

Growing global demand for learning a second language (L2), particularly English, has led to considerable interest in automatic spoken language assessment, whether for use in computer-assisted language learning (CALL) tools or for grading candidates for formal qualifications. This thesis presents research into the automatic assessment of spontaneous non-native English speech, with a view to providing meaningful feedback to learners. One of the challenges in automatic spoken language assessment is giving candidates feedback on particular aspects, or views, of their spoken language proficiency, in addition to the overall holistic score normally provided. Another is detecting pronunciation and other types of errors at the word or utterance level and feeding them back to the learner in a useful way. It is usually difficult to obtain accurate training data with separate scores for different views and, as examiners are often trained to give holistic grades, single-view scores can suffer from issues of consistency. Conversely, holistic scores are available for various standard assessment tasks such as Linguaskill. An investigation is thus conducted into whether assessment scores linked to particular views of the speaker's ability can be obtained from systems trained using only holistic scores. End-to-end neural systems are designed with structures and forms of input tuned to single views, specifically pronunciation, rhythm, intonation and text. By training each system on large quantities of candidate data, it should be possible to extract individual-view information. The relationships between the predictions of each system are evaluated to examine whether they are, in fact, extracting different information about the speaker. Three methods of combining the systems to predict the holistic score are investigated, namely averaging their predictions, concatenating their intermediate representations, and attending over their intermediate representations. The combined graders are compared to each other and to baseline approaches. The tasks of error detection and error tendency diagnosis become particularly challenging when the speech in question is spontaneous, especially given the inconsistency of human annotation of pronunciation errors. An approach to these tasks is presented that distinguishes between lexical errors, wherein the speaker does not know how a particular word is pronounced, and accent errors, wherein the candidate's speech exhibits consistent patterns of phone substitution, deletion and insertion. Three annotated corpora of non-native English speech by speakers of multiple L1s are analysed, the consistency of human annotation is investigated, and a method is presented for detecting individual accent and lexical errors and diagnosing accent error tendencies at the speaker level.

Apollo (Cambridge)

Pronunciation modelling in end-to-end text-to-speech synthesis

Author: Taylor Jason
Publication venue: The University of Edinburgh
Publication date: 13/06/2022

Sequence-to-sequence (S2S) models in text-to-speech synthesis (TTS) can achieve high-quality naturalness scores without extensive processing of text-input. Since S2S models have been proposed in multiple aspects of the TTS pipeline, the field has focused on embedding the pipeline toward End-to-End TTS (E2E-TTS), where a waveform is predicted directly from a sequence of text or phone characters. Early work on E2E-TTS in English, such as Char2Wav [1] and Tacotron [2], suggested that phonetisation (lexicon-lookup and/or G2P modelling) could be implicitly learnt in a text-encoder during training. The benefits of a learned text encoding include improved modelling of phonetic context, which makes contextual linguistic features traditionally used in TTS pipelines redundant [3]. Subsequent work on E2E-TTS has since shown similar naturalness scores with text- or phone-input (e.g. as in [4]). Successful modelling of phonetic context has led some to question the benefit of using phone- instead of text-input altogether (see [5]). The use of text-input brings into question the value of the pronunciation lexicon in E2E-TTS. Without phone-input, an S2S encoder learns an implicit grapheme-to-phoneme (G2P) model from text-audio pairs during training. With common datasets for E2E-TTS in English, I simulated implicit G2P models, finding increased error rates compared to a traditional, lexicon-based G2P model. Ultimately, successful G2P generalisation is difficult for some words (e.g. foreign words and proper names) since the knowledge to disambiguate their pronunciations may not be provided by the local grapheme context and may require knowledge beyond that contained in sentence-level text-audio sequences. When test stimuli were selected according to G2P difficulty, increased mispronunciations in E2E-TTS with text-input were observed. Following the proposed benefits of subword decomposition in S2S modelling in other language tasks (e.g. neural machine translation), the effects of morphological decomposition on pronunciation modelling were investigated. Learning of the French post-lexical phenomenon liaison was also evaluated. With the goal of an inexpensive, large-scale evaluation of pronunciation modelling, the reliability of automatic speech recognition (ASR) as a measure of TTS intelligibility was investigated. A re-evaluation of 6 years of results from the Blizzard Challenge was conducted. In English, ASR reliably found significant differences between systems similar to those found by paid listeners in controlled conditions. An analysis of transcriptions for words exhibiting difficult-to-predict G2P relations was also conducted. The E2E-ASR Transformer model used was found to be unreliable for such words, producing homophonic or incorrect transcriptions. A further evaluation of representation mixing in Tacotron finds that pronunciation correction is possible when mixing text- and phone-inputs. The thesis concludes that there is still a place for the pronunciation lexicon in E2E-TTS as a pronunciation guide, since it can provide assurances that G2P generalisation cannot.

Edinburgh Research Archive

Preprocessing models for speech technologies : the impact of the normalizer and the grapheme-to-phoneme on hybrid systems

Author: Carriço Bruna dos Santos
Publication date: 15/09/2022

One of the most promising and fastest-growing uses of natural language technology is in Speech Processing Technologies. These systems use automatic speech recognition and text-to-speech technology to provide a voice interface for conversational applications. This technology is present in many everyday situations, such as virtual assistants on smartphones (Siri or Alexa, for example) or voice interaction systems in cars. Speech technologies have progressively evolved to the point where systems may pay little attention to linguistic structure. Yet linguistic knowledge can be extremely important in a speech architecture, particularly in the data preprocessing phase: incorporating linguistic knowledge into a speech technology model makes it possible to produce more reliable and robust systems. In this sense, data preprocessing is a fundamental step in building an Artificial Intelligence (AI) model. If the data are reasonably preprocessed, the results will be consistent and of high quality (García et al., 2016). For example, the most modern speech recognition systems can model linguistic entities at several levels (sentences, words, phones and other units) using various statistical approaches (Jurafsky & Martin, 2022). Although trained on data, these systems are more accurate the more effectively and efficiently they capture linguistic knowledge. Against this background, this work describes the linguistic preprocessing methods in hybrid systems (artificial intelligence combined with linguistic knowledge) provided by an international AI company, Defined.ai. The start-up focuses on providing high-quality data, models and tools for AI through its crowdsourcing platform, Neevo. Users of the platform have access to small data annotation tasks such as transcription, recording and annotation of audio, pronunciation validation, sentence translation, sentiment classification of a text, and even information extraction from images and videos. To date, the company has more than 500,000 users from 70 countries and 50 different languages. Through decentralized data collection, Defined.ai responds to the growing need for training data that are fair, i.e., that do not reflect and/or amplify the patterns of discrimination prevailing in our society (e.g., of gender, race, sexual orientation). As a result, Defined.ai can be seen as a community of AI specialists that produces fair, ethical and future-oriented systems.

The main goal of this work is therefore to enhance and advance the quality of preprocessing models by applying linguistic knowledge to them. We focus on two introductory linguistic models in a speech architecture: the Normalizer and Grapheme-to-Phoneme (G2P). To address the main subject of this study, we outline two initiatives carried out in collaboration with the Defined.ai Machine Learning team. The first project centres on expanding and improving a pt-PT Normalizer model. The second project covers the creation of G2P models for two different languages, Swedish and Russian. The results show that a rule-based approach to the Normalizer and G2P increases their accuracy and performance, representing a significant advantage in improving Defined.ai tools and speech architectures. In addition, with the results of the first project, we improved the normalizer's ease of use by augmenting each rule with the corresponding linguistic knowledge. Our research thus demonstrates the value and importance of linguistic knowledge in preprocessing models.

The first project aimed to provide coverage for several linguistic rules: Real Numbers, Symbols, Abbreviations, Ordinals, Measures, Currency, Dates and Time. The task consisted of expanding the rules with their respective normalized expressions, starting from rules that each have an unambiguous unmarked reading. The main objective is to improve the normalizer by making it simpler, consistent across languages, and able to cover unambiguous inputs. To prepare a G2P model for two different languages, Swedish and Russian, the following tasks were carried out: 1. preparing a linguistic analysis of each language; 2. developing an initial phonetic-phonological inventory; 3. automatically mapping and converting the phonetic lexicon to DC-Arpabet (the phonetic alphabet built by Defined.ai); 4. reviewing and correcting the phonetic lexicon; and 5. evaluating the G2P model. The review of the phonetic lexicons was carried out, in consultation with our team at Defined.ai, by native-speaker linguists who checked whether the phonetic-phonological inventories were adequate for transcription. We evaluated the results of each model with five metrics that are standard in the literature: Word Error Rate (WER), Precision, Recall, F1-score and Accuracy. To suit the needs of our models, we adapted WER into Word Error Rate over normalizable tokens (WERnorm), which counts normalizable tokens only rather than all tokens. The normalizer is evaluated on a set of approximately 1,000 reference sentences, manually normalized and tagged with the normalization rule that should be applied (for example, real numbers, symbols, among others). According to the results, version 2 of the normalizer shows statistically significant discrepancies between rules. The ordinals rule has the highest percentage (94%) and the abbreviations rule the lowest (43%). We also found a significant increase in the performance of some rules; for example, abbreviations perform 23 percentage points (pp) better. Comparing the two versions, version 2 of the normalizer has, on average, an error rate over normalizable tokens that is 4 pp lower than that of version 1. The ordinals rule (94% F1-score) and the real-numbers rule (89% F1-score) are thus the largest source of improvement in the normalizer. In terms of precision, version 2 improves on version 1 by 28 pp on average. Overall, the results unequivocally show an improvement in the normalizer's performance across all rules applied. In the second project, the Swedish phonetic lexicon reached a WER of 10%, while the Russian phonetic lexicon reached a slightly worse WER (11%). The Swedish phonetic-phonological inventories show higher precision (97%) than the Russian ones (96%). Overall, the Swedish G2P model performs better (98%), although the gap to the Russian model (96%) is small. In conclusion, the results had a significant impact on the company's speech pipeline and on its written-speech architectures (15% is the speech architecture). In addition, version 2 of the normalizer has begun to be used in other Defined.ai projects, mainly in speech prompt collections. We observe that our expansion and improvement of the tool covered expressions that make up a considerable proportion of normalizable expressions, not limiting the tool's usefulness but increasing the diversity it can offer, for example when delivering prompts.

Based on the work carried out, we observe that a rule-based approach to the Normalizer and G2P increased their accuracy and performance, representing a significant advantage for improving not only Defined.ai tools but also speech architectures. Furthermore, our approach was also applied to other languages with very positive results, showing the importance of the methodology applied in this thesis. Our work thus shows the relevance and added value of applying linguistic knowledge to preprocessing models.

One of the fastest-growing and most promising uses of natural language technology is in Speech Technologies. Such systems use automatic speech recognition (ASR) and text-to-speech (TTS) technology to provide a voice interface for conversational applications. Speech technologies have progressively evolved to the point where they pay little attention to linguistic structure. However, linguistic knowledge can be extremely important in a speech pipeline, particularly in the Data Preprocessing phase: incorporating linguistic knowledge into a speech technology model allows more reliable and robust systems to be produced. Given this background, this work describes the linguistic preprocessing methods in hybrid systems provided by an international Artificial Intelligence (AI) company, Defined.ai. The startup focuses on providing high-quality data, models, and AI tools. The main goal of this work is to enhance and advance the quality of preprocessing models by applying linguistic knowledge. Thus, we focus on two introductory linguistic models in a speech pipeline: Normalizer and Grapheme-to-Phoneme (G2P). To do so, two initiatives were conducted in collaboration with the Defined.ai Machine Learning team. The first project focuses on expanding and improving a pt-PT Normalizer model. The second project covers creating G2P models for two different languages, Swedish and Russian. Results show that a rule-based approach to the Normalizer and G2P increases their accuracy and performance, representing a significant advantage in improving Defined.ai tools and speech pipelines. Also, with the results obtained in the first project, we improved the normalizer's ease of use by augmenting each rule with linguistic knowledge. Accordingly, our research demonstrates the added value of linguistic knowledge in preprocessing models.
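The abstract describes WERnorm only as a word error rate restricted to normalizable tokens, so the exact definition is not given. The Python sketch below shows one plausible reading under explicit assumptions: tokens are already aligned one-to-one between reference and system output, and a boolean mask marks which tokens are normalizable. The function name `wer_norm` and the example sentence are hypothetical.

```python
def wer_norm(reference, hypothesis, normalizable):
    """Error rate (in percent) computed over normalizable tokens only.

    reference, hypothesis: aligned lists of normalized token strings
        (one-to-one alignment is an assumption of this sketch).
    normalizable: list of booleans, True where the token required normalization.
    """
    if not (len(reference) == len(hypothesis) == len(normalizable)):
        raise ValueError("all three sequences must have the same length")
    total = sum(normalizable)
    if total == 0:
        raise ValueError("no normalizable tokens to score")
    errors = sum(
        1
        for ref, hyp, flag in zip(reference, hypothesis, normalizable)
        if flag and ref != hyp  # plain tokens never count toward the score
    )
    return 100.0 * errors / total

# Usage with a made-up sentence: two normalizable tokens, one left unexpanded.
reference = ["custa", "vinte euros", "no", "primeiro", "dia"]
hypothesis = ["custa", "vinte euros", "no", "1.º", "dia"]
mask = [False, True, False, True, False]
score = wer_norm(reference, hypothesis, mask)
```

Restricting the denominator to normalizable tokens keeps the many ordinary words that pass through unchanged from diluting the error rate, which is presumably the motivation for the adapted metric.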

Universidade de Lisboa: Repositório.UL

Automatic Scansion of Poetry

Author: Aguirrezabal Zabaleta Manex
Publication date: 01/01/2017

146 p.Lan honetan poesiaren eskantsioa, hau da, poemetako egitura erritmikoaren erauztea, burutzen duguautomatikoki. Horretarako hizkuntzaren prozesamenduko ohiko teknikak erabili ditugu. Metodo batzukerregeletan oinarritutakoak dira, beste batzuk berriz, datuetan oinarritutakoak. Emaitzek iradokitzendute emaitzarik onenak datuetan oinarritutako sistemekin lortutakoak direla.1.- SarreraLehen zutabean dagoen poema osorik irakurrita, erritmo gorabeheratsu (TA-TAN-TA-TAN) konstantebat hauteman daiteke. Bigarren zutabeko lehen adibidea ahoz irakurriko bagenu, TA-RA-TAN modukosoinu bat hautemango genuke. Bigarren adibidea, aldiz, gaztelerazko hendekasilabo bat da, beraz,hamaika soinu unitateko lerroa dugu hura, azken aurreko silaba azentudunarekin. Baina, posible allitzateke horrelako egiturak antzematea hizkuntzaren erabateko ezagutza izan gabe? edo, are gehiago,hizkuntzari buruzko inolako informaziorik gabe, topa al daitezke halako patroiak? HizkuntzarenProzesamenduaren arloko erronkatzat har dezakegu poemetako patroi prosodikoen hautemate hau.Uneko hizkuntzari buruzko informaziorik izan gabe egitura prosodiko hau erauzteko, tradizio poetikoezberdinen azterketa tipologiko bat egitea beharrezkoa dela uste dugu. Bide horretan lehen pausuakemateko ikerlan hau aurkezten dugu, non poesiaren egitura prosodikoa automatikoki aztertzen dugunhizkuntzaren prozesamenduko algoritmo batzuk erabilita. Metodo hauek ingelesezko poemetanaplikatu ditugu emaitza onak lortuaz, eta eredu hoberenak gaztelerazko eta euskarazko corpus banatanaplikatu ditugu.Honako egitura jarraitzen du testu honek: Bigarren atalean eskantsioa definitzen dugu eta tradiziopoetiko ezberdinak aurkezten. Horretaz aparte, poesiaren analisi automatikoaren inguruan egin direnlan batzuk zerrendatzen ditugu. Hirugarren atala lanaren muina dela esan dezakegu, hor aurkeztenbaititugu lan honetarako erabili ditugun corpusak, metodoak eta egindako esperimentuak. 
Bukaeran,laugarren atalean, esperimentuen ondorioak jartzen ditugu.2.- EskantsioaPoema lerro batean eskantsioa egitea poema horren egitura erritmikoa erauztea da, azentuak, oinaketa errimak adierazita (Baldick, 2015). Lan honetan, ordea, lerro bakoitzaren azentu sekuentzia soilikinferitzen dugu.2.1 Poesia ingelesezHainbat liburu idatzi dira ingelesezko poesiaren prosodiaren inguruan, Halle eta Keyser (1971); Corn(1997); Fabb (1997) eta Steele (1999), adibidez. Ingelesezko poesian silabak oin izeneko multzoetanelkartzen dira. Multzo hauek hainbat silabez osatuta daude, baina ohikoenak bi edo hiru silabakomultzoak dira. Oin hauetako bakoitzak gutxienez gailentzen den silaba bat izango du, azentuatuakontsideratuko duguna. Egitura ohikoenak ianbikoa (bal-loon), trokaikoa (jun-gle), daktilikoa (ac-cident)eta anapestikoa (but I¿m tel-ling you Liz ) dira (Baldick, 2015).Metrika tradizionalaren arabera (Fussell, 1965; Steele, 1999), honelako oinez osatua egongo da lerrometriko oro. Lerroon luzera oin kopuruaren araberakoa izango da, beraz, trimetro batek hiru oin izangoditu, tetrametro batek lau, pentametro batek bost, etab. (hexametro, heptametro, . . . ). Ingelesezkopoesian metrika arruntena pentametro ianbikoa da, adibidez,oh change thy thought, that I may change my mind.non bost azentu argi nabaritzen diren eta TA-TAN multzo bakoitzak oin bat osatzen duen. Poemokorokorrean erregularrak diren arren, ohikoa da aldaketa txiki batzuk egitea egiturotan, helburu estetikoedota artistikoekin.Grant if thou wilt, thou art beloved of manyAurreko adibidearekin alderatuta, honetan hasieran TAN-TA-TA-TAN moduko soinu bat antzematenda. Aldaketa honi, literaturan bariazio trokaiko deitzen zaio. 
Gainera, lerroa ianbikoa izanda, bukaeraktonikoa behar luke izan, baina aldaketa ohikoa da silaba azentudun baten ostean silaba ez-azentudunbat gehitzea lerroaren bukaeran.2.2 Poesia gaztelaniazGaztelerazko poesian hainbat egitura metriko erabili izan dira (Quilis, 1984; Toma¿s, 1995; Caparro¿s,1999). Lan honetan, corpusaren eskuragarritasuna medio, garai espezifiko batean soilik egin duguenfasia, Espainiako Urrezko Aroan, alegia. Garai honetan gehien erabilitako metrika hendekasilaboaizan zen, lerro bakoitza hamaika silabez osaturik. Lerroetako azentu sekuentzia nahiko erregularra daeta normalean hamargarren silabak azentua darama. Beste silabek ere azentua izan dezakete, etanabarmendutako posizio horien arabera, hendekasilabo hauek hainbat motatakoak izan daitezke.Gaztelerazko poesiaren erronka handienetako bat silaba laburketen erabilera da, sinalefa gisa ezagutzendena, non hamaika silaba baino gehiago dituzten lerroak hamaika silabetan ahokatzen diren. Lan honenhelburua silaba bakoitzari azentu bat automatikoki esleitzea da, ondorioz, metodo erdi-automatiko baterabili dugu sinalefak dauden kasuetan lerroko silaba bakoitzari azentu balio bat esleitzeko.2.3 Poesia euskarazGaur egungo poesian, eta bereziki bertsolaritzan, neurri ezagunik bada, neurri txikiak eta handiak dira.Neurri txikiek lerro bakoitietan zazpi silaba izaten dituzte eta bikoitietan sei. Handiek, ordea, hamarsilaba eta zortzi silaba izaten dituzte lerro bikoiti eta bakoitietan, hurrenez hurren. Ez dira hauek, ordea,poesian erabiltzen diren neurri bakarrak. Idatzizko poesian ohikoa da zortziko ertainaren erabilera, nonlerro bakoitiek zortzi silaba dituzten eta bikoitiek zazpi. Neurri gehienetan lerro bikoitiek elkarrekinerrimatu behar dute.Ikerlan honetan azentuei erreparatzen diegu eta oraindik ez dago argi ea euskarazko poesian azentuekeragin nabarmena duten ala ez. Hainbat adituk idatzi izan dute euskal poesia eta haren neurkerariburuz, XVII. mendetik hasita. 
On reading these, one finds opposing views. According to some (Oihenart and Father Onaindia, for instance), rhythm matters in Basque poetry, and every poem must have some kind of rhythm.

"All literatures have their laws and rules, especially in poetry; Basque must necessarily have them too. We must keep at least these four things in mind: 1) movement (rhythm); 2) the break (caesura); 3) measure, and 4) the matching final ending (rhyme)." Onaindia (1961)

Others, by contrast, say that accent has no effect in Basque. Nikolas Ormaetxea, "Orixe", is one poet who says so.

"To show how barely perceptible the Basque accent is, try placing written accents on the syllables one believes to be stressed; assign the task to a hundred people with a good ear and, on a single page submitted for analysis, one can safely assert that no two of them will agree." Ormaechea (1920)

2.4 Automatic scansion

In recent years several works have addressed automatic scansion. In these works, the task is usually to take a sequence of words as input and return the sequence of stresses that the words follow. This translation or transduction process can be carried out in several ways:

• Rule-based: following rules set by experts, taking several linguistic features into account.
• Data-driven: based on labeled information, automatically learning patterns that map text to stresses. This is the line we follow in the work presented here.

Among the works presented over these years, the rule-based ones are Logan (1988); Gervas (2000); Hartman (2005); Plamondon (2006); McAleese (2007); Navarro-Colorado (2015) and Agirrezabal et al. (2016b). Data-driven methods are gaining ever more attention, owing to the availability of labeled information. Among these we can highlight Hayward (1996); Greene et al. (2010); Hayes et al. (2012); Agirrezabal et al. (2016a) and Estes and Hench (2016).

3 Corpora, methods and experiments

3.1 Corpora

Labeled data are essential for developing data-driven systems and for evaluating rule-based ones. For this purpose we use three corpora: one English, one Spanish and one Basque. For the English work we used the poetry corpus produced by the "For Better For Verse" project (Tucker, 2011), developed at the University of Virginia. This corpus contains 78 poems and 1,100 lines of verse in total. In scansion, some lines admit several analyses, and the corpus records them as such (with several alternatives). For the Spanish experiments, as mentioned earlier, we used a corpus from the Spanish Golden Age (Navarro-Colorado et al., 2016). The labeled corpus consists of 135 sonnets and has approximately 2,000 lines. For the Basque experiments, we collected and hand-labeled a corpus based on Patri Urkizu's collection "Poesía vasca: Antología bilingüe". This corpus has 38 poems and around 2,000 lines.

3.2 Methods

We ran the first experiments on English and, building on those, extrapolated the best methods to Spanish and Basque. First, we developed a rule-based system to analyze English poetry. We then moved on to data-driven techniques. We applied techniques that are common in language processing to learn patterns from these data and apply them to previously unseen poems. The techniques we used fall into three groups: those that perform plain classification, those that perform structured classification, and techniques based on neural networks.

The best of the techniques used are the perceptron (Freund and Schapire, 1999), hidden Markov models (Rabiner, 1989), conditional random fields (Lafferty et al., 2001) and recurrent neural networks with long short-term memory (Lample et al., 2016).

Different methods can be used to evaluate the various techniques and configurations. When the amount of data is not very large, as in our case, k-fold cross-validation is the most common choice. In cross-validation the dataset is split into k parts. Once the parts are made, k − 1 of them are used to train a model and one is held out for evaluation. This is repeated k times, and the average accuracy is returned. In our case, we split our dataset into 10 parts.

3.3 Evaluation

The following table shows the accuracies of the best data-driven methods. These accuracies are computed at the syllable level.

The following table shows the results that the methods obtained at the line level.

As the results table shows, the systems based on neural networks give the best results, in both English and Spanish. Several conclusions can be drawn from that table.

4 Conclusions

In Agirrezabal et al. (2016a) we stated that the 10 features we use in the perceptron and the CRFs were suitable for the prosodic analysis of poetry, and especially interesting because they appeared to be language-agnostic. In these experiments, after testing on Spanish, we found that the features give fairly good results on English, considering their simplicity. On the Spanish data, however, the results were not as good, which suggests that these features are not sufficient for building language-independent systems. To confirm this, though, we would need to run these experiments on more languages.

Having examined the results, we conclude that word boundaries are very important for inferring the prosodic structure of poems, especially in Spanish. One explanation may be that English words have, on average, fewer syllables than Spanish words, as can be seen in the figure below. Moreover, the models based on neural networks appear to model the phonological structure of words well, but more experiments would be needed to demonstrate this empirically.
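
The 10-fold cross-validation of Section 3.2 and the syllable-level and line-level accuracies of Section 3.3 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' code: the function names, the toy data and the position-majority baseline (a stand-in for the perceptron, HMM, CRF and LSTM models named in the text) are all hypothetical.

```python
from collections import Counter, defaultdict


def k_fold_indices(n_items, k):
    """Split range(n_items) into k contiguous folds of near-equal size."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_position_baseline(lines):
    """Hypothetical stand-in model: most frequent label per syllable position."""
    counts = defaultdict(Counter)
    for _syllables, labels in lines:
        for pos, label in enumerate(labels):
            counts[pos][label] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


def predict(model, syllables, default="-"):
    """Label each syllable by its position; unseen positions get `default`."""
    return [model.get(pos, default) for pos in range(len(syllables))]


def evaluate(model, lines):
    """Return (syllable-level accuracy, line-level accuracy)."""
    syl_ok = syl_total = line_ok = 0
    for syllables, gold in lines:
        pred = predict(model, syllables)
        matches = sum(p == g for p, g in zip(pred, gold))
        syl_ok += matches
        syl_total += len(gold)
        line_ok += matches == len(gold)  # a line counts only if fully correct
    return syl_ok / syl_total, line_ok / len(lines)


def cross_validate(lines, k=10):
    """Train on k - 1 folds, test on the held-out fold, average over k runs."""
    scores = []
    for held_out in k_fold_indices(len(lines), k):
        held = set(held_out)
        train = [lines[i] for i in range(len(lines)) if i not in held]
        test = [lines[i] for i in held_out]
        scores.append(evaluate(train_position_baseline(train), test))
    return (sum(s for s, _ in scores) / k, sum(l for _, l in scores) / k)


# Toy data (invented): 20 ten-syllable lines, "+" stressed, "-" unstressed.
REGULAR = list("-+-+-+-+-+")
INVERTED = list("+--+-+-+-+")  # first foot inverted
toy_lines = [(["syl"] * 10, INVERTED if i in (0, 10) else REGULAR)
             for i in range(20)]

syllable_acc, line_acc = cross_validate(toy_lines, k=10)
print(f"syllable-level: {syllable_acc:.2f}, line-level: {line_acc:.2f}")
```

The two scores differ because a single wrong syllable makes the whole line count as wrong, so line-level accuracy can never exceed syllable-level accuracy.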

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Archivo Digital para la Docencia y la Investigación

Visual Speech Synthesis using Dynamic Visemes and Deep Learning Architectures

Author: Thangthai Ausdang
Publication venue
Publication date: 01/04/2018
Field of study

The aim of this work is to improve on existing methods in the naturalness of visual speech synthesis produced automatically from a linguistic input. Firstly, the most important contribution is the investigation of the most suitable speech units for visual speech synthesis. We propose the use of dynamic visemes instead of phonemes or static visemes and find that dynamic visemes generate better visual speech than either phone or static viseme units. Moreover, the best performance is obtained by a combined phoneme-dynamic viseme system. Secondly, we compare the hidden Markov model (HMM) with different deep learning models, including feedforward and recurrent structures in one-to-one, many-to-one and many-to-many architectures, to find the most appropriate model. Results suggest that frame-by-frame synthesis with the deep learning approach outperforms state-based synthesis with HMM approaches, and that an encoder-decoder many-to-many architecture is better than the one-to-one and many-to-one architectures. Thirdly, we explore the importance of contextual features that include information at varying linguistic levels, from the frame level up to the utterance level. We find that frame-level information is the most valuable feature, as it avoids discontinuities in the visual feature sequence and produces smooth and realistic animation output. Fourthly, we find that the two most common objective measures, correlation and root mean square error, do not indicate the realism and naturalness of human-perceived quality. We introduce an alternative objective measure and show that the global variance is a better indicator of human perception of quality. Finally, we propose a novel method to convert a given text input and phoneme transcription into a dynamic viseme transcription when a reference dynamic viseme sequence is not available. 
Subjective preference tests confirmed that our proposed method is able to produce animations that are statistically indistinguishable from animations produced using reference data.
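
The three objective measures named in this abstract can be illustrated with a small sketch. The abstract does not define them, so the definitions here are assumptions: Pearson correlation and root mean square error between a predicted and a reference trajectory of one visual feature, and global variance taken as the variance of a single trajectory over the whole utterance. The function names and the toy trajectories are invented for the example.

```python
import math


def rmse(pred, ref):
    """Root mean square error between two equal-length trajectories."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))


def pearson_correlation(pred, ref):
    """Pearson correlation coefficient between two trajectories."""
    n = len(ref)
    mean_p, mean_r = sum(pred) / n, sum(ref) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    norm_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    norm_r = math.sqrt(sum((r - mean_r) ** 2 for r in ref))
    if norm_p == 0 or norm_r == 0:
        raise ValueError("correlation is undefined for a constant trajectory")
    return cov / (norm_p * norm_r)


def global_variance(trajectory):
    """Population variance of one feature over the whole utterance (assumed definition)."""
    n = len(trajectory)
    mean = sum(trajectory) / n
    return sum((x - mean) ** 2 for x in trajectory) / n


# Invented data: a reference trajectory and an over-smoothed prediction
# that follows the same shape at half the amplitude.
reference = [0.0, 2.0, 0.0, -2.0, 0.0, 2.0, 0.0, -2.0]
smoothed = [0.5 * x for x in reference]

print("correlation:", pearson_correlation(smoothed, reference))
print("RMSE:", rmse(smoothed, reference))
print("global variance (reference):", global_variance(reference))
print("global variance (smoothed):", global_variance(smoothed))
```

In this toy case the damped prediction correlates perfectly with the reference, yet its global variance is a quarter of the reference value, which is the kind of over-smoothing that correlation alone does not reveal.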

University of East Anglia digital repository