4 research outputs found

    Automatic Intonation Event Detection Using Tilt Model for Croatian Speech Synthesis

    Text-to-speech (TTS) systems convert text into speech. Synthesized speech without prosody sounds unnatural and monotonous, so prosodic elements have to be implemented for it to sound natural. Generating prosodic elements directly from text is a demanding task. Our final goals are to build a complete prosodic model for Croatian and to implement it in our TTS system. In this work, we present one of the steps in adding prosody to a TTS system: detection of intonation events using the Tilt intonation model. We propose a training procedure composed of several subtasks. First, we hand-labelled a set of utterances and marked four types of prosodic events within each of them. Then we trained HMMs and used them to mark prosodic events on a larger set of utterances. We estimated parameters for each intonation event and generated f0 contours from those parameters. Finally, we evaluated the obtained f0 contours
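As a rough illustration of the per-event parameters mentioned in the abstract above, the sketch below computes amplitude, duration and tilt from the rise and fall components of one intonation event, following the standard Tilt formulation. The function name, argument names and sample values are assumptions made for illustration; they are not taken from the paper.

```python
def tilt_parameters(a_rise, a_fall, d_rise, d_fall):
    """Return (amplitude, duration, tilt) for one intonation event.

    a_rise, a_fall: f0 excursions of the rise and fall parts (e.g. in Hz).
    d_rise, d_fall: durations of the rise and fall parts (e.g. in seconds).
    All names are hypothetical; only the formulas follow the Tilt model.
    """
    abs_rise, abs_fall = abs(a_rise), abs(a_fall)
    amplitude = abs_rise + abs_fall      # total f0 excursion of the event
    duration = d_rise + d_fall           # total duration of the event
    if amplitude == 0 or duration == 0:
        raise ValueError("event has no excursion or no duration")
    tilt_amp = (abs_rise - abs_fall) / amplitude
    tilt_dur = (d_rise - d_fall) / duration
    tilt = 0.5 * (tilt_amp + tilt_dur)   # +1 = pure rise, -1 = pure fall
    return amplitude, duration, tilt
```

A tilt of +1 describes a pure rise, -1 a pure fall, and 0 a symmetric rise-fall.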

    Síntesis de voz aplicada a la traducción voz a voz

    In the field of speech technologies, text-to-speech conversion is the automatic generation of an artificial voice that sounds identical to a human voice reading a text aloud. Inside a text-to-speech system, the prosody module produces the prosodic information needed to generate a natural voice: intonational phrases, sentence intonation, phoneme duration and energy, etc. The correct generation of this information directly affects the naturalness and expressiveness of the system. The main goal of this thesis is the development of new algorithms for training prosody generation models that can be used in a text-to-speech system, and their use in the framework of speech-to-speech translation. Several alternatives were studied for intonation modelling. They combine parameterization and intonation model generation in an integrated process. This approach was judged successful in both objective and subjective evaluations. The influence of segmental and suprasegmental factors on duration modelling was also studied. Based on the results of these studies, several algorithms were proposed that can combine segmental and suprasegmental information, as in other publications in this field. Finally, various phrase break models were analysed, using both words and accent groups: classification trees (CART), language modelling (LM) and finite state transducers (FST). Using the same data set across the experiments made it possible to draw relevant conclusions about the differences between these models. One of the main goals of this thesis was to improve naturalness, expressiveness and consistency with the style of the source speaker in text-to-speech systems. This can be done by using the prosody of the source speaker as an additional information source in the framework of speech-to-speech translation.
Several prosody generation algorithms were developed that can integrate this additional information when predicting intonation, phoneme duration and phrase breaks. To that end, several approaches were studied for transferring intonation from one language to the other. The chosen approach was an automatic clustering algorithm that finds tonal movements that are related between languages, without any limit on their number. This coding can then be used for intonation modelling of the target language. Experimental results show an improvement that is more pronounced for closely related languages, such as Spanish and Catalan. Although no segmental duration transfer was performed between languages, this thesis proposes transferring rhythm from one language to the other. For that purpose, a method that combines rhythm transfer and audio synchronization was proposed. Synchronization is included because of its importance for speech-to-speech translation when video is also used. Lastly, this thesis also proposes a pause transfer technique in the framework of speech-to-speech translation, based on alignment information. Studies on training data have shown the advantage of tuples for this task. To predict any pause that cannot be transferred with this method, conventional pause prediction algorithms are used (CART, CART+LM, FST), taking into account the pauses already transferred.
Within speech technologies, text-to-speech conversion consists of generating, by automatic means, an artificial voice that produces the same sound as a person reading a text aloud. In short, text-to-speech converters are systems that convert text into synthetic speech. The text-to-speech conversion process is divided into three basic modules: text processing, prosody generation and synthetic speech generation.
In the first module, the text is normalized (to expand abbreviations, convert numbers and dates into text, etc.), and sometimes morphosyntactic tagging is then also performed. Next, graphemes are converted into phonemes and syllabification is carried out to obtain the phoneme sequence needed to reproduce the text. The prosody module then generates the prosodic information needed to produce the voice. To do so, it predicts the intonational phrases and the intonation of the sentence, as well as the duration and energy of the phonemes, etc. The correct generation of this information directly affects the naturalness and expressiveness of the system. The final module, speech generation, produces the voice from the information provided by the text processing and prosody modules. The aim of this thesis is the development of new algorithms for training prosody generation models for text-to-speech conversion, and their application in the framework of speech-to-speech translation. For intonation modelling algorithms, the literature generally proposes approaches that include a stylization step prior to parameterization. This thesis studied alternatives that avoid this stylization by combining parameterization and generation of the intonation model into an integrated whole. This approach proved successful in both the objective evaluation (using measures such as mean squared error or the Pearson correlation coefficient) and the subjective one. The evaluators judged the proposed approach to have higher quality and naturalness than the other algorithms from the literature included in the evaluations, reaching a naturalness MOS of 3.55 (4.63 for the original voice) and a quality MOS of 3.78 (4.78 for the original voice).
Postprint (published version
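The objective measures named in the abstract above (an error measure and the Pearson correlation coefficient between a reference and a predicted f0 contour) can be sketched as follows. This is a minimal illustration that assumes both contours are sampled at the same time points; the function names and the use of root mean squared error are choices made here, not details from the thesis.

```python
import math


def rmse(reference, predicted):
    """Root mean squared error between two equally long f0 contours."""
    if not reference or len(reference) != len(predicted):
        raise ValueError("contours must be non-empty and of equal length")
    squared = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    return math.sqrt(squared / len(reference))


def pearson(reference, predicted):
    """Pearson correlation coefficient between two f0 contours."""
    if len(reference) < 2 or len(reference) != len(predicted):
        raise ValueError("contours must have equal length of at least 2")
    n = len(reference)
    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_r == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a flat contour")
    return cov / math.sqrt(var_r * var_p)
```

A lower RMSE and a correlation closer to 1 both indicate a predicted contour that follows the reference more closely; the two measures are complementary, since correlation ignores a constant offset that RMSE penalizes.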

    The Future of Information Sciences : INFuture2011 : Information Sciences and e-Society


    Automatic prediction and modelling of Croatian prosodic features based on text

    Human speech conveys a wide range of information contained in the accent system, intonation, duration, rhythm, pauses and speech rate, and these features are often collectively called prosody. No extensive research on predicting and modelling prosodic features has so far been carried out for Croatian. This dissertation investigated the applicability of methods for predicting and modelling prosodic features to Croatian, and the possibilities of improving them by including linguistic features and language-specific characteristics of Croatian, such as lexical stress. Croatian belongs to the group of restricted tone languages, in which the tonal contour realized on the stressed word carries lexical information, so a prerequisite for modelling Croatian prosody is a lexicon that covers the stress of both base and derived word forms. Such a lexicon was therefore built as part of this dissertation. Since a lexicon cannot cover every word that appears in a text, a system was also developed for automatically assigning stress to words not found in the lexicon. The system is based on a model trained on data from the stress lexicon. The doctoral work also included an analysis of syllable duration in Croatian and the development of a syllable duration model. The Tilt intonation model was applied to model the F0 contour, and for this purpose a corpus of 500 sentences was labelled with Tilt labels.
Because of the many roles of prosody in human communication, predicting and modelling it is important and can be applied in many areas of natural language processing, such as automatic speech recognition, speech synthesis, automatic speaker and language identification, topic boundary detection, determining the emotional states of participants in communication, machine-aided translation systems, computer-assisted language learning systems, etc.
Human speech conveys a wide range of information through pitch accent, intonation, duration, rhythm, pauses and speech rate, and these characteristics are often collectively referred to as prosody. Because of the many roles of prosody in human communication, predicting and modelling it is important and can be applied in many areas of natural language processing, such as automatic speech recognition, speech synthesis, automatic identification of speakers and languages, determining emotional states, etc. Prior to this research, no extensive research on predicting and modelling prosodic characteristics had been conducted for the Croatian language. This doctoral thesis tested the applicability of methods for predicting and modelling prosodic features to Croatian. It also explored the possibility of improving their performance by including linguistic features and language-specific characteristics typical of Croatian (for example, lexical stress). Croatian is a pitch accent language in which the tone contour realized on the prominent word carries lexical information. A prerequisite for modelling the prosody of Croatian is therefore a lexicon in which the lexical stress of both basic and derived word forms is marked. Such a lexicon was created by implementing rules that construct derived word forms by adding the appropriate ending and shifting the place of stress where necessary.
Each lexicon entry comprises a derived word form written both without and with its stress, together with its morphosyntactic description (MSD) or part-of-speech (POS) tag. Croatian is an under-resourced language, so the lexicon is expected to be significant and widely applicable in various fields of natural language processing. The lexicon comprises 72,366 words in their basic form and over 1,000,000 derived word forms. Besides the lexicon, implementing the rules for constructing derived word forms also produced a system for automatic stress assignment for Croatian. The accuracy of the rule-based system was tested by comparing its output on a text with the same text in which stress had been assigned by an expert. The system reached an accuracy of 78% when MSD tags were assigned to the words automatically, and 87.7% when the MSD tags were corrected by hand. Some words in Croatian are written independently but carry no stress of their own; they lean prosodically on the next or previous word. Such words are called clitics (proclitics and enclitics). In some cases in Croatian, the stress moves from the word that usually bears it to the proclitic. These rules were also implemented in the system, which increased its accuracy to 92.8%. Sometimes words from the text cannot be found in the lexicon. For such cases, a system for automatic lexical stress assignment was developed. The system consists of two models trained on data from the lexicon described above. One model was trained to predict the place of the stress and the other to predict the stress category (there are four possible stress categories in Croatian).
The accuracy of the model for predicting the place of the stress, measured by tenfold cross-validation, is 90.56%, and the accuracy of the model for predicting the stress category is 86.02%. The accuracy of the models was also tested on the text used to evaluate the rule-based system. The achieved accuracy is 97.4% for the place of the stress, 82.4% for the stress category, and 80.1% for both place and category together. The rule-based system achieved better accuracy than the model-based system for automatic stress assignment. However, because some words were not assigned a stress by the rule-based system, the model-based system was used to supplement it in those cases. This hybrid approach achieved an accuracy of 95.3%. This doctoral thesis also analysed syllable duration for Croatian and developed a duration model. The position of the syllable within the word and the sentence was found to affect its duration. On average, syllable duration increased by 41.4% compared to the reference value if the syllable was at the beginning of the word, and by 37.0% if it was at the end of the word. If the syllable was at the beginning of the sentence, its duration increased by 71.8% compared to the reference value, and by 104.75% if it was at the end of the sentence. The analysis also showed that contextual features affect syllable duration. The duration of the syllable increased by different percentages depending on the category of the consonants that followed it.
Three categories of features were taken into consideration in the duration model developed for Croatian: positional, contextual and stress-related. First, the accuracy of the duration model was tested with all three categories of features included. Then the accuracy of the model was tested with one category left out at a time, in order to determine how much each category contributes to the accuracy of the duration model. All three categories were found to affect the accuracy of the model to some degree, and the positional features have the greatest impact. The Tilt intonation model was applied for intonation modelling of Croatian. For that purpose, a database of 500 sentences was labelled with the corresponding Tilt labels. The best RMSE value obtained by comparing the generated F0 contour to the original is 22.2
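The hybrid stress assignment described in the abstract above (rule-based system first, model-based system only for words the rules leave unassigned) can be sketched roughly as below. Everything here is hypothetical: the function names, the representation of a stress as a (syllable index, category) pair, and the toy stand-ins for the two systems are illustrative assumptions, not the thesis implementation.

```python
from typing import Callable, Optional, Tuple

# Assumed representation: (index of the stressed syllable, stress category).
Stress = Tuple[int, str]


def assign_stress_hybrid(
    word: str,
    rule_based: Callable[[str], Optional[Stress]],
    model_based: Callable[[str], Stress],
) -> Tuple[Stress, str]:
    """Prefer the rule-based result; fall back to the model when it is missing."""
    result = rule_based(word)
    if result is not None:
        return result, "rules"
    return model_based(word), "model"


# Toy stand-ins so the sketch runs; real systems would replace both.
TOY_RULE_OUTPUT = {"alpha": (0, "category-1")}


def toy_rules(word: str) -> Optional[Stress]:
    """Return a stress only for words the toy rules cover."""
    return TOY_RULE_OUTPUT.get(word)


def toy_model(word: str) -> Stress:
    """Always predict something, as a trained model would."""
    return (0, "category-2")
```

Returning the source ("rules" or "model") alongside the stress makes it easy to measure how often the fallback is used when evaluating such a pipeline.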