    Síntesis de voz aplicada a la traducción voz a voz

    In the field of speech technologies, text-to-speech conversion is the automatic generation of an artificial voice that sounds identical to a human voice reading a text aloud. Within a text-to-speech system, the prosody module produces the prosodic information needed to generate a natural voice: intonational phrases, sentence intonation, phoneme duration and energy, etc. Generating this information correctly has a direct impact on the naturalness and expressiveness of the system. The main goal of this thesis is the development of new algorithms for training prosody generation models that can be used in a text-to-speech system, and their use in the framework of speech-to-speech translation. Several alternatives for intonation modeling were studied. They combine parameterization and intonation model generation into an integrated process. This approach was judged successful in both objective and subjective evaluations. The influence of segmental and suprasegmental factors on duration modeling was also studied. Based on the results of these studies, several algorithms were proposed that combine segmental and suprasegmental information, as in other publications in this field. Finally, various phrase break models were analyzed, using both words and accent groups: classification trees (CART), language modeling (LM) and finite state transducers (FST). Using the same data set across the experiments made it possible to draw relevant conclusions about the differences between these models. One of the main goals of this thesis was to improve naturalness, expressiveness and consistency with the style of the source speaker in text-to-speech systems. This can be done by using the prosody of the source speaker as an additional information source in the framework of speech-to-speech translation.
Several algorithms were developed for prosody generation that can integrate this additional information when predicting intonation, phoneme duration and phrase breaks. To that end, several approaches were studied for transferring intonation from one language to the other. The chosen approach was an automatic clustering algorithm that finds a set of tonal movements that are related between languages, with no limit on their number. This coding can then be used for intonation modeling in the target language. Experimental results show an improvement that is more pronounced for closely related languages, such as Spanish and Catalan. Although no segmental duration transfer was performed between languages, this thesis proposes transferring rhythm from one language to the other. For that purpose, a method that combines rhythm transfer and audio synchronization was proposed. Synchronization is included because of its importance for speech-to-speech translation when video is also used. Lastly, this thesis also proposes a pause transfer technique in the framework of speech-to-speech translation, based on alignment information. Studies on the training data showed the advantage of tuples for this task. To predict any pause that cannot be transferred with this method, conventional pause prediction algorithms are used (CART, CART+LM, FST), taking into account the pauses already transferred.

Within speech technologies, text-to-speech conversion consists of the automatic generation of an artificial voice that produces the same sound as a person reading a text aloud. In short, text-to-speech converters are systems that convert text into synthetic speech. The text-to-speech conversion process is divided into three basic modules: text processing, prosody generation and synthetic speech generation.
In the first module the text is normalized (to expand abbreviations, convert numbers and dates into text, etc.), and sometimes morphosyntactic tagging is then performed as well. Next, graphemes are converted into phonemes and the text is syllabified to obtain the phoneme sequence needed to reproduce the text. The prosody module then generates the prosodic information needed to produce the voice. To do so, it predicts the intonational phrases and the intonation of the sentence, as well as the duration and energy of the phonemes, etc. Generating this information correctly has a direct effect on the naturalness and expressiveness of the system. The last module, speech generation, is where the voice is produced, taking into account the information provided by the text processing and prosody modules. The objective of this thesis is the development of new algorithms for training prosody generation models for text-to-speech conversion, and their application in the framework of speech-to-speech translation. For intonation modeling algorithms, the literature generally proposes approaches that include a stylization step prior to parameterization. This thesis studied alternatives that avoid this stylization by combining parameterization and intonation model generation into an integrated whole. This approach proved successful in both the objective evaluation (using measures such as the mean squared error or the Pearson correlation coefficient) and the subjective one. The evaluators judged the proposed approach to have higher quality and naturalness than the other algorithms from the literature included in the evaluations, reaching a naturalness MOS of 3.55 (4.63 for the original voice) and a quality MOS of 3.78 (4.78 for the original voice).

Postprint (published version)
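
The abstract names two objective measures for intonation models: the mean squared error and the Pearson correlation coefficient. The following is a minimal sketch of how such measures could be computed between a reference and a predicted F0 contour. It is not the thesis's evaluation code: the function names and the sample contour values are hypothetical, and it assumes both contours are already time-aligned and sampled at the same frames.

```python
import math


def mean_squared_error(reference, predicted):
    """Mean squared error between two equally long contours."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("contours must be non-empty and of equal length")
    return sum((r - p) ** 2 for r, p in zip(reference, predicted)) / len(reference)


def pearson_correlation(reference, predicted):
    """Pearson correlation coefficient between two equally long contours."""
    if len(reference) != len(predicted) or len(reference) < 2:
        raise ValueError("contours must have equal length and at least two points")
    n = len(reference)
    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_r == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant contour")
    return cov / math.sqrt(var_r * var_p)


if __name__ == "__main__":
    # Hypothetical F0 values in Hz, made up for illustration only.
    reference_f0 = [120.0, 130.0, 140.0, 135.0, 125.0]
    predicted_f0 = [118.0, 133.0, 138.0, 137.0, 121.0]
    print("MSE:", mean_squared_error(reference_f0, predicted_f0))
    print("Pearson r:", pearson_correlation(reference_f0, predicted_f0))
```

A lower mean squared error indicates that the predicted contour is closer to the reference in absolute terms, while a correlation near 1 indicates that the two contours rise and fall together even if their levels differ.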