266 research outputs found

    Speech Synthesis Based on Hidden Markov Models

    Recent development of the HMM-based speech synthesis system (HTS)

    A statistical parametric approach to speech synthesis based on hidden Markov models (HMMs) has grown in popularity over the last few years. In this approach, spectrum, excitation, and duration of speech are simultaneously modeled by context-dependent HMMs, and speech waveforms are generated from the HMMs themselves. Since December 2002, we have publicly released an open-source software toolkit named “HMM-based speech synthesis system (HTS)” to provide a research and development toolkit for statistical parametric speech synthesis. This paper describes recent developments of HTS in detail, as well as future release plans.
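
    As a rough illustration of the parameter-generation idea described above (and not the actual HTS implementation), the following Python sketch repeats invented per-state mean parameters for spectrum and log F0 according to a toy duration model; real HTS uses decision-tree-clustered context-dependent HMMs and maximum-likelihood parameter generation with dynamic features, followed by a vocoder.

    import numpy as np

    # Toy "context-dependent state models": per-state means for spectral
    # features (e.g. mel-cepstrum), log F0, and an expected duration in frames.
    # In HTS these come from trained, decision-tree-clustered HMMs; the values
    # below are invented purely for illustration.
    STATE_MODELS = {
        "a": [dict(dur=5, mgc=np.full(25, 0.1), lf0=5.0),
              dict(dur=7, mgc=np.full(25, 0.2), lf0=5.1)],
        "t": [dict(dur=3, mgc=np.full(25, -0.1), lf0=float("nan")),   # unvoiced
              dict(dur=4, mgc=np.full(25, -0.2), lf0=float("nan"))],
    }

    def generate_parameters(phone_sequence):
        """Concatenate per-state mean trajectories for a phone sequence.

        Real systems generate smooth trajectories with the ML parameter
        generation algorithm (delta / delta-delta constraints); this sketch
        simply repeats each state mean for its expected duration.
        """
        mgc_frames, lf0_frames = [], []
        for phone in phone_sequence:
            for state in STATE_MODELS[phone]:
                for _ in range(state["dur"]):
                    mgc_frames.append(state["mgc"])
                    lf0_frames.append(state["lf0"])
        return np.array(mgc_frames), np.array(lf0_frames)

    mgc, lf0 = generate_parameters(["t", "a"])
    print(mgc.shape, lf0.shape)   # (frames, 25) spectrum, (frames,) log F0
    # A vocoder (e.g. an MLSA filter) would then synthesize a waveform from
    # these spectrum and excitation trajectories.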

    Speech vocoding for laboratory phonology

    Using phonological speech vocoding, we propose a platform for exploring relations between phonology and speech processing, and in broader terms, for exploring relations between the abstract and physical structures of a speech signal. Our goal is to take a step towards bridging phonology and speech processing and to contribute to the program of Laboratory Phonology. We show three application examples for laboratory phonology: compositional phonological speech modelling, a comparison of phonological systems, and an experimental phonological parametric text-to-speech (TTS) system. The featural representations of the following three phonological systems are considered in this work: (i) Government Phonology (GP), (ii) the Sound Pattern of English (SPE), and (iii) the extended SPE (eSPE). Comparing GP- and eSPE-based vocoded speech, we conclude that the latter achieves slightly better results than the former. However, GP, the most compact phonological speech representation, performs comparably to the systems with a higher number of phonological features. The parametric TTS based on the phonological speech representation, and trained from an unlabelled audiobook in an unsupervised manner, achieves 85% of the intelligibility of state-of-the-art parametric speech synthesis. We envision that the presented approach paves the way for researchers in both fields to form meaningful hypotheses that are explicitly testable using the concepts developed and exemplified in this paper. On the one hand, laboratory phonologists might test the applied concepts of their theoretical models, and on the other hand, the speech processing community may utilize the concepts developed for the theoretical phonological models to improve current state-of-the-art applications.
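
    To make the notion of a featural phonological representation concrete, here is a toy Python sketch that maps a few phonemes to SPE-style binary feature vectors and expands them to frame level; the feature set and values are illustrative only and are not the GP, SPE, or eSPE inventories actually used in the paper.

    # Illustrative SPE-style binary features for a handful of phonemes.
    # The real GP/SPE/eSPE feature sets differ; this only shows the shape of
    # the data a phonological vocoder or phonological TTS front end consumes.
    FEATURES = ["voice", "nasal", "high", "low", "back", "continuant"]

    PHONEME_FEATURES = {
        "p": [0, 0, 0, 0, 0, 0],
        "m": [1, 1, 0, 0, 0, 0],
        "i": [1, 0, 1, 0, 0, 1],
        "a": [1, 0, 0, 1, 1, 1],
    }

    def encode(phonemes, frames_per_phone=5):
        """Expand a phoneme sequence into per-frame phonological feature vectors."""
        rows = []
        for p in phonemes:
            rows.extend([PHONEME_FEATURES[p]] * frames_per_phone)
        return rows

    matrix = encode(["m", "a", "p", "i"])
    print(len(matrix), "frames x", len(FEATURES), "features")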

    Measuring the Quality of Low-Resourced Statistical Parametric Speech Synthesis Trained with Noise-Degraded Data

    After the successful implementation of speech synthesis in several languages, the study of robustness became an important topic, so as to increase the possibility of building voices from non-standard sources, e.g. historical recordings, children's speech, and data freely available on the Internet. In this work, the influence of noise in the source speech of an HMM-based statistical parametric speech synthesis system is measured, for the case of a low-resourced database. For this purpose, three types of additive noise were considered at five signal-to-noise ratio levels to affect the source speech data. Using objective measures to assess the perceptual quality of the results and the propagation of the noise through all the stages of building the synthetic voices, the results show a severe drop in the quality of the artificial speech, even for the lower levels of noise. Such degradation seems to be independent of the noise type, and increases less than proportionally with the noise level. These results are of importance for any practical implementation of speech synthesis from degraded data in similar conditions, and show that applying denoising processes becomes mandatory in order to preserve the possibility of building intelligible voices.
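
    A minimal sketch of the kind of noise degradation described above: mixing an additive noise signal into clean source speech at a target signal-to-noise ratio. The function, signals, and SNR value are illustrative assumptions; the paper's exact noise types and levels are not reproduced here.

    import numpy as np

    def add_noise_at_snr(clean, noise, snr_db):
        """Mix `noise` into `clean` so the result has the requested SNR in dB.

        The noise is scaled so that 10*log10(P_clean / P_noise_scaled) == snr_db.
        """
        noise = noise[: len(clean)]
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2)
        scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
        return clean + scale * noise

    # Synthetic example; in practice `clean` would be the low-resourced source
    # recordings and `noise` one of the additive noise types under study.
    rng = np.random.default_rng(0)
    clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
    degraded = add_noise_at_snr(clean, rng.standard_normal(16000), snr_db=10)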

    The Blizzard Challenge 2009

    The Blizzard Challenge 2009 was the fifth annual Blizzard Challenge. As in 2008, UK English and Mandarin Chinese were the chosen languages for the 2009 Challenge. The English corpus was the same one used in 2008. The Mandarin corpus was provided by iFLYTEK. As usual, participants with limited resources or limited experience in these languages had the option of using unaligned labels that were provided for both corpora and for the test sentences. An accent-specific pronunciation dictionary was also available for the English speaker. This year, the tasks were organised in the form of ‘hubs’ and ‘spokes’, where each hub task involved building a general-purpose voice and each spoke task involved building a voice for a specific application. A set of test sentences was released to participants, who were given a limited time in which to synthesise them and submit the synthetic speech. An online listening test was conducted to evaluate naturalness, intelligibility, degree of similarity to the original speaker and, for one of the spoke tasks, “appropriateness”.

    Discriminative multi-stream postfilters based on deep learning for enhancing statistical parametric speech synthesis

    Statistical parametric speech synthesis based on hidden Markov models has been an important technique for the production of artificial voices, due to its ability to produce results with high intelligibility and sophisticated features such as voice conversion and accent modification with a small footprint, particularly for low-resource languages where deep learning-based techniques remain unexplored. Despite this progress, the quality of the results, mainly based on hidden Markov models (HMMs), does not reach that of the predominant approaches, based on unit selection of speech segments or on deep learning. One of the proposals to improve the quality of HMM-based speech has been the incorporation of postfiltering stages, which aim to increase the quality while preserving the advantages of the process. In this paper, we present a new approach to postfiltering synthesized voices through the application of discriminative postfilters built with several long short-term memory (LSTM) deep neural networks. Our motivation stems from modeling a specific mapping from synthesized to natural speech on the segments corresponding to voiced or unvoiced sounds, given the different qualities of those sounds and the distinct degradation that HMM-based voices can present on each one. The paper analyses the discriminative postfilters obtained using five voices, evaluated using three objective measures, including the Mel cepstral distance, as well as subjective tests. The results indicate the advantages of the discriminative postfilters in comparison with the HTS voice and the non-discriminative postfilters.
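
    The following PyTorch sketch illustrates the idea of a discriminative postfilter: separate LSTM networks for voiced and unvoiced segments, each mapping synthesized feature frames towards natural ones. Model sizes, feature dimensions, and the voiced/unvoiced routing are assumptions for illustration and do not reproduce the paper's exact architecture or training procedure.

    import torch
    import torch.nn as nn

    class SegmentPostfilter(nn.Module):
        """LSTM that maps synthesized cepstral frames to enhanced ones."""
        def __init__(self, n_feats=40, hidden=128):
            super().__init__()
            self.lstm = nn.LSTM(n_feats, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_feats)

        def forward(self, x):            # x: (batch, frames, n_feats)
            h, _ = self.lstm(x)
            return self.out(h)

    # Discriminative use: one postfilter per sound class, applied only to the
    # frames of that class (here simply voiced vs. unvoiced; the paper's exact
    # segmentation and training targets may differ).
    postfilters = {"voiced": SegmentPostfilter(), "unvoiced": SegmentPostfilter()}

    def enhance(frames, voiced_mask):
        """frames: (T, 40) synthesized features; voiced_mask: (T,) bool."""
        out = frames.clone()
        for name, mask in [("voiced", voiced_mask), ("unvoiced", ~voiced_mask)]:
            if mask.any():
                seg = frames[mask].unsqueeze(0)          # (1, T_seg, 40)
                out[mask] = postfilters[name](seg).squeeze(0).detach()
        return out

    frames = torch.randn(200, 40)        # stand-in for synthesized features
    voiced = torch.rand(200) > 0.5       # stand-in for a voiced/unvoiced decision
    enhanced = enhance(frames, voiced)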

    Expressive speech synthesis from Broadcast News

    Speech synthesis is the computer process of converting text to voice. This project consists of synthesizing voices that can read news with an appropriate expression, since achieving expressiveness in the generated speech is important for obtaining natural-sounding voices. Conventional speech synthesis systems use as training data audio signals recorded specifically for training the voice models. In this project, however, the data was obtained from a news TV station, in order to test speech synthesis with a different kind of database. An important part of the work done in this TFG has been preparing the data later used in synthesis. The audio and its transcriptions were labelled so as to differentiate the expressions recorded: explaining good or bad news, or talking about relevant or trivial topics. A phonetic segmentation of the database was obtained in order to create the models used in the speech synthesis. After preparing all the audio and transcription data, statistical parametric models were estimated and used to synthesize test voices, in order to evaluate the previous preparation work. The whole project was developed in a Linux environment, using Ogmios, AHOCoder and the HTS toolkit as the main software. The results obtained after synthesizing the voices show that the data preparation process is correct, but the synthesized voices did not have sufficient quality. This is due to the adaptation of the voices towards very heterogeneous samples, caused by the number of different speakers used to train the models.