3 research outputs found

    Normal-to-Lombard Adaptation of Speech Synthesis Using Long Short-Term Memory Recurrent Neural Networks

    In this article, three adaptation methods are compared based on how well they change the speaking style of a neural-network-based text-to-speech (TTS) voice. The speaking style conversion adopted here is from normal to Lombard speech. The selected adaptation methods are auxiliary features (AF), learning hidden unit contribution (LHUC), and fine-tuning (FT). Furthermore, four state-of-the-art TTS vocoders are compared in the same context: GlottHMM, GlottDNN, STRAIGHT, and pulse model in log-domain (PML). Objective and subjective evaluations were conducted to study the performance of both the adaptation methods and the vocoders. The subjective evaluations assessed speaking style similarity and speech intelligibility. In addition to acoustic model adaptation, phoneme durations were also adapted from normal to Lombard with the FT adaptation method. In the objective evaluations and the speaking style similarity tests, the FT method outperformed the other two adaptation methods. In the speech intelligibility tests, there were no significant differences between vocoders, although the PML vocoder performed slightly better than the other three. Peer reviewed.

    The limits of the Mean Opinion Score for speech synthesis evaluation

    The release of WaveNet and Tacotron has forever transformed the speech synthesis landscape. Thanks to these game-changing innovations, the quality of synthetic speech has reached unprecedented levels. However, to measure this leap in quality, an overwhelming majority of studies still rely on the Absolute Category Rating (ACR) protocol and compare systems using its output: the Mean Opinion Score (MOS). This protocol is not without controversy, and as current state-of-the-art synthesis systems now produce outputs remarkably close to human speech, it is vital to determine how reliable this score is. To do so, we conducted a series of four experiments replicating and following the 2013 edition of the Blizzard Challenge. With these experiments, we asked four questions about the MOS: How stable is the MOS of a system across time? How do the scores of lower-quality systems influence the MOS of higher-quality systems? How does the introduction of modern technologies influence the scores of past systems? How does the MOS of modern technologies evolve in isolation? The results of our experiments are manifold. Firstly, we verify the superiority of modern technologies in comparison to historical synthesis. Then, we show that despite its origin as an absolute category rating, MOS is a relative score. While minimal variations are observed during the replication of the 2013-EH2 task, these variations can still lead to different conclusions for the intermediate systems. Our experiments also illustrate the sensitivity of MOS to the presence or absence of lower and higher anchors. Overall, our experiments suggest that we may have reached the end of a cul-de-sac by only evaluating the overall quality with MOS. We must embark on a new road and develop different evaluation protocols better suited to the analysis of modern speech synthesis technologies.

    Segmental foreign accent

    200 p. Traditionally, foreign accent has been studied from a holistic perspective, that is, treated as a whole rather than as a set of individual features that occur simultaneously. Previous studies that focused on one of these individual features have generally done so at the suprasegmental level (Tajima et al., 1997; Munro & Derwing, 2001; Hahn, 2004, etc.). This thesis analyzes foreign accent from a segmental point of view. Since little research exists in this field, our main objective is to determine whether the results of previous holistic studies can be extrapolated to the segmental level. To analyze the segmental level in detail, this thesis presents techniques that make use of new technologies. To gather as much information as possible, the perceptual experiments are carried out with listeners of widely varying linguistic profiles, in terms of first language or knowledge of the second language, and the results are compared with the existing literature. Our results show that some important effects related to the production and perception of accented segments can go unnoticed in a holistic analysis, and they support the need for further studies of minimal units in order to understand in depth the effects of foreign accent on communication.