7 research outputs found

    Location, location: Enhancing the evaluation of text-to-speech synthesis using the rapid prosody transcription paradigm

    Text-to-Speech synthesis systems are generally evaluated using Mean Opinion Score (MOS) tests, where listeners score samples of synthetic speech on a Likert scale. A major drawback of MOS tests is that they only offer a general measure of overall quality (i.e., the naturalness of an utterance) and so cannot tell us where exactly synthesis errors occur. This can make evaluation of the appropriateness of prosodic variation within utterances inconclusive. To address this, we propose a novel evaluation method based on the Rapid Prosody Transcription paradigm. This allows listeners to mark the locations of errors in an utterance in real time, providing a probabilistic representation of the perceptual errors that occur in the synthetic signal. We conduct experiments that confirm that the fine-grained evaluation can be mapped to system rankings of standard MOS tests, but the error marking gives a much more comprehensive assessment of synthesized prosody. In particular, for standard audiobook test set samples, we see that error marks consistently cluster around words at major prosodic boundaries indicated by punctuation. However, for question-answer based stimuli, where we control information structure, we see differences emerge in the ability of neural TTS systems to generate context-appropriate prosodic prominence. Comment: Accepted to Speech Synthesis Workshop 2019: https://ssw11.hte.hu/en
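
    A minimal sketch of how per-word error probabilities might be aggregated from rapid-prosody-transcription marks, assuming each listener simply flags word indices; the data layout and function name are illustrative assumptions, not the paper's implementation.

    ```python
    from collections import defaultdict

    def error_mark_probabilities(annotations, num_words):
        """Aggregate binary error marks from multiple listeners into
        per-word probabilities for one synthesized utterance.

        annotations: list of sets, one per listener, each containing the
                     indices of words that listener marked as erroneous.
        num_words:   number of words in the utterance.
        """
        counts = defaultdict(int)
        for marked_words in annotations:
            for w in marked_words:
                counts[w] += 1
        n_listeners = len(annotations)
        # Probability a word is perceived as erroneous = fraction of listeners who marked it.
        return [counts[w] / n_listeners for w in range(num_words)]

    # Hypothetical example: 4 listeners, 6-word utterance.
    marks = [{2, 5}, {2}, {5}, {2, 5}]
    print(error_mark_probabilities(marks, 6))
    # -> [0.0, 0.0, 0.75, 0.0, 0.0, 0.75]
    ```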

    Comparing acoustic and textual representations of previous linguistic context for improving Text-to-Speech

    Text alone does not contain sufficient information to predict the spoken form. Using additional information, such as the linguistic context, should improve Text-to-Speech naturalness in general, and prosody in particular. Most recent research on using context is limited to using textual features of adjacent utterances, extracted with large pre-trained language models such as BERT. In this paper, we compare multiple representations of linguistic context by conditioning a Text-to-Speech model on features of the preceding utterance. We experiment with three design choices: (1) acoustic vs. textual representations; (2) features extracted with large pre-trained models vs. features learnt jointly during training; and (3) representing context at the utterance level vs. word level. Our results show that appropriate representations of either text or acoustic context alone yield significantly better naturalness than a baseline that does not use context. Combining an utterance-level acoustic representation with a word-level textual representation gave the best results overall.
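
    A minimal PyTorch-style sketch of one way an utterance-level context embedding could condition a TTS encoder: the context vector is broadcast over time and concatenated with the phoneme encoder outputs before decoding. Module names and dimensions are assumptions, not the architecture described in the paper.

    ```python
    import torch
    import torch.nn as nn

    class ContextConditionedEncoder(nn.Module):
        """Phoneme encoder whose outputs are augmented with a single
        utterance-level embedding of the preceding utterance."""

        def __init__(self, num_phones=100, phone_dim=256, context_dim=64):
            super().__init__()
            self.phone_embedding = nn.Embedding(num_phones, phone_dim)
            self.encoder = nn.LSTM(phone_dim, phone_dim, batch_first=True)
            # Project [phone_dim + context_dim] back to phone_dim so a
            # downstream decoder is unchanged.
            self.project = nn.Linear(phone_dim + context_dim, phone_dim)

        def forward(self, phone_ids, context_vec):
            # phone_ids:   (batch, time) phoneme indices of the current utterance
            # context_vec: (batch, context_dim) embedding of the previous utterance,
            #              e.g. mean-pooled acoustic features or a sentence vector
            x = self.phone_embedding(phone_ids)
            x, _ = self.encoder(x)
            ctx = context_vec.unsqueeze(1).expand(-1, x.size(1), -1)  # broadcast over time
            return self.project(torch.cat([x, ctx], dim=-1))

    # Hypothetical usage
    enc = ContextConditionedEncoder()
    out = enc(torch.randint(0, 100, (2, 30)), torch.randn(2, 64))
    print(out.shape)  # torch.Size([2, 30, 256])
    ```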

    The semantic representation of events and entities in FunGramKB

    This article presents two plausible methodologies for two specific areas of the ontological component of the FunGramKB suite: events and entities. For events, we briefly present the properties that characterize them as conceptual units within FunGramKB. We then describe the methodology applied, using examples involving conceptual units drawn from the cognitive domain of events, #COMMUNICATION. The aim is to show, in practical terms, the decisions and considerations that the knowledge engineer may face when working with event conceptual units in the FunGramKB suite. For entities, the goal is to establish analysis criteria to resolve the problems arising from their conceptual representation. To this end, we first systematize the definition of an entity according to the model, then establish the criteria for dictionary work, and finally describe how to proceed with the formalization in COREL.
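
    A purely illustrative Python sketch of how an event concept with a thematic frame might be held as a data structure; the field names and the concept label are hypothetical assumptions and this is neither FunGramKB's internal format nor COREL syntax.

    ```python
    from dataclasses import dataclass, field

    @dataclass
    class ConceptualUnit:
        """Illustrative stand-in for an ontological concept: an event with a
        thematic frame and a placeholder meaning postulate."""
        concept_id: str                    # hypothetical label, e.g. "+SPEAK_00"
        domain: str                        # cognitive domain, e.g. "#COMMUNICATION"
        thematic_frame: list = field(default_factory=list)  # participant roles
        meaning_postulate: str = ""        # in FunGramKB this would be a COREL expression

    speak = ConceptualUnit(
        concept_id="+SPEAK_00",
        domain="#COMMUNICATION",
        thematic_frame=["Theme (speaker)", "Referent (message)", "Goal (addressee)"],
        meaning_postulate="<COREL expression omitted: illustrative only>",
    )
    print(speak.domain, speak.thematic_frame)
    ```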

    Using local linguistic context for text-to-speech

    Synthetic speech generated by state-of-the-art Text-to-Speech (TTS) models achieves unprecedented levels of naturalness. Training, inference and evaluation of TTS models have consistently been performed on isolated utterances stripped of contextual information, despite evidence from linguistics that context can affect speech. In this thesis, we hypothesize that we can further improve synthetic speech naturalness by leveraging local linguistic context, which we define as the utterance that immediately precedes the current one, considering both its textual and acoustic contents, with a focus on the latter. The experimental work in this thesis is divided into three parts. In the first part, we develop and test a method to condition sequence-to-sequence TTS models on representations of the context utterance. Preliminary results conditioning on an acoustic representation show that it is possible to improve synthetic speech with our method, when evaluating single utterances through listening tests. Next, we systematically compare different context representations, and we find significantly better naturalness scores when combining acoustic and textual representations from context to condition TTS systems. In the second part, we explore alternative methods to incorporate contextual information. We do not find improvements by conditioning inference only on context representations, or by augmenting the TTS input with features extracted from textual context. In the last part of this thesis, we analyse and evaluate the best method proposed in part one. We begin by testing our method on several challenging datasets of diverse nature, establishing its limitations. Subsequently, we evaluate our method by applying an in-context listening test design proposed in previous work. Unexpectedly, we see that ground-truth speech might not be considered more natural when listened to in-context than as isolated utterances, contrary to previous results. We finish by proposing to apply local coherence models, trained on sequences of natural speech data, as an objective evaluation of synthetic speech in-context. Through this evaluation, we see that our method, using ground-truth acoustic context, provides improvements in-context only when trained with speech from a speaker with high predictability at the local linguistic context level, encoded through acoustic features alone.
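
    A minimal sketch of one way a local coherence score over consecutive utterances could be computed, here as the average cosine similarity between embeddings of adjacent utterances; this is an assumed simplification for illustration, not the coherence model trained in the thesis.

    ```python
    import numpy as np

    def local_coherence_score(utterance_embeddings):
        """Score a sequence of utterance-level embeddings (e.g. mean-pooled
        acoustic features) by the average cosine similarity between each
        utterance and the one that precedes it. Higher = more locally coherent.

        utterance_embeddings: array of shape (num_utterances, dim)
        """
        e = np.asarray(utterance_embeddings, dtype=float)
        e = e / np.linalg.norm(e, axis=1, keepdims=True)  # unit-normalise each utterance
        sims = np.sum(e[:-1] * e[1:], axis=1)             # cosine of adjacent pairs
        return float(sims.mean())

    # Hypothetical usage: compare natural speech against synthetic speech
    # inserted into the same surrounding context.
    natural = np.random.randn(10, 128)
    print(local_coherence_score(natural))
    ```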

    An unsupervised method to select a speaker subset from large multi-speaker speech synthesis datasets

    Large multi-speaker datasets for TTS typically contain diverse speakers, recording conditions, styles and quality of data. Although one might generally presume that more data is better, in this paper we show that a model trained on a carefully chosen subset of speakers from LibriTTS provides significantly better quality synthetic speech than a model trained on a larger set. We propose an unsupervised methodology to find this subset by clustering per-speaker acoustic representations.
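
    A minimal sketch of the general idea: cluster per-speaker acoustic representations, then keep the speakers from one cluster. The embedding source, the number of clusters, and the largest-cluster selection rule are assumptions for illustration, not the paper's exact procedure.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    def select_speaker_subset(speaker_ids, speaker_embeddings, n_clusters=5):
        """Cluster per-speaker acoustic representations and return the speakers
        in the largest cluster, as a stand-in rule for a 'consistent' subset.

        speaker_ids:        list of speaker identifiers
        speaker_embeddings: array of shape (num_speakers, dim), one vector per
                            speaker (e.g. averaged speaker embeddings)
        """
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        labels = km.fit_predict(np.asarray(speaker_embeddings))
        largest = np.bincount(labels).argmax()  # pick the most populated cluster
        return [s for s, lab in zip(speaker_ids, labels) if lab == largest]

    # Hypothetical usage with random embeddings for 200 speakers
    ids = [f"spk_{i}" for i in range(200)]
    subset = select_speaker_subset(ids, np.random.randn(200, 256))
    print(len(subset), subset[:5])
    ```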