Location, location: Enhancing the evaluation of text-to-speech synthesis using the rapid prosody transcription paradigm
Text-to-Speech synthesis systems are generally evaluated using Mean Opinion
Score (MOS) tests, where listeners score samples of synthetic speech on a
Likert scale. A major drawback of MOS tests is that they offer only a general
measure of overall quality (that is, the naturalness of an utterance) and so
cannot tell us exactly where synthesis errors occur. This can make evaluation
of the appropriateness of prosodic variation within utterances inconclusive. To
address this, we propose a novel evaluation method based on the Rapid Prosody
Transcription paradigm. This allows listeners to mark the locations of errors
in an utterance in real-time, providing a probabilistic representation of the
perceptual errors that occur in the synthetic signal. We conduct experiments
that confirm that the fine-grained evaluation can be mapped to system rankings
of standard MOS tests, but the error marking gives a much more comprehensive
assessment of synthesized prosody. In particular, for standard audiobook test
set samples, we see that error marks consistently cluster around words at major
prosodic boundaries indicated by punctuation. However, for question-answer
based stimuli, where we control information structure, we see differences
emerge in the ability of neural TTS systems to generate context-appropriate
prosodic prominence.
Comment: Accepted to Speech Synthesis Workshop 2019: https://ssw11.hte.hu/en
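The probabilistic representation of perceptual errors described above can be obtained by aggregating binary per-word marks across listeners into per-word error probabilities. The following is a minimal illustrative sketch, not code from the paper; the utterance, listener marks, and function name are all hypothetical:

```python
from collections import Counter

def error_mark_probabilities(words, listener_marks):
    """Aggregate binary per-word error marks from many listeners into a
    per-word probability of being perceived as erroneous (RPT-style)."""
    counts = Counter()
    for marks in listener_marks:          # one set of marked word indices per listener
        counts.update(marks)
    n = len(listener_marks)
    return [counts[i] / n for i in range(len(words))]

# Hypothetical example: 4 listeners marking a 5-word utterance.
words = ["the", "cat", "sat", "on", "mat"]
marks = [{1}, {1, 4}, {1}, {4}]
probs = error_mark_probabilities(words, marks)
# "cat" was marked by 3 of 4 listeners, "mat" by 2 of 4
```

Clusters of high-probability words (e.g. around major prosodic boundaries) can then be compared across systems, which is finer-grained than a single MOS score per utterance.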
Comparing acoustic and textual representations of previous linguistic context for improving Text-to-Speech
Text alone does not contain sufficient information to predict the spoken form. Using additional information, such as the linguistic context, should improve Text-to-Speech naturalness in general, and prosody in particular. Most recent research on using context is limited to textual features of adjacent utterances, extracted with large pre-trained language models such as BERT. In this paper, we compare multiple representations of linguistic context by conditioning a Text-to-Speech model on features of the preceding utterance. We experiment with three design choices: (1) acoustic vs. textual representations; (2) features extracted with large pre-trained models vs. features learnt jointly during training; and (3) representing context at the utterance level vs. word level. Our results show that appropriate representations of either text or acoustic context alone yield significantly better naturalness than a baseline that does not use context. Combining an utterance-level acoustic representation with a word-level textual representation gave the best results overall.
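One common way to condition a sequence-to-sequence TTS model on an utterance-level context feature is to broadcast-concatenate the context vector onto every encoder timestep. This is a hypothetical sketch of that scheme, not the paper's actual architecture; the shapes, function name, and dimensions are assumptions for illustration:

```python
import numpy as np

def condition_on_context(encoder_outputs, context_embedding):
    """Broadcast-concatenate an utterance-level context embedding onto
    every encoder timestep, the simplest conditioning scheme."""
    T = encoder_outputs.shape[0]
    tiled = np.tile(context_embedding, (T, 1))            # (T, context_dim)
    return np.concatenate([encoder_outputs, tiled], axis=-1)

# Hypothetical shapes: 50 phone-level encoder states of dimension 256,
# plus a 64-dim acoustic context vector from the preceding utterance.
enc = np.zeros((50, 256))
ctx = np.ones(64)
cond = condition_on_context(enc, ctx)                     # shape (50, 320)
```

A word-level textual representation would instead supply one context vector per input token, aligned to the encoder states rather than tiled.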
The semantic representation of events and entities in FunGramKB
This article presents two plausible methodologies for two specific areas of the
ontological component of the FunGramKB suite: events and entities. In the case
of events, we briefly present the properties that characterize them as
conceptual units within FunGramKB. We then describe the methodology applied,
drawing on examples involving conceptual units derived from the cognitive
domain of events, #COMMUNICATION. The aim is to show, in practical terms, the
decisions and considerations a knowledge engineer may face when working with
event conceptual units in the FunGramKB suite. In the case of entities, we seek
to establish criteria of analysis to solve the problems arising from their
conceptual representation. To this end, we first systematize the model's
definition of an entity, then establish the criteria for working with the
dictionary and, finally, the procedure for formalization in COREL.
Using local linguistic context for text-to-speech
Synthetic speech generated by state-of-the-art Text-to-Speech (TTS) models achieves unprecedented levels of naturalness. Training, inference and evaluation of TTS models have consistently been performed on isolated utterances stripped of contextual information, despite evidence from linguistics that context can affect speech. In this thesis, we hypothesize that we can further improve synthetic speech naturalness by leveraging local linguistic context, which we define as the utterance that immediately precedes another one, considering both its textual and acoustic contents, with a focus on the latter.
The experimental work in this thesis is divided into three parts. In the first part, we develop and test a method to condition sequence-to-sequence TTS models on representations of the context utterance. Preliminary results conditioning on an acoustic representation show that it is possible to improve synthetic speech with our method when evaluating single utterances through listening tests. Next, we systematically compare different context representations, and we find significantly better naturalness scores when combining acoustic and textual representations of context to condition TTS systems.
In the second part, we explore alternative methods to incorporate contextual information. We do not find improvements by conditioning inference only on context representations, or by augmenting the TTS input with features extracted from textual context.
In the last part of this thesis, we analyse and evaluate the best method proposed in part one. We begin by testing our method on several challenging data sets of diverse nature, establishing its limitations. Subsequently, we evaluate our method by applying an in-context listening test design proposed in previous work. Unexpectedly, we see that ground-truth speech might not be considered more natural when listened to in context than as isolated utterances, contrary to previous results.
We finish by proposing to apply local coherence models, trained on sequences of natural speech data, as an objective in-context evaluation of synthetic speech. Through this evaluation, we see that our method, using ground-truth acoustic context, provides in-context improvements only when trained on speech from a speaker whose local linguistic context is highly predictable from acoustic features alone.
An unsupervised method to select a speaker subset from large multi-speaker speech synthesis datasets
Large multi-speaker datasets for TTS typically contain diverse speakers, recording conditions, styles and quality of data. Although one might generally presume that more data is better, in this paper we show that a model trained on a carefully chosen subset of speakers from LibriTTS provides significantly better quality synthetic speech than a model trained on a larger set. We propose an unsupervised methodology to find this subset by clustering per-speaker acoustic representations.
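Clustering per-speaker acoustic representations and then keeping one cluster is one way such an unsupervised selection could look. The sketch below is an illustrative stand-in, not the paper's actual pipeline: it uses a tiny hand-rolled k-means over hypothetical 2-D speaker embeddings (real systems would use higher-dimensional embeddings such as averaged speaker vectors):

```python
import numpy as np

def select_speaker_subset(embeddings, k=2, n_iter=20, seed=0):
    """Cluster per-speaker acoustic embeddings with a small k-means and
    return the indices of speakers in the most populous cluster, as a
    stand-in for selecting a homogeneous training subset."""
    rng = np.random.default_rng(seed)
    centers = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(n_iter):
        # Assign each speaker to its nearest centroid, then refit centroids.
        dists = np.linalg.norm(embeddings[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = embeddings[labels == c].mean(axis=0)
    largest = np.bincount(labels, minlength=k).argmax()
    return np.where(labels == largest)[0]

# Hypothetical data: 6 speakers, two well-separated groups of 3.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
subset = select_speaker_subset(emb, k=2)   # indices of one tight group
```

A practical version would also need to choose k and decide which cluster(s) to keep, for example by synthesis quality on a held-out set.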