
    Location, location: Enhancing the evaluation of text-to-speech synthesis using the rapid prosody transcription paradigm

    Text-to-Speech synthesis systems are generally evaluated using Mean Opinion Score (MOS) tests, where listeners score samples of synthetic speech on a Likert scale. A major drawback of MOS tests is that they only offer a general measure of overall quality (i.e., the naturalness of an utterance) and so cannot tell us where exactly synthesis errors occur. This can make evaluation of the appropriateness of prosodic variation within utterances inconclusive. To address this, we propose a novel evaluation method based on the Rapid Prosody Transcription paradigm. This allows listeners to mark the locations of errors in an utterance in real-time, providing a probabilistic representation of the perceptual errors that occur in the synthetic signal. We conduct experiments that confirm that the fine-grained evaluation can be mapped to system rankings of standard MOS tests, but the error marking gives a much more comprehensive assessment of synthesized prosody. In particular, for standard audiobook test set samples, we see that error marks consistently cluster around words at major prosodic boundaries indicated by punctuation. However, for question-answer based stimuli, where we control information structure, we see differences emerge in the ability of neural TTS systems to generate context-appropriate prosodic prominence.
    Comment: Accepted to Speech Synthesis Workshop 2019: https://ssw11.hte.hu/en
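    To make the error-marking idea concrete, the sketch below aggregates per-listener binary word marks into per-word error probabilities, which is the kind of probabilistic representation the abstract describes. It is an illustrative toy example only; the words, marks, and variable names are invented and do not come from the paper.

```python
# Toy illustration (not the paper's code): aggregating per-listener binary
# error marks, as collected in a Rapid Prosody Transcription style test,
# into a per-word error probability for one synthesized utterance.
import numpy as np

words = ["the", "weather", "was", "fine", "yesterday"]

# One row per listener, one column per word: 1 = word marked as erroneous.
# These marks are invented placeholders.
marks = np.array([
    [0, 1, 0, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
])

# The proportion of listeners marking each word gives a probabilistic
# picture of where perceptual errors occur in the synthetic signal.
error_prob = marks.mean(axis=0)

for word, p in zip(words, error_prob):
    print(f"{word:>10s}  {p:.2f}")
```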

    Comparing acoustic and textual representations of previous linguistic context for improving Text-to-Speech

    Text alone does not contain sufficient information to predict the spoken form. Using additional information, such as the linguistic context, should improve Text-to-Speech naturalness in general, and prosody in particular. Most recent research on using context is limited to textual features of adjacent utterances, extracted with large pre-trained language models such as BERT. In this paper, we compare multiple representations of linguistic context by conditioning a Text-to-Speech model on features of the preceding utterance. We experiment with three design choices: (1) acoustic vs. textual representations; (2) features extracted with large pre-trained models vs. features learnt jointly during training; and (3) representing context at the utterance level vs. word level. Our results show that appropriate representations of either text or acoustic context alone yield significantly better naturalness than a baseline that does not use context. Combining an utterance-level acoustic representation with a word-level textual representation gave the best results overall.
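    As a rough illustration of the conditioning setup described above (not the authors' implementation), the PyTorch sketch below adds an utterance-level embedding of the preceding utterance to every state of a sequence-to-sequence TTS encoder. All module names and dimensions are assumptions made for this sketch.

```python
# Minimal sketch, under assumed names/dimensions, of conditioning a
# sequence-to-sequence TTS encoder on an utterance-level context vector
# (e.g. a learnt acoustic summary or a BERT sentence embedding of the
# previous utterance).
import torch
import torch.nn as nn

class ContextConditionedEncoder(nn.Module):
    def __init__(self, phone_vocab=100, enc_dim=256, ctx_dim=64):
        super().__init__()
        self.phone_emb = nn.Embedding(phone_vocab, enc_dim)
        self.encoder = nn.LSTM(enc_dim, enc_dim // 2, batch_first=True,
                               bidirectional=True)
        # Project the context vector to the encoder dimension.
        self.ctx_proj = nn.Linear(ctx_dim, enc_dim)

    def forward(self, phone_ids, context_vec):
        x = self.phone_emb(phone_ids)                  # (B, T, enc_dim)
        x, _ = self.encoder(x)                         # (B, T, enc_dim)
        ctx = self.ctx_proj(context_vec).unsqueeze(1)  # (B, 1, enc_dim)
        # Broadcast-add the utterance-level context to every encoder frame,
        # so the attention/decoder sees context-aware encoder states.
        return x + ctx

enc = ContextConditionedEncoder()
out = enc(torch.randint(0, 100, (2, 17)), torch.randn(2, 64))
print(out.shape)  # torch.Size([2, 17, 256])
```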

    Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis

    We aim to characterize how different speakers contribute to the perceived output quality of multi-speaker Text-to-Speech (TTS) synthesis. We automatically rate the quality of TTS using a neural network (NN) trained on human mean opinion score (MOS) ratings. First, we train and evaluate our NN model on 13 different TTS and voice conversion (VC) systems from the ASVspoof 2019 Logical Access (LA) Dataset. Since it is not known how best to represent speech for this task, we compare 8 different representations alongside MOSNet frame-based features. Our representations include image-based spectrogram features and x-vector embeddings that explicitly model different types of noise such as T60 reverberation time. Our NN predicts MOS with a high correlation to human judgments. We report prediction correlation and error. A key finding is that the quality achieved for certain speakers seems consistent, regardless of the TTS or VC system. It is widely accepted that some speakers give higher quality than others for building a TTS system: our method provides an automatic way to identify such speakers. Finally, to see if our quality prediction models generalize, we predict quality scores for synthetic speech using a separate multi-speaker TTS system that was trained on LibriTTS data, and conduct our own MOS listening test to compare human ratings with our NN predictions.
    Comment: Accepted at Speaker Odyssey 202
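    The sketch below illustrates the general recipe in a hedged way: fit a small regressor on utterance-level speech representations to predict MOS, then report correlation and error on held-out data. The features and ratings here are synthetic placeholders, and the model is not the MOSNet-based architecture used in the paper.

```python
# Hedged sketch of MOS prediction from utterance-level features
# (e.g. x-vectors); all data below is synthetic stand-in material.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 512))      # one feature vector per utterance
# Fake MOS ratings on a 1-5 scale, loosely correlated with the features.
mos = np.clip(3.0 + X[:, 0] + 0.2 * rng.normal(size=500), 1.0, 5.0)

X_tr, X_te, y_tr, y_te = train_test_split(X, mos, test_size=0.2,
                                          random_state=0)
model = MLPRegressor(hidden_layer_sizes=(128, 32), max_iter=500,
                     random_state=0)
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
r, _ = pearsonr(y_te, pred)
rmse = np.sqrt(np.mean((y_te - pred) ** 2))
print(f"Pearson r = {r:.2f}, RMSE = {rmse:.2f}")
```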

    The semantic representation of events and entities in FunGramKB

    This article presents two plausible methodologies for two specific areas of the ontological component of the FunGramKB suite: events and entities. For events, we briefly present the properties that characterize them as conceptual units within FunGramKB. We then describe the methodology applied, drawing on examples involving conceptual units derived from the cognitive domain of events, #COMMUNICATION. The aim is to show, in practical terms, the decisions and considerations that a knowledge engineer may face when working with event conceptual units in the FunGramKB suite. For entities, the goal is to establish criteria of analysis for resolving the problems arising from their conceptual representation. To that end, we first systematize the model's definition of an entity, then set out the criteria for working with the dictionary, and finally the procedure for formalization in COREL.

    Multi-style Text-to-Speech using Recurrent Neural Networks for Chilean Spanish


    Using local linguistic context for text-to-speech

    Synthetic speech generated by state-of-the-art Text-to-Speech (TTS) models achieves unprecedented levels of naturalness. Training, inference and evaluation of TTS models have consistently been performed on isolated utterances stripped of contextual information, despite evidence from linguistics that context can affect speech. In this thesis, we hypothesize that we can further improve synthetic speech naturalness by leveraging local linguistic context, which we define as the utterance that immediately precedes another one, considering both its textual and acoustic contents, with a focus on the latter. The experimental work in this thesis is divided into three parts. In the first part, we develop and test a method to condition sequence-to-sequence TTS models on representations of the context utterance. Preliminary results conditioning on an acoustic representation show that it is possible to improve synthetic speech with our method when evaluating single utterances through listening tests. Next, we systematically compare different context representations, and we find significantly better naturalness scores when combining acoustic and textual representations from context to condition TTS systems. In the second part, we explore alternative methods to incorporate contextual information. We do not find improvements by conditioning inference only on context representations, or by augmenting the TTS input with features extracted from textual context. In the last part of this thesis, we analyse and evaluate the best method proposed in part one. We begin by testing our method on several challenging data sets of diverse nature, establishing its limitations. Subsequently, we evaluate our method by applying an in-context listening test design proposed in previous work. Unexpectedly, we see that ground-truth speech might not be considered more natural when listened to in-context than as isolated utterances, contrary to previous results. We finish by proposing to apply local coherence models, trained on sequences of natural speech data, as an objective evaluation of synthetic speech in-context. Through this evaluation, we see that our method, using ground-truth acoustic context, provides improvements in-context only when trained with speech from a speaker with high predictability at the local linguistic context level, encoded through acoustic features alone.
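    As a simplified illustration of the final proposal above (local coherence models as an objective in-context evaluation), the sketch below trains a scorer to distinguish true consecutive utterance-embedding pairs from mismatched ones. The embeddings are random placeholders and the architecture is an assumption for illustration, not the coherence model used in the thesis.

```python
# Hedged sketch: a pairwise "local coherence" scorer trained on natural
# speech embedding pairs; higher scores for synthetic speech paired with
# its true context would suggest better in-context behaviour.
import torch
import torch.nn as nn

class PairCoherenceScorer(nn.Module):
    """Scores a (previous, current) utterance-embedding pair."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, prev_emb, cur_emb):
        return self.net(torch.cat([prev_emb, cur_emb], dim=-1)).squeeze(-1)

scorer = PairCoherenceScorer()
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(100):
    # Random placeholder embeddings stand in for real utterance encodings.
    prev, cur = torch.randn(32, 128), torch.randn(32, 128)
    # Positives: true consecutive pairs. Negatives: the same "current"
    # utterances paired with a permuted (hence incoherent) previous one.
    neg_prev = prev[torch.randperm(32)]
    logits = torch.cat([scorer(prev, cur), scorer(neg_prev, cur)])
    labels = torch.cat([torch.ones(32), torch.zeros(32)])
    loss = loss_fn(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```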