
    Evaluating speech synthesis intelligibility using Amazon Mechanical Turk

    Microtask platforms such as Amazon Mechanical Turk (AMT) are increasingly used to create speech and language resources. AMT in particular allows researchers to quickly recruit a large number of fairly demographically diverse participants. In this study, we investigated whether AMT can be used for comparing the intelligibility of speech synthesis systems. We conducted two experiments in the lab and via AMT, one comparing US English diphone to US English speaker-adaptive HTS synthesis and one comparing UK English unit selection to UK English speaker-dependent HTS synthesis. While AMT word error rates were worse than lab error rates, AMT results were more sensitive to relative differences between systems, mainly due to the larger number of listeners. Boxplots and multilevel modelling allowed us to identify listeners who performed particularly badly, while thresholding was sufficient to eliminate rogue workers. We conclude that AMT is a viable platform for synthetic speech intelligibility comparisons.
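
    The study's comparisons rest on per-listener word error rates plus a simple threshold for excluding rogue workers. Below is a minimal sketch of how those quantities could be computed; the WER definition is the standard word-level edit distance, but the function names and the 0.8 cut-off are illustrative assumptions rather than values from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # d[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


def flag_rogue_workers(mean_wer_by_worker, threshold=0.8):
    """Threshold per-worker mean WER; the 0.8 cut-off is illustrative only."""
    return {w for w, wer in mean_wer_by_worker.items() if wer > threshold}
```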

    Evaluating speech intelligibility enhancement for HMM-based synthetic speech in noise

    It is possible to increase the intelligibility of speech in noise by enhancing the clean speech signal. In this paper we demonstrate the effects of modifying the spectral envelope of synthetic speech according to the environmental noise. To achieve this, we modify Mel cepstral coefficients according to an intelligibility measure that accounts for glimpses of speech in noise: the Glimpse Proportion measure. We evaluate this method against a baseline synthetic voice trained only with normal speech and a topline voice trained with Lombard speech, as well as natural speech. The intelligibility of these voices was measured when mixed with speech-shaped noise and with a competing speaker at three different levels. The Lombard voices, both natural and synthetic, were more intelligible than the normal voices in all conditions. For speech-shaped noise, the proposed modified voice was as intelligible as the Lombard synthetic voice without requiring any recordings of Lombard speech, which are hard to obtain. However, in the case of competing-talker noise, the Lombard synthetic voice was more intelligible than the proposed modified voice. Index Terms: HMM-based speech synthesis, intelligibility of speech in noise, Lombard speech.
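
    The Glimpse Proportion measure counts the fraction of spectro-temporal regions in which the speech level exceeds the noise level by a local SNR threshold (around 3 dB in Cooke's formulation). The sketch below approximates the measure with an STFT rather than the auditory (gammatone) filterbank normally used, so it illustrates the idea rather than reproducing the authors' implementation.

```python
import numpy as np

def glimpse_proportion(speech, noise, local_snr_db=3.0, frame_len=400, hop=160):
    """Fraction of time-frequency cells where speech exceeds noise by
    `local_snr_db`. STFT power stands in for a gammatone filterbank."""
    def stft_power(x):
        window = np.hanning(frame_len)
        n_frames = 1 + (len(x) - frame_len) // hop
        frames = np.stack([x[i * hop:i * hop + frame_len] * window
                           for i in range(n_frames)])
        return np.abs(np.fft.rfft(frames, axis=1)) ** 2

    s_pow = stft_power(np.asarray(speech, dtype=float))
    n_pow = stft_power(np.asarray(noise, dtype=float))
    n = min(len(s_pow), len(n_pow))
    local_snr = 10.0 * np.log10((s_pow[:n] + 1e-12) / (n_pow[:n] + 1e-12))
    return float(np.mean(local_snr > local_snr_db))
```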

    Modulation of speech-in-noise comprehension through transcranial current stimulation with the phase-shifted speech envelope

    This work is licensed under a Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/). Neural activity tracks the envelope of a speech signal at latencies from 50 ms to 300 ms. Modulating this neural tracking through transcranial alternating current stimulation influences speech comprehension. Two important variables that can affect this modulation are the latency and the phase of the stimulation with respect to the sound. While previous studies have found an influence of both variables on speech comprehension, the interaction between the two has not yet been measured. We presented 17 subjects with speech in noise coupled with simultaneous transcranial alternating current stimulation. The currents were based on the envelope of the target speech but shifted by different phases, as well as by two temporal delays of 100 ms and 250 ms. We also employed various control stimulations, and assessed the signal-to-noise ratio at which the subject understood half of the speech. We found that, at both latencies, speech comprehension is modulated by the phase of the current stimulation. However, the form of the modulation differed between the two latencies. Phase and latency of neurostimulation accordingly have distinct influences on speech comprehension. The different effects at the latencies of 100 ms and 250 ms hint at distinct neural processes for speech processing. Peer reviewed.
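
    The stimulation currents are described as the target-speech envelope shifted by different phases and delayed by 100 ms or 250 ms. The sketch below is one plausible construction of such a current, assuming a low-pass-filtered Hilbert envelope, a uniform phase rotation of the envelope's analytic signal, and zero-padding for the delay; it is not the authors' exact signal chain.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def stimulation_current(speech, sr, phase_shift_rad, delay_ms):
    """Phase-shifted, delayed speech-envelope current (illustrative)."""
    # Broadband amplitude envelope, kept below ~10 Hz.
    env = np.abs(hilbert(speech))
    b, a = butter(4, 10.0 / (sr / 2), btype="low")
    env = filtfilt(b, a, env)
    # Rotate the envelope's analytic signal to shift its phase uniformly.
    analytic = hilbert(env - env.mean())
    shifted = np.real(analytic * np.exp(1j * phase_shift_rad)) + env.mean()
    # Delay the current relative to the sound (e.g. 100 ms or 250 ms).
    delay = int(round(sr * delay_ms / 1000.0))
    return np.concatenate([np.zeros(delay), shifted])
```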

    Evaluating the speech quality of the Norwegian synthetic voice Brage

    Proceedings of the 18th Nordic Conference of Computational Linguistics NODALIDA 2011. Editors: Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa. NEALT Proceedings Series, Vol. 11 (2011), 336-339. © 2011 The editors and contributors. Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt. Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/1695

    Relating Objective and Subjective Performance Measures for AAM-based Visual Speech Synthesizers

    We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that utilizes acoustic features as input, and one that utilizes a phonetic transcription as input. Both synthesizers are trained using the same data and their performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g. a single syllable) of ground-truth visual speech is incorrect, we find that the subjective score for the entire sequence is lower than for sequences generated by our synthesizers. This observation motivates further consideration of an often ignored issue: to what extent are subjective measures correlated with objective measures of performance? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicator of viewer perception of quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp of synthesized visual speech parameters to the respective ground-truth parameters is a better indicator of subjective quality.
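
    The objective measure proposed here is the cost of a dynamic time warp between synthesized and ground-truth visual speech parameter trajectories. A minimal DTW-cost sketch, assuming Euclidean frame distances and path-length normalisation (both illustrative choices), follows.

```python
import numpy as np

def dtw_cost(synth, truth):
    """Accumulated DTW cost between two (frames x dims) trajectories,
    normalised by path length; lower means closer to ground truth."""
    synth, truth = np.asarray(synth, float), np.asarray(truth, float)
    n, m = len(synth), len(truth)
    dist = np.linalg.norm(synth[:, None, :] - truth[None, :, :], axis=-1)
    acc = np.full((n, m), np.inf)
    acc[0, 0] = dist[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if (i > 0 and j > 0) else np.inf)
            acc[i, j] = dist[i, j] + prev
    return float(acc[-1, -1] / (n + m))
```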

    Letter-based speech synthesis

    Initial attempts at performing text-to-speech conversion based on standard orthographic units are presented, forming part of a larger scheme of training TTS systems on features that can be trivially extracted from text. We evaluate the possibility of using the technique of decision-tree-based context clustering, conventionally used in HMM-based systems for parameter tying, to handle letter-to-sound conversion. We present the application of a method of compound-feature discovery to corpus-based speech synthesis. Finally, an evaluation of the intelligibility of letter-based systems and more conventional phoneme-based systems is presented.

    Statistical parametric speech synthesis for Ibibio

    Ibibio is a Nigerian tone language, spoken in the south-east coastal region of Nigeria. Like most African languages, it is resource-limited. This presents a major challenge to conventional approaches to speech synthesis, which typically require the training of numerous predictive models of linguistic features such as the phoneme sequence (i.e., a pronunciation dictionary plus a letter-to-sound model) and prosodic structure (e.g., a phrase break predictor). This training is invariably supervised, requiring a corpus of training data labelled with the linguistic feature to be predicted. In this paper, we investigate what can be achieved in the absence of many of these expensive resources, and also with a limited amount of recorded speech. We employ a statistical parametric method, because this has been found to offer good performance even on small corpora, and because it is able to directly learn the relationship between acoustics and whatever linguistic features are available, potentially mitigating the absence of explicit representations of intermediate linguistic layers such as prosody. We present an evaluation that compares systems with access to varying degrees of linguistic structure. The simplest system only uses phonetic context (quinphones), and this is compared to systems with access to a richer set of context features, with or without tone marking. It is found that the use of tone marking contributes significantly to the quality of synthetic speech. Future work should therefore address the problem of tone assignment using a dictionary and the building of a prediction module for out-of-vocabulary words. Key words: speech synthesis, Ibibio, low-resource languages, HTS.
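
    The systems compared differ only in how much linguistic context they see, from plain quinphones up to richer feature sets with or without tone marks. The sketch below shows one simple way quinphone context labels with an optional tone field could be assembled from a phone sequence; the label format and the example phones are illustrative, not the HTS label set actually used in the paper.

```python
def quinphone_labels(phones, tones=None, pad="sil"):
    """Build quinphone context labels, optionally appending a tone mark
    to the centre phone (e.g. 'k^a-d+u=sil/T:H'). Format is illustrative."""
    labels = []
    padded = [pad, pad] + list(phones) + [pad, pad]
    for i in range(len(phones)):
        ll, l, c, r, rr = padded[i:i + 5]
        label = f"{ll}^{l}-{c}+{r}={rr}"
        if tones is not None:
            label += f"/T:{tones[i]}"
        labels.append(label)
    return labels

# Made-up phones and high/low/neutral tone marks, purely for illustration.
print(quinphone_labels(["k", "a", "d", "u"], tones=["0", "H", "0", "L"]))
```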

    Development of the Slovak HMM-Based TTS System and Evaluation of Voices in Respect to the Used Vocoding Techniques

    This paper describes the development of a Slovak text-to-speech system which applies a technique wherein speech is directly synthesized from hidden Markov models. Statistical models for Slovak speech units are trained using newly created female and male phonetically balanced speech corpora. In addition, contextual information about phonemes, syllables, words, phrases, and utterances was determined, as well as questions for decision-tree-based context clustering algorithms. In this paper, recent statistical parametric speech synthesis methods, including the conventional, STRAIGHT, and AHOcoder speech synthesis systems, are implemented and evaluated. Objective evaluation methods (mel-cepstral distortion and fundamental frequency comparison) and subjective ones (mean opinion score and a semantically unpredictable sentences test) are carried out to compare these systems with each other and to evaluate their overall quality. The result of this work is a set of text-to-speech systems for the Slovak language characterized by very good intelligibility and quite good naturalness of the output utterances. In the subjective intelligibility tests, the STRAIGHT-based female voice and the AHOcoder-based male voice reached the highest scores.
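
    Mel-cepstral distortion, one of the objective measures listed, is conventionally computed per frame as (10 / ln 10) · sqrt(2 · Σ (c_d − ĉ_d)²), averaged over time-aligned frames and excluding the 0th (energy) coefficient. A minimal sketch of that standard formula:

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep, syn_mcep):
    """Frame-averaged mel-cepstral distortion in dB between time-aligned
    (frames x order) matrices, excluding the 0th coefficient."""
    ref, syn = np.asarray(ref_mcep, float), np.asarray(syn_mcep, float)
    diff = ref[:, 1:] - syn[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```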

    How reliable are online speech intelligibility studies with known listener cohorts?

    Although the use of nontraditional settings for speech perception experiments is growing, there have been few controlled comparisons of online and laboratory modalities in the context of speech intelligibility. The current study compares outcomes from three web-based replications of recent laboratory studies involving distorted, masked, filtered, and enhanced speech, amounting to 40 separate conditions. Rather than relying on unrestricted crowdsourcing, this study made use of participants from the population that would normally volunteer to take part physically in laboratory experiments. In sentence transcription tasks, the web cohort produced intelligibility scores 3–6 percentage points lower than their laboratory counterparts, and test modality interacted with experimental condition. These disparities and interactions largely disappeared after the exclusion of those web listeners who self-reported the use of low-quality headphones, and the remaining listener cohort was also able to replicate key outcomes of each of the three laboratory studies. The laboratory and web modalities produced similar measures of experimental efficiency based on listener variability, response errors, and outlier counts. These findings suggest that the combination of known listener cohorts and moderate headphone quality provides a feasible alternative to traditional laboratory intelligibility studies. Basque Government Consolidados programme under Grant No. IT311-1
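
    The headline comparison reduces to mean keyword scores per cohort, recomputed after excluding web listeners who self-reported low-quality headphones. A toy sketch of that bookkeeping, with made-up field names, is shown below.

```python
def mean_score(listeners):
    """Mean per-listener keyword accuracy, in percentage points."""
    return 100.0 * sum(l["correct"] / l["total"] for l in listeners) / len(listeners)

def compare_cohorts(lab, web):
    """Lab-web score gap before and after excluding web listeners who
    self-reported low-quality headphones (field names are illustrative)."""
    web_good = [l for l in web if l.get("headphone_quality") != "low"]
    return {
        "gap_all": mean_score(lab) - mean_score(web),
        "gap_after_exclusion": mean_score(lab) - mean_score(web_good),
    }
```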