Search CORE

221 research outputs found

Evaluating speech synthesis intelligibility using Amazon Mechanical Turk

Author: Isaac Karl B.
Renals Steve
Wolters Maria K.
Publication venue
Publication date: 01/01/2010
Field of study

Microtask platforms such as Amazon Mechanical Turk (AMT) are increasingly used to create speech and language resources. AMT in particular allows researchers to quickly recruit a large number of fairly demographically diverse participants. In this study, we investigated whether AMT can be used for comparing the intelligibility of speech synthesis systems. We conducted two experiments in the lab and via AMT, one comparing US English diphone to US English speaker-adaptive HTS synthesis and one comparing UK English unit selection to UK English speaker-dependent HTS synthesis. While AMT word error rates were worse than lab error rates, AMT results were more sensitive to relative differences between systems. This is mainly due to the larger number of listeners. Boxplots and multilevel modelling allowed us to identify listeners who performed particularly badly, while thresholding was sufficient to eliminate rogue workers. We conclude that AMT is a viable platform for synthetic speech intelligibility comparisons

CiteSeerX

Edinburgh Research Archive

Impacts of machine translation and speech synthesis on speech-to-speech translation

Author: Byrne William
Hashimoto Kei
King Simon
Tokuda Keiichi
Yamagishi Junichi
Publication venue: 'Elsevier BV'
Publication date: 01/01/2012
Field of study

Crossref

Edinburgh Research Explorer

Confidence Intervals for ASR-based TTS Evaluation

Author: Richmond Korin
Taylor Jason
Publication venue: 'International Speech Communication Association'
Publication date: 03/09/2021
Field of study

Edinburgh Research Explorer

A Review of Evaluation Practices of Gesture Generation in Embodied Conversational Agents

Author: Belpaeme Tony
Robinson Nicole
Wolfert Pieter
Publication venue
Publication date: 11/01/2021
Field of study

Embodied Conversational Agents (ECA) take on different forms, including virtual avatars or physical agents, such as a humanoid robot. ECAs are often designed to produce nonverbal behaviour to complement or enhance its verbal communication. One form of nonverbal behaviour is co-speech gesturing, which involves movements that the agent makes with its arms and hands that is paired with verbal communication. Co-speech gestures for ECAs can be created using different generation methods, such as rule-based and data-driven processes. However, reports on gesture generation methods use a variety of evaluation measures, which hinders comparison. To address this, we conducted a systematic review on co-speech gesture generation methods for iconic, metaphoric, deictic or beat gestures, including their evaluation methods. We reviewed 22 studies that had an ECA with a human-like upper body that used co-speech gesturing in a social human-agent interaction, including a user study to evaluate its performance. We found most studies used a within-subject design and relied on a form of subjective evaluation, but lacked a systematic approach. Overall, methodological quality was low-to-moderate and few systematic conclusions could be drawn. We argue that the field requires rigorous and uniform tools for the evaluation of co-speech gesture systems. We have proposed recommendations for future empirical evaluation, including standardised phrases and test scenarios to test generative models. We have proposed a research checklist that can be used to report relevant information for the evaluation of generative models as well as to evaluate co-speech gesture use.Comment: 9 page

arXiv.org e-Print Archive

Ghent University Academic Bibliography

Intelligibility of synthetic speech in noise and reverberation

Author: Isaac Karl Bruce
Publication venue: The University of Edinburgh
Publication date: 29/06/2015
Field of study

Synthetic speech is a valuable means of output, in a range of application contexts, for people with visual, cognitive, or other impairments or for situations were other means are not practicable. Noise and reverberation occur in many of these application contexts and are known to have devastating effects on the intelligibility of natural speech, yet very little was known about the effects on synthetic speech based on unit selection or hidden Markov models. In this thesis, we put forward an approach for assessing the intelligibility of synthetic and natural speech in noise, reverberation, or a combination of the two. The approach uses an experimental methodology consisting of Amazon Mechanical Turk, Matrix sentences, and noises that approximate the real-world, evaluated with generalized linear mixed models. The experimental methodologies were assessed against their traditional counterparts and were found to provide a number of additional benefits, whilst maintaining equivalent measures of relative performance. Subsequent experiments were carried out to establish the efficacy of the approach in measuring intelligibility in noise and then reverberation. Finally, the approach was applied to natural speech and the two synthetic speech systems in combinations of noise and reverberation. We have examine and report on the intelligibility of current synthesis systems in real-life noises and reverberation using techniques that bridge the gap between the audiology and speech synthesis communities and using Amazon Mechanical Turk. In the process, we establish Amazon Mechanical Turk and Matrix sentences as valuable tools in the assessment of synthetic speech intelligibility

Edinburgh Research Archive

Recommended from our members

Text-to-Speech Synthesis Using Found Data for Low-Resource Languages

Author: Cooper Erica Lindsay
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2019
Field of study

Text-to-speech synthesis is a key component of interactive, speech-based systems. Typically, building a high-quality voice requires collecting dozens of hours of speech from a single professional speaker in an anechoic chamber with a high-quality microphone. There are about 7,000 languages spoken in the world, and most do not enjoy the speech research attention historically paid to such languages as English, Spanish, Mandarin, and Japanese. Speakers of these so-called "low-resource languages" therefore do not equally benefit from these technological advances. While it takes a great deal of time and resources to collect a traditional text-to-speech corpus for a given language, we may instead be able to make use of various sources of "found'' data which may be available. In particular, sources such as radio broadcast news and ASR corpora are available for many languages. While this kind of data does not exactly match what one would collect for a more standard TTS corpus, it may nevertheless contain parts which are usable for producing natural and intelligible parametric TTS voices. In the first part of this thesis, we examine various types of found speech data in comparison with data collected for TTS, in terms of a variety of acoustic and prosodic features. We find that radio broadcast news in particular is a good match. Audiobooks may also be a good match despite their largely more expressive style, and certain speakers in conversational and read ASR corpora also resemble TTS speakers in their manner of speaking and thus their data may be usable for training TTS voices. In the rest of the thesis, we conduct a variety of experiments in training voices on non-traditional sources of data, such as ASR data, radio broadcast news, and audiobooks. We aim to discover which methods produce the most intelligible and natural-sounding voices, focusing on three main approaches: 1) Training data subset selection. In noisy, heterogeneous data sources, we may wish to locate subsets of the data that are well-suited for building voices, based on acoustic and prosodic features that are known to correspond with TTS-style speech, while excluding utterances that introduce noise or other artifacts. We find that choosing subsets of speakers for training data can result in voices that are more intelligible. 2) Augmenting the frontend feature set with new features. In cleaner sources of found data, we may wish to train voices on all of the data, but we may get improvements in naturalness by including acoustic and prosodic features at the frontend and synthesizing in a manner that better matches the TTS style. We find that this approach is promising for creating more natural-sounding voices, regardless of the underlying acoustic model. 3) Adaptation. Another way to make use of high-quality data while also including informative acoustic and prosodic features is to adapt to subsets, rather than to select and train only on subsets. We also experiment with training on mixed high- and low-quality data, and adapting towards the high-quality set, which produces more intelligible voices than training on either type of data by itself. We hope that our findings may serve as guidelines for anyone wishing to build their own TTS voice using non-traditional sources of found data

Columbia University Academic Commons

Speeching: Mobile Crowdsourced Speech Assessment to Support Self-Monitoring and Management for People with Parkinson's

Author: Audhkhasi Kartik
Bigham Jeffrey P.
Cavender Anna C.
Côté Nicolas
Evanini Keelan
Goto Masataka
McGraw Ian
Miller Nick
Parent Gabriel
Putnam R
Swan M
Wicks Paul
Wolters Maria K.
Ziegler Wolfram
Publication venue
Publication date: 01/01/2016
Field of study

We present Speeching, a mobile application that uses crowdsourcing to support the self-monitoring and management of speech and voice issues for people with Parkinson's (PwP). The application allows participants to audio record short voice tasks, which are then rated and assessed by crowd workers. Speeching then feeds these results back to provide users with examples of how they were perceived by listeners unconnected to them (thus not used to their speech patterns). We conducted our study in two phases. First we assessed the feasibility of utilising the crowd to provide ratings of speech and voice that are comparable to those of experts. We then conducted a trial to evaluate how the provision of feedback, using Speeching, was valued by PwP. Our study highlights how applications like Speeching open up new opportunities for self-monitoring in digital health and wellbeing, and provide a means for those without regular access to clinical assessment services to practice-and get meaningful feedback on-their speech

Northumbria Research Link

Crossref

Edinburgh Research Explorer

Lancaster E-Prints

Explore Bristol Research

Lip2AudSpec: Speech reconstruction from silent lip movements video

Author: Akbari Hassan
Arora Himani
Cao Liangliang
Mesgarani Nima
Publication venue
Publication date: 26/10/2017
Field of study

In this study, we propose a deep neural network for reconstructing intelligible speech from silent lip movement videos. We use auditory spectrogram as spectral representation of speech and its corresponding sound generation method resulting in a more natural sounding reconstructed speech. Our proposed network consists of an autoencoder to extract bottleneck features from the auditory spectrogram which is then used as target to our main lip reading network comprising of CNN, LSTM and fully connected layers. Our experiments show that the autoencoder is able to reconstruct the original auditory spectrogram with a 98% correlation and also improves the quality of reconstructed speech from the main lip reading network. Our model, trained jointly on different speakers is able to extract individual speaker characteristics and gives promising results of reconstructing intelligible speech with superior word recognition accuracy

arXiv.org e-Print Archive

Crossref