221 research outputs found
Evaluating speech synthesis intelligibility using Amazon Mechanical Turk
Microtask platforms such as Amazon Mechanical Turk (AMT) are increasingly used to create speech and language resources. AMT in particular allows researchers to quickly recruit a large number of fairly demographically diverse participants. In this study, we investigated whether AMT can be used for comparing the intelligibility of speech synthesis systems. We conducted two experiments in the lab and via AMT, one comparing US English diphone to US English speaker-adaptive HTS synthesis and one comparing UK English unit selection to UK English speaker-dependent HTS synthesis. While AMT word error rates were worse than lab error rates, AMT results were more sensitive to relative differences between systems. This is mainly due to the larger number of listeners. Boxplots and multilevel modelling allowed us to identify listeners who performed particularly badly, while thresholding was sufficient to eliminate rogue workers. We conclude that AMT is a viable platform for synthetic speech intelligibility comparisons.
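As a rough illustration of the two measures this abstract mentions, the sketch below computes a word error rate from a listener's typed response and drops workers whose mean error rate exceeds a threshold. The abstract does not give the study's scoring rules or threshold, so the function names, the lowercase whitespace tokenisation, and the 0.8 cut-off are assumptions made for this example only.

```python
from statistics import mean


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def filter_workers(responses, max_mean_wer=0.8):
    """Keep workers whose mean WER is at or below the threshold.

    responses maps a worker id to a list of (reference, typed_response) pairs.
    The 0.8 default is a hypothetical value, not the one used in the study.
    """
    kept = {}
    for worker, pairs in responses.items():
        score = mean(word_error_rate(ref, hyp) for ref, hyp in pairs)
        if score <= max_mean_wer:
            kept[worker] = score
    return kept
```

Note that a word error rate computed this way can exceed 1.0 when a response contains many inserted words, which is one reason a simple threshold can separate careless workers from listeners who merely found the stimuli difficult.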
A Review of Evaluation Practices of Gesture Generation in Embodied Conversational Agents
Embodied Conversational Agents (ECA) take on different forms, including
virtual avatars or physical agents, such as a humanoid robot. ECAs are often
designed to produce nonverbal behaviour to complement or enhance their verbal
communication. One form of nonverbal behaviour is co-speech gesturing, which
involves movements that the agent makes with its arms and hands that are paired
with verbal communication. Co-speech gestures for ECAs can be created using
different generation methods, such as rule-based and data-driven processes.
However, reports on gesture generation methods use a variety of evaluation
measures, which hinders comparison. To address this, we conducted a systematic
review on co-speech gesture generation methods for iconic, metaphoric, deictic
or beat gestures, including their evaluation methods. We reviewed 22 studies
that had an ECA with a human-like upper body that used co-speech gesturing in a
social human-agent interaction, including a user study to evaluate its
performance. We found most studies used a within-subject design and relied on a
form of subjective evaluation, but lacked a systematic approach. Overall,
methodological quality was low-to-moderate and few systematic conclusions could
be drawn. We argue that the field requires rigorous and uniform tools for the
evaluation of co-speech gesture systems. We have proposed recommendations for
future empirical evaluation, including standardised phrases and test scenarios
to test generative models. We have proposed a research checklist that can be
used to report relevant information for the evaluation of generative models as
well as to evaluate co-speech gesture use.
Intelligibility of synthetic speech in noise and reverberation
Synthetic speech is a valuable means of output, in a range of application contexts,
for people with visual, cognitive, or other impairments or for situations where other
means are not practicable. Noise and reverberation occur in many of these application
contexts and are known to have devastating effects on the intelligibility of natural
speech, yet very little was known about the effects on synthetic speech based on unit
selection or hidden Markov models.
In this thesis, we put forward an approach for assessing the intelligibility of
synthetic and natural speech in noise, reverberation, or a combination of the two.
The approach uses an experimental methodology consisting of Amazon Mechanical
Turk, Matrix sentences, and noises that approximate the real world, evaluated with
generalized linear mixed models.
The experimental methodologies were assessed against their traditional counterparts
and were found to provide a number of additional benefits, whilst maintaining
equivalent measures of relative performance. Subsequent experiments were carried
out to establish the efficacy of the approach in measuring intelligibility in noise and
then reverberation. Finally, the approach was applied to natural speech and the two
synthetic speech systems in combinations of noise and reverberation.
We have examined and reported on the intelligibility of current synthesis systems in
real-life noises and reverberation using techniques that bridge the gap between the
audiology and speech synthesis communities and using Amazon Mechanical Turk. In
the process, we establish Amazon Mechanical Turk and Matrix sentences as valuable
tools in the assessment of synthetic speech intelligibility.
Text-to-Speech Synthesis Using Found Data for Low-Resource Languages
Text-to-speech synthesis is a key component of interactive, speech-based systems. Typically, building a high-quality voice requires collecting dozens of hours of speech from a single professional speaker in an anechoic chamber with a high-quality microphone. There are about 7,000 languages spoken in the world, and most do not enjoy the speech research attention historically paid to such languages as English, Spanish, Mandarin, and Japanese. Speakers of these so-called "low-resource languages" therefore do not equally benefit from these technological advances. While it takes a great deal of time and resources to collect a traditional text-to-speech corpus for a given language, we may instead be able to make use of various sources of "found" data which may be available. In particular, sources such as radio broadcast news and ASR corpora are available for many languages. While this kind of data does not exactly match what one would collect for a more standard TTS corpus, it may nevertheless contain parts which are usable for producing natural and intelligible parametric TTS voices.
In the first part of this thesis, we examine various types of found speech data in comparison with data collected for TTS, in terms of a variety of acoustic and prosodic features. We find that radio broadcast news in particular is a good match. Audiobooks may also be a good match despite their largely more expressive style, and certain speakers in conversational and read ASR corpora also resemble TTS speakers in their manner of speaking and thus their data may be usable for training TTS voices.
In the rest of the thesis, we conduct a variety of experiments in training voices on non-traditional sources of data, such as ASR data, radio broadcast news, and audiobooks. We aim to discover which methods produce the most intelligible and natural-sounding voices, focusing on three main approaches:
1) Training data subset selection. In noisy, heterogeneous data sources, we may wish to locate subsets of the data that are well-suited for building voices, based on acoustic and prosodic features that are known to correspond with TTS-style speech, while excluding utterances that introduce noise or other artifacts. We find that choosing subsets of speakers for training data can result in voices that are more intelligible.
2) Augmenting the frontend feature set with new features. In cleaner sources of found data, we may wish to train voices on all of the data, but we may get improvements in naturalness by including acoustic and prosodic features at the frontend and synthesizing in a manner that better matches the TTS style. We find that this approach is promising for creating more natural-sounding voices, regardless of the underlying acoustic model.
3) Adaptation. Another way to make use of high-quality data while also including informative acoustic and prosodic features is to adapt to subsets, rather than to select and train only on subsets. We also experiment with training on mixed high- and low-quality data, and adapting towards the high-quality set, which produces more intelligible voices than training on either type of data by itself.
We hope that our findings may serve as guidelines for anyone wishing to build their own TTS voice using non-traditional sources of found data.
Speeching: Mobile Crowdsourced Speech Assessment to Support Self-Monitoring and Management for People with Parkinson's
We present Speeching, a mobile application that uses crowdsourcing to support the self-monitoring and management of speech and voice issues for people with Parkinson's (PwP). The application allows participants to audio record short voice tasks, which are then rated and assessed by crowd workers. Speeching then feeds these results back to provide users with examples of how they were perceived by listeners unconnected to them (thus not used to their speech patterns). We conducted our study in two phases. First, we assessed the feasibility of utilising the crowd to provide ratings of speech and voice that are comparable to those of experts. We then conducted a trial to evaluate how the provision of feedback, using Speeching, was valued by PwP. Our study highlights how applications like Speeching open up new opportunities for self-monitoring in digital health and wellbeing, and provide a means for those without regular access to clinical assessment services to practice, and get meaningful feedback on, their speech.
Lip2AudSpec: Speech reconstruction from silent lip movements video
In this study, we propose a deep neural network for reconstructing
intelligible speech from silent lip movement videos. We use an auditory
spectrogram as the spectral representation of speech, together with its
corresponding sound generation method, resulting in more natural-sounding
reconstructed speech. Our proposed network consists of an autoencoder to extract
bottleneck features from the auditory spectrogram, which are then used as targets
for our main lip-reading network comprising CNN, LSTM and fully connected layers. Our
experiments show that the autoencoder is able to reconstruct the original
auditory spectrogram with a 98% correlation and also improves the quality of
reconstructed speech from the main lip-reading network. Our model, trained
jointly on different speakers, is able to extract individual speaker
characteristics and gives promising results in reconstructing intelligible
speech with superior word recognition accuracy.