The Cerevoice Blizzard Entry 2007: Are Small Database Errors Worse than Compression Artifacts?
In commercial systems the memory footprint of unit selection systems is often a key issue. This is especially true for PDAs and other embedded devices. In this year's Blizzard entry, CereProc® set itself the criterion that the full database system entered would have a smaller memory footprint than either of the two smaller database entries. This was accomplished by applying Speex speech compression to the full database entry. In turn, a set of small database techniques used to improve the quality of small database systems in last year's entry was extended. Finally, for all systems, two quality control methods were applied to the underlying database to improve the match between the lexicon and transcription and the underlying data. Results suggest that the mild audio quality artifacts introduced by lossy compression have almost as much impact on MOS perceived quality as the concatenation errors introduced by sparse data in the smaller systems with bulked diphones. Index Terms: speech synthesis, unit selection.
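The abstract does not give implementation details, but a minimal sketch of how Speex compression might be applied to a unit database is shown below, using the standard speexenc/speexdec command-line tools. The directory layout and quality setting are assumptions for illustration, not CereProc's actual pipeline.

```python
# Hypothetical sketch: batch-compress a unit-selection voice database with Speex
# using the speexenc/speexdec command-line tools. Paths and the quality setting
# are illustrative assumptions, not CereProc's actual configuration.
import subprocess
from pathlib import Path

DB_DIR = Path("voice_db/wav")        # assumed location of the mono wav units
OUT_DIR = Path("voice_db/spx")
OUT_DIR.mkdir(parents=True, exist_ok=True)

for wav in sorted(DB_DIR.glob("*.wav")):
    spx = OUT_DIR / (wav.stem + ".spx")
    # --quality trades footprint against audible artifacts (0 = smallest, 10 = best)
    subprocess.run(["speexenc", "--quality", "6", str(wav), str(spx)], check=True)

# At synthesis time each selected unit would be decoded back to wav before
# concatenation, e.g.:
#   subprocess.run(["speexdec", "unit.spx", "unit.wav"], check=True)
```

Lowering the quality value shrinks the footprint further at the cost of more audible coding artifacts, which is exactly the trade-off the entry evaluates against concatenation errors.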
The CereProc Blizzard Entry 2009: Some dumb algorithms that don't work
Within unit selection systems there is a constant tension between data sparsity and quality. This limits the control possible in a unit selection system. The RP data used in Blizzard this year and last year is expressive and spoken in a spirited manner. Last year's entry focused on maintaining expressiveness; this year we focused on two simple algorithms to restrain and control this prosodic variation: 1) variable-width valley floor pruning on duration and pitch (applied to the full database entry, EH1), and 2) bulking of the data with average HTS data (applied to the small database entry, EH2). Results for both techniques were disappointing. The full database system achieved an MOS of around 2 (compared to 4 for a similar system attempting to emphasise variation in 2008), while the small database entry also achieved an MOS of around 2 (compared to 3 for a similar system, but with a different voice, entered in 2007). Index Terms: speech synthesis, unit selection.
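As a rough illustration of what valley floor pruning might involve, the sketch below discards candidate units whose duration or mean F0 falls outside a variable-width band around the per-phone median. This is one reading of the abstract, not CereProc's actual algorithm; the field names and band widths are assumptions.

```python
# Hedged interpretation of "variable-width valley floor pruning": for each phone
# type, discard candidate units whose duration or mean F0 lies outside a window
# around the median, with the window width as a tunable parameter.
from statistics import median

def valley_floor_prune(candidates, width_dur=1.5, width_f0=1.5):
    """candidates: list of dicts with 'dur' (seconds) and 'f0' (Hz) per unit."""
    med_dur = median(c["dur"] for c in candidates)
    med_f0 = median(c["f0"] for c in candidates)
    kept = [
        c for c in candidates
        if abs(c["dur"] - med_dur) <= width_dur * med_dur
        and abs(c["f0"] - med_f0) <= width_f0 * med_f0
    ]
    # Fall back to the full set rather than return an empty candidate list.
    return kept if kept else candidates

# Example: prune prosodic outliers for one phone type before the Viterbi search.
units = [{"dur": 0.08, "f0": 180}, {"dur": 0.30, "f0": 95}, {"dur": 0.09, "f0": 175}]
print(valley_floor_prune(units, width_dur=0.5, width_f0=0.3))
```

Narrowing the widths restrains prosodic variation more aggressively, at the risk of starving the search of candidates for sparsely covered phones.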
Expressive speech synthesis: synthesising ambiguity
Previous work in HCI has shown that ambiguity, normally avoided in interaction design, can contribute to a user's engagement by increasing interest and uncertainty. In this work, we create and evaluate synthetic utterances where there is a conflict between the text content and the emotion in the voice. We show that: 1) text content measurably alters the negative/positive perception of a spoken utterance, 2) changes in voice quality also produce this effect, and 3) when the voice quality and text content conflict, the result is a synthesised ambiguous utterance. Results were analysed using an evaluation/activation space. Whereas the effect of text content was restricted to the negative/positive dimension (valence), voice quality also had a significant effect on how active or passive the utterance was perceived to be (activation). Index Terms: speech synthesis, unit selection, expressive speech synthesis, emotion, prosody
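For readers unfamiliar with the evaluation/activation (valence/activation) representation, the sketch below shows how per-condition listener ratings might be summarised as points in that 2D space. The condition names, scale, and data are purely illustrative and are not the paper's results.

```python
# Illustrative sketch: summarise listener ratings in a 2D evaluation/activation
# space by averaging (valence, activation) pairs per experimental condition.
import numpy as np

# ratings[condition] = list of (valence, activation) pairs on a -1..1 scale
ratings = {
    "positive_text_negative_voice": [(0.1, -0.3), (-0.2, -0.4), (0.0, -0.2)],
    "positive_text_positive_voice": [(0.6, 0.4), (0.5, 0.5), (0.7, 0.3)],
}

for condition, points in ratings.items():
    mean_val, mean_act = np.mean(points, axis=0)
    print(f"{condition}: valence={mean_val:+.2f} activation={mean_act:+.2f}")
```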
Proper Name Splicing in Computer Games with TTS
Building high quality synthesis systems with an open domain vocabulary and a small audio database is a challenging problem, even when the targeted application is well constrained. Monophone unit concatenation (as opposed to diphone) is an approach that can compensate for the poor unit coverage that a small database implies. However, joining at phone boundaries is a delicate task that requires accurate targeting. In this paper, we present an automatically trained targeting system based on the parametric synthesiser HTS and compare it to a concatenative monophone system and a baseline concatenative diphone system. We apply a novel evaluation methodology which includes a qualitative component and allows for fast incremental development of synthesis systems. Preliminary results show that although the hybrid system performed significantly more poorly on out-of-database items, it is less affected by segmentation errors than the monophone system. Index Terms: hybrid speech synthesis, unit selection, evaluation of TTS system
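The targeting system itself is not specified in the abstract, but one plausible ingredient of joining at phone boundaries is a spectral and pitch join cost between consecutive candidate units, sketched below. The features, weights, and lack of normalisation are assumptions, not the system described in the paper.

```python
# Minimal sketch of a spectral/pitch join cost at a phone boundary, one plausible
# ingredient of monophone (as opposed to diphone) concatenation.
import numpy as np

def join_cost(left_unit, right_unit, w_spec=1.0, w_f0=0.5):
    """left_unit/right_unit: dicts with 'mfcc' (frames x coeffs) and 'f0' (frames,)."""
    # Spectral discontinuity between the last frame of the left unit and the
    # first frame of the right unit, plus a pitch continuity term.
    spec_dist = np.linalg.norm(left_unit["mfcc"][-1] - right_unit["mfcc"][0])
    f0_dist = abs(left_unit["f0"][-1] - right_unit["f0"][0])
    return w_spec * spec_dist + w_f0 * f0_dist

# Example with random features standing in for real analysis frames.
rng = np.random.default_rng(0)
a = {"mfcc": rng.normal(size=(10, 13)), "f0": rng.uniform(100, 200, 10)}
b = {"mfcc": rng.normal(size=(12, 13)), "f0": rng.uniform(100, 200, 12)}
print(join_cost(a, b))
```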
Voice puppetry: exploring dramatic performance to develop speech synthesis
Technology and innovation are often inspired by nature. However, when technology enters the social domain, such as creating human-like voices or having human-like conversations, mimicry can become an objective rather than an inspiration. In this paper we argue that performance and acting can offer a radically different design agenda from the mimicry objective. We compare a human mimic's (Alec Baldwin's) vocal performance of a target voice (Donald Trump) with the synthesis and copy resynthesis of a cloned synthetic voice. We show that the conversational speaking style of natural performance is still a challenge to recreate with modern synthesis methods, and that resynthesis is hampered by current limitations in speech alignment approaches. We conclude by discussing how voice puppetry, where a human voice is used to drive a synthesis engine, could be used to advance the state of the art, and the challenges involved in developing a voice puppetry system.