115 research outputs found
Stochastic suprasegmentals: relationships between redundancy, prosodic structure and care of articulation in spontaneous speech
Within spontaneous speech there are wide variations in the articulation of the same word by the same speaker. This paper explores two related factors which influence variation in articulation: prosodic structure and redundancy. We argue that the constraint of producing robust communication while efficiently expending articulatory effort leads to an inverse relationship between language redundancy and care of articulation. The inverse relationship improves robustness by spreading the information more evenly across the speech signal, leading to a smoother signal redundancy profile. We argue that prosodic prominence is a linguistic means of achieving smooth signal redundancy. Prosodic prominence increases care of articulation and coincides with unpredictable sections of speech. By doing so, prosodic prominence leads to a smoother signal redundancy profile. Results confirm the strong relationship between prosodic prominence and care of articulation, as well as an inverse relationship between language redundancy and care of articulation.
Single Speaker Segmentation and Inventory Selection Using Dynamic Time Warping Self Organization and Joint Multigram Mapping
In speech synthesis the inventory of units is decided by inspection and on the basis of phonological and phonetic expertise. The ephone (or emergent phone) project at CSTR is investigating how self-organisation techniques can be applied to build an inventory based on collected acoustic data together with the constraints of a synthesis lexicon. In this paper we describe a prototype inventory creation method using dynamic time warping (DTW) for acoustic clustering and a joint multigram approach for relating a series of symbols that represent the speech to these emerged units. We initially examined two symbol sets: 1) a baseline of standard phones, and 2) orthographic symbols. The success of the approach is evaluated by comparing word boundaries generated by the emergent phones against those created using state-of-the-art HMM segmentation. Initial results suggest the DTW segmentation can match word boundaries with a root mean square error (RMSE) of 35ms. Mapping units onto phones resulted in a higher RMSE of 103ms. This error increased when multiple multigram types were added and when the default unit clustering was altered from 40 (our baseline) to 10. Orthographic matching had a higher RMSE of 125ms. To conclude, we discuss future work that we believe can reduce this error rate to a level sufficient for the techniques to be applied to a unit selection synthesis system.
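The two quantities at the heart of this abstract, a DTW distance for acoustic clustering and a boundary RMSE for scoring segmentations, can be sketched minimally as follows. This is an illustrative implementation, not the project's code; the function names and the assumption of Euclidean frame cost over feature vectors are ours.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences
    (frames x dims), using Euclidean frame cost and the standard
    three-way recurrence."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def boundary_rmse(pred, ref):
    """Root mean square error between paired predicted and reference
    boundary times (same units, e.g. seconds), as used to compare
    emergent-phone boundaries against HMM segmentation."""
    pred, ref = np.asarray(pred, float), np.asarray(ref, float)
    return float(np.sqrt(np.mean((pred - ref) ** 2)))
```

A pairwise DTW distance matrix over acoustic segments is the usual input to an agglomerative clustering step; the 35ms figure above corresponds to `boundary_rmse` evaluated in seconds as 0.035.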
Combining Statistical Parametric Speech Synthesis and Unit-Selection for Automatic Voice Cloning
The ability to use the recorded audio of a subject's voice to produce an open-domain synthesis system has generated much interest both in academic research and in commercial speech technology. The ability to produce synthetic versions of a subject's voice has potential commercial applications, such as virtual celebrity actors, or potential clinical applications, such as offering a synthetic replacement voice in the case of a laryngectomy. Recent developments in HMM-based speech synthesis have shown it is possible to produce synthetic voices from quite small amounts of speech data. However, mimicking the depth and variation of a speaker's prosody as well as synthesising natural voice quality is still a challenging research problem. In contrast, unit-selection systems have shown it is possible to strongly retain the character of the voice, but only with sufficient original source material. Often this runs into hours and may require significant manual checking and labelling. In this paper we present two state-of-the-art systems: an HMM-based system, HTS-2007, developed by CSTR and Nagoya Institute of Technology, and a commercial unit-selection system, CereVoice, developed by CereProc. Both systems have been used to mimic the voice of George W. Bush (43rd president of the United States) using freely available audio from the web. In addition we present a hybrid system which combines both technologies. We demonstrate examples of synthetic voices created from 10, 40 and 210 minutes of randomly selected speech. We then discuss the underlying problems associated with voice cloning using found audio, and the scalability of our solution.
The Smartphone: A Lacanian Stain, A Tech Killer, and an Embodiment of Radical Individualism
YAFR (Yet Another Futile Rant) presents the smartphone: an unstoppable piece of technology generated from a perfect storm of commercial, technological, social and psychological factors. We begin by misquoting Steve Jobs and by being unfairly rude about the HCI community. We then consider the smartphone's ability to kill off competing technology and to undermine collectivism. We argue that its roles as a Lacanian stain, an exploitative tool, and a means of concentrating power into the hands of the few make it a technology that will rival the personal automobile in its effect on modern society.
The Cerevoice Blizzard Entry 2007: Are Small Database Errors Worse than Compression Artifacts?
In commercial systems the memory footprint of unit selection systems is often a key issue. This is especially true for PDAs and other embedded devices. In this year's Blizzard entry, CereProc® gave itself the criterion that the full database system entered would have a smaller memory footprint than either of the two smaller database entries. This was accomplished by applying Speex speech compression to the full database entry. In turn, a set of small database techniques used to improve the quality of small database systems in last year's entry were extended. Finally, for all systems, two quality control methods were applied to the underlying database to improve the lexicon and transcription match to the underlying data. Results suggest that mild audio quality artifacts introduced by lossy compression have almost as much impact on MOS perceived quality as concatenation errors introduced by sparse data in the smaller systems with bulked diphones. Index Terms: speech synthesis, unit selection.
Speech Synthesis Without a Phone Inventory
In speech synthesis the unit inventory is decided using phonological and phonetic expertise. This process is resource intensive and potentially sub-optimal. In this paper we investigate how acoustic clustering, together with lexicon constraints, can be used to build a self-organised inventory. Six English speech synthesis systems were built using two frameworks, unit selection and parametric HTS, for three inventory conditions: 1) a traditional phone set, 2) a system using orthographic units, and 3) a self-organised inventory. A listening test showed a strong preference for the traditional system, and for the orthographic system over the self-organised system. Results also varied by letter-to-sound complexity and database coverage. This suggests the self-organised approach failed to generalise pronunciation, while also introducing noise above and beyond that caused by orthographic-to-sound mismatch.
Generating Narratives from Personal Digital Data: Using Sentiment, Themes, and Named Entities to Construct Stories
As the quantity and variety of personal digital data shared on social media continues to grow, how can users make sense of it? There is growing interest among HCI researchers in using narrative techniques to support interpretation and understanding. This work describes our prototype application, ReelOut, which uses narrative techniques to allow users to understand their data as more than just a database. The online service extracts data from multiple social media sources and augments it with semantic information such as sentiment, themes, and named entities. The interactive editor automatically constructs a story by using unit selection to fit data units to a simple narrative structure. It allows the user to change the story interactively by rejecting certain units or selecting a new narrative target. Finally, images from the story can be exported as a video clip or a collage.
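"Unit selection" here borrows the speech-synthesis framing: each narrative slot has a target cost (how well a data unit matches the slot's required attribute) and consecutive units have a join cost. A minimal dynamic-programming sketch under assumed costs (sentiment mismatch as target cost, temporal distance as join cost; all names and weights are hypothetical, not from the ReelOut system):

```python
import math

def select_units(units, structure, w_join=0.1):
    """units: list of dicts with 'sentiment' and 'time' keys;
    structure: one target sentiment per narrative slot.
    Returns one unit index per slot, minimising summed target cost
    (sentiment mismatch = 1.0) plus weighted join cost (time gap)."""
    n_slots, n_units = len(structure), len(units)
    cost = [[math.inf] * n_units for _ in range(n_slots)]
    back = [[0] * n_units for _ in range(n_slots)]
    for j, u in enumerate(units):
        cost[0][j] = 0.0 if u['sentiment'] == structure[0] else 1.0
    for i in range(1, n_slots):
        for j, u in enumerate(units):
            target = 0.0 if u['sentiment'] == structure[i] else 1.0
            best, arg = math.inf, 0
            for k in range(n_units):
                join = w_join * abs(units[j]['time'] - units[k]['time'])
                c = cost[i - 1][k] + join
                if c < best:
                    best, arg = c, k
            cost[i][j] = best + target
            back[i][j] = arg
    # Trace back the cheapest path through the slots.
    j = min(range(n_units), key=lambda j: cost[-1][j])
    path = [j]
    for i in range(n_slots - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]
```

Rejecting a unit or changing the narrative target, as the editor allows, simply re-runs the search with that unit removed or with a new `structure`.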
Stochastic Suprasegmentals: Relationships between Redundancy, Prosodic Structure and Syllabic Duration
Within spontaneous speech there are wide variations in the articulation of the same word by the same speaker. This paper explores two related factors which influence variation in articulation: prosodic structure and redundancy. We argue that the constraint of producing robust communication while efficiently expending articulatory effort leads to an inverse relationship between language redundancy and care of articulation. The inverse relationship improves robustness by spreading the information more evenly across the speech signal, leading to a smoother signal redundancy profile. We argue that prosodic prominence is a linguistic means of achieving smooth signal redundancy. Prosodic prominence increases care of articulation and coincides with unpredictable sections of speech. By doing so, prosodic prominence leads to a smoother signal redundancy profile. Results confirm the strong relationship between prosodic prominence and care of articulation, as well as an inverse relationship between language redundancy and care of articulation. In addition, when variation in prosodic boundaries is controlled for, language redundancy can predict up to 65% of the variance in raw syllabic duration. This is comparable with the 64% predicted by prosodic prominence (accent, lexical stress and vowel type). Moreover, most (62%) of this predictive power is shared. This suggests that, in English, prosodic structure is the means by which constraints caused by a robust signal requirement are expressed in spontaneous speech.
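The shared-variance claim (65% for redundancy, 64% for prominence, 62% shared) is the standard commonality decomposition: variance explained jointly minus the unique contributions. A minimal sketch of that arithmetic using ordinary least squares (our illustration, not the paper's analysis code):

```python
import numpy as np

def r2(X, y):
    """R^2 of an ordinary least-squares fit of y on the columns of X,
    with an intercept added."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

def shared_variance(x1, x2, y):
    """Commonality partition for two predictors of y: R^2 of each
    alone, of both together, and the shared (overlapping) portion
    shared = R^2(x1) + R^2(x2) - R^2(x1, x2)."""
    r1, r2_, r12 = r2(x1, y), r2(x2, y), r2(np.column_stack([x1, x2]), y)
    return {'x1': r1, 'x2': r2_, 'both': r12, 'shared': r1 + r2_ - r12}
```

With the paper's figures, x1 = language redundancy (0.65), x2 = prosodic prominence (0.64), and a shared portion of 0.62 implies the joint model explains about 0.67 of the duration variance.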
The CereProc Blizzard Entry 2009: Some dumb algorithms that don't work
Within unit selection systems there is a constant tension between data sparsity and quality. This limits the control possible in a unit selection system. The RP data used in Blizzard this year and last year is expressive and spoken in a spirited manner. Last year's entry focused on maintaining expressiveness; this year we focused on two simple algorithms to restrain and control this prosodic variation: 1) variable-width valley-floor pruning on duration and pitch (applied to the full database entry, EH1), and 2) bulking of data with average HTS data (applied to the small database entry, EH2). Results for both techniques were disappointing. The full database system achieved an MOS of around 2 (compared to 4 for a similar system attempting to emphasise variation in 2008), while the small database entry also achieved an MOS of around 2 (compared to 3 for a similar system, but with a different voice, entered in 2007). Index Terms: speech synthesis, unit selection.
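The abstract does not specify valley-floor pruning in detail; one plausible reading is that candidate units whose duration or pitch falls outside a variable-width band around the database centre are pruned before selection. A sketch under that assumption (the MAD-based robust band and the function name are ours):

```python
import numpy as np

def valley_floor_prune(values, width=1.5):
    """Return a boolean mask keeping candidates whose value (e.g.
    unit duration or mean pitch) lies within `width` robust standard
    deviations (median absolute deviation scaled to sigma) of the
    median. `width` is the variable band half-width."""
    v = np.asarray(values, float)
    med = np.median(v)
    mad = np.median(np.abs(v - med)) * 1.4826  # MAD -> sigma for normal data
    if mad == 0:
        mad = 1e-9  # degenerate case: all values identical
    return np.abs(v - med) <= width * mad
```

Widening `width` trades prosodic control for candidate coverage, which is one way the sparsity/quality tension described above shows up in practice.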