Including Pitch Accent Optionality in Unit Selection Text-to-Speech Synthesis
A significant variability in pitch accent placement is found when comparing the patterns of prosodic prominence realized by different English speakers reading the same sentences. In this paper we describe a simple approach to incorporate this variability to synthesize prosodic prominence in unit selection text-to-speech synthesis. The main motivation of our approach is that by taking into account the variability of accent placements we enlarge the set of prosodically acceptable speech units, thus increasing the chances of selecting a good quality sequence of units, both in prosodic and segmental terms. Results on a large scale perceptual test show the benefits of our approach and indicate directions for further improvements. Index Terms: speech synthesis, unit selection, prosodic prominence, pitch accent
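The idea of enlarging the set of prosodically acceptable units can be illustrated with a minimal sketch. It is not the paper's implementation: the binary accent labels, the 0/1 target cost, the f0-based join cost, and the toy candidate units are all assumptions made for illustration. Each target position carries a set of acceptable accent labels, and a standard Viterbi search picks the cheapest unit sequence.

```python
def target_cost(unit, acceptable_accents):
    """0 if the unit's accent label is acceptable here, else 1 (assumed penalty)."""
    return 0.0 if unit["accent"] in acceptable_accents else 1.0


def join_cost(prev_unit, next_unit):
    """Toy concatenation cost: scaled f0 mismatch at the join (an assumption)."""
    return abs(prev_unit["f0"] - next_unit["f0"]) / 100.0


def select_units(acceptable, candidates):
    """Viterbi search over candidate units; returns (best path, total cost)."""
    # best[i][k] = (cost of the best path ending in candidates[i][k], backpointer)
    best = [[(target_cost(u, acceptable[0]), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            cost, prev = min(
                (best[i - 1][k][0] + join_cost(p, unit), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(unit, acceptable[i]), prev))
        best.append(row)

    # Backtrack from the cheapest final state.
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    total = best[-1][k][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][k])
        k = best[i][k][1]
    return path[::-1], total


# Hypothetical two-word utterance with toy candidate units.
candidates = [
    [{"id": "a", "accent": True, "f0": 120.0},
     {"id": "b", "accent": False, "f0": 100.0}],
    [{"id": "c", "accent": False, "f0": 100.0}],
]
strict = [{True}, {False}]            # a single fixed accent pattern
optional = [{True, False}, {False}]   # accent on the first word is optional

strict_path, strict_cost = select_units(strict, candidates)
optional_path, optional_cost = select_units(optional, candidates)
```

With the strict pattern the search is forced through the accented unit and pays the join mismatch. When the accent is treated as optional, the smoother unaccented candidate becomes admissible and the total cost drops.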
Assessing the adequate treatment of fast speech in unit selection systems for the visually impaired
Moers D, Wagner P. Assessing the adequate treatment of fast speech in unit selection systems for the visually impaired. In: Proceedings of the 6th ISCA Tutorial and Research Workshop on Speech Synthesis (SSW-6). 2007: 282-287. This paper describes work in progress concerning the adequate modeling of fast speech in unit selection speech synthesis systems, mostly having in mind blind and visually impaired users. Initially, a survey of the main phonetic characteristics of fast speech will be given. From this, certain conclusions concerning an adequate modeling of fast speech in unit selection synthesis will be drawn. Subsequently, a questionnaire assessing synthetic speech related preferences of visually impaired users will be presented. The last section deals with future experiments aiming at a definition of criteria for the development of synthesis corpora modeling fast speech within the unit selection paradigm.
Fast Speech in Unit Selection Speech Synthesis
Moers-Prinz D. Fast Speech in Unit Selection Speech Synthesis. Bielefeld: Universität Bielefeld; 2020. Speech synthesis is part of the everyday life of many people with severe visual disabilities. For those who are reliant on assistive speech technology the possibility to choose a fast speaking rate is reported to be essential. But also expressive speech synthesis and other spoken language interfaces may require an integration of fast speech. Architectures like formant or diphone synthesis are able to produce synthetic speech at fast speech rates, but the generated speech does not sound very natural. Unit selection synthesis systems, however, are capable of delivering more natural output. Nevertheless, fast speech has not been adequately implemented into such systems to date. Thus, the goal of the work presented here was to determine an optimal strategy for modeling fast speech in unit selection speech synthesis to provide potential users with a more natural sounding alternative for fast speech output.
Using same-language machine translation to create alternative target sequences for text-to-speech synthesis
Modern speech synthesis systems attempt to produce speech utterances from an open domain of words. In some situations, the synthesiser will not have the appropriate units to pronounce some words or phrases accurately but it still must attempt to pronounce them. This paper presents a hybrid machine translation and unit selection speech synthesis system. The machine translation system was trained with English as the source and target language. Rather than the synthesiser only saying the input text as would happen in conventional synthesis systems, the synthesiser may say an alternative utterance with the same meaning. This method allows the synthesiser to overcome the problem of insufficient units at runtime.
The Cerevoice Blizzard Entry 2007: Are Small Database Errors Worse than Compression Artifacts?
In commercial systems the memory footprint of unit selection systems is often a key issue. This is especially true for PDAs and other embedded devices. In this year's Blizzard entry CereProc® gave itself the criterion that the full database system entered would have a smaller memory footprint than either of the two smaller database entries. This was accomplished by applying Speex speech compression to the full database entry. In turn, a set of small database techniques used to improve the quality of small database systems in last year's entry were extended. Finally, for all systems, two quality control methods were applied to the underlying database to improve the lexicon and transcription match to the underlying data. Results suggest that mild audio quality artifacts introduced by lossy compression have almost as much impact on MOS perceived quality as concatenation errors introduced by sparse data in the smaller systems with bulked diphones. Index Terms: speech synthesis, unit selection.
In-Network View Synthesis for Interactive Multiview Video Systems
To enable interactive multiview video systems with a minimum view-switching
delay, multiple camera views are sent to the users, which are used as reference
images to synthesize additional virtual views via depth-image-based rendering.
In practice, bandwidth constraints may however restrict the number of reference
views sent to clients per time unit, which may in turn limit the quality of the
synthesized viewpoints. We argue that the reference view selection should
ideally be performed close to the users, and we study the problem of in-network
reference view synthesis such that the navigation quality is maximized at the
clients. We consider a distributed cloud network architecture where data stored
in a main cloud is delivered to end users with the help of cloudlets, i.e.,
resource-rich proxies close to the users. In order to satisfy last-hop
bandwidth constraints from the cloudlet to the users, a cloudlet re-samples
viewpoints of the 3D scene into a discrete set of views (combination of
received camera views and virtual views synthesized) to be used as reference
for the synthesis of additional virtual views at the client. This in-network
synthesis leads to better viewpoint sampling given a bandwidth constraint
compared to simple selection of camera views, but it may however carry a
distortion penalty in the cloudlet-synthesized reference views. We therefore
cast a new reference view selection problem where the best subset of views is
defined as the one minimizing the distortion over a view navigation window
defined by the user under some transmission bandwidth constraints. We show that
the view selection problem is NP-hard, and propose an effective polynomial time
algorithm using dynamic programming to solve the optimization problem.
Simulation results finally confirm the performance gain offered by virtual view
synthesis in the network
On the Computation of the Kullback-Leibler Measure for Spectral Distances
Efficient algorithms for the exact and approximate computation of the symmetrical Kullback-Leibler (1998) measure for spectral distances are presented for linear predictive coding (LPC) spectra. An interpretation of this measure is given in terms of the poles of the spectra. The performance of the algorithms in terms of accuracy and computational complexity is assessed for the application of computing concatenation costs in unit-selection-based speech synthesis. With the same complexity and storage requirements, the exact method is superior in terms of accuracy.
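For orientation, the quantity being computed can be approximated by brute force on a frequency grid. The sketch below is not one of the paper's exact or approximate algorithms. It assumes the common definition of the symmetrical Kullback-Leibler distance between two normalized power spectra P and Q, namely the average over frequency of (P - Q) log(P / Q), and it assumes stable all-pole LPC models with coefficient vectors of the form [1, a1, ..., ap]. The function names and example coefficients are hypothetical.

```python
import numpy as np


def lpc_power_spectrum(a, n_fft=512):
    """Power spectrum 1/|A(e^{jw})|^2 of an all-pole model, scaled to unit mean.

    `a` holds the LPC polynomial coefficients [1, a1, ..., ap]; the model is
    assumed stable, so A has no zeros on the unit circle.
    """
    A = np.fft.rfft(np.asarray(a, dtype=float), n_fft)
    p = 1.0 / np.abs(A) ** 2
    return p / p.mean()


def symmetric_kl(p, q):
    """Grid approximation of the symmetrical KL distance between two spectra."""
    return float(np.mean((p - q) * np.log(p / q)))


# Two toy first-order models (hypothetical coefficients).
p = lpc_power_spectrum([1.0, -0.9])
q = lpc_power_spectrum([1.0, -0.5])

d_pq = symmetric_kl(p, q)
d_qp = symmetric_kl(q, p)
d_pp = symmetric_kl(p, p)
```

Every term in the average is non-negative, so the distance is zero only when the two normalized spectra coincide on the grid, and swapping the arguments leaves the value unchanged up to rounding. In a unit selection setting, such a distance between the spectra on either side of a join could serve as a concatenation cost.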
Model-based synthesis of visual speech movements from 3D video
In this paper we describe a method for the synthesis of visual speech movements using a hybrid unit selection/model-based approach. Speech lip movements are captured using a 3D stereo face capture system, and split up into phonetic units. A dynamic parameterisation of this data is constructed which maintains the relationship between lip shapes and velocities; within this parameterisation a model of how lips move is built and is used in the animation of visual speech movements from speech audio input. The mapping from audio parameters to lip movements is disambiguated by selecting only the most similar stored phonetic units to the target utterance during synthesis. By combining properties of model-based synthesis (e.g. HMMs, neural nets) with unit selection we improve the quality of our speech synthesis
Adaptive speech synthesis module with emotional expression
Computer generated speech replaces the conventional text based interaction methods. Initially, speech synthesis generated human voice that lacked emotional expression. This kind of speech does not encourage users to interact with computers. Emotional speech synthesis is one of the challenges of speech synthesis research. The quality of emotional speech synthesis is judged by its intelligibility and similarity to natural speech.
High quality speech is achievable using the high computational cost unit selection technology. This technology relies on huge sets of recorded speech segments to achieve optimum quality. On the other hand, diphone synthesis technology uses fewer computational resources and less storage space. Its quality is lower than that of unit selection; however, due to the introduction of many digital signal processing algorithms such as the PSOLA algorithm, more natural results were achievable
- …