
    Acoustic and perceptual analysis of discontinuities in two TTS concatenation systems

    It is fair to say that L&H’s (now Scansoft’s) RealSpeak and AT&T’s NextGen are two of the most natural-sounding unit selection systems. The transitions between connected units sometimes contain discontinuities, which are one of the greatest problems with the output of such systems. The discontinuities are often perceived as ‘jumps’, i.e. as a disturbance. The analyses in this paper investigate the acoustic properties of the ‘jumps’, whether they are perceived as disturbing and, if so, how disturbing. The results show that the selection criteria do not include enough information on single acoustic parameters, such as formants. Since listeners perceive discontinuities in formants, especially F2, as disturbing, one of the conclusions is that the next step in developing these systems must be to include more information on these parameters separately (especially formants 2 and 3) to improve the selection process. Other measures, such as increasing the database size, structuring the data better, and improving grapheme-to-phoneme conversion, can also improve the selection process, but those aspects are not dealt with here.
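One way to picture the conclusion above is a join cost that scores each formant separately and weights F2 and F3 more heavily. The following is only a minimal sketch under stated assumptions: the function name `formant_join_cost`, the weights, and the formant values in Hz are hypothetical illustrations and are not taken from the paper or from either system.

```python
def formant_join_cost(left_formants, right_formants, weights=(1.0, 2.0, 2.0)):
    """Weighted sum of absolute formant jumps (in Hz) across a join.

    left_formants:  (F1, F2, F3) measured at the end of the left unit.
    right_formants: (F1, F2, F3) measured at the start of the right unit.
    weights:        per-formant weights; the values here are assumptions
                    chosen only to emphasise F2 and F3.
    """
    if not (len(left_formants) == len(right_formants) == len(weights)):
        raise ValueError("formant tuples and weights must have equal length")
    return sum(
        w * abs(left - right)
        for w, left, right in zip(weights, left_formants, right_formants)
    )


# Hypothetical joins: the second has a large F2 jump at the boundary.
smooth = formant_join_cost((500, 1500, 2500), (510, 1520, 2490))
jumpy = formant_join_cost((500, 1500, 2500), (510, 1900, 2490))
```

With these made-up values the join with the F2 jump receives the much higher cost, so a selection process that used such a term would prefer the smoother candidate.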

    ISCA Archive Definition of a Training Set for Unit Selection-Based Speech Synthesis

    The definition of cost terms in unit-selection-based synthesis is a difficult task. Usually, cost terms are based on the developers' common phonetic knowledge and on subsequent perceptual experiments. A dataset for supervised learning, an approach well known from pattern recognition, could be a useful way to arrive at a more formal analysis of the different factors influencing the selection of units. As a first step toward this aim, we present an objective distance measure, which is used to sort the units contained in the corpus in relation to a given natural unit, and show its relevance to human perception. To keep listeners from paying too much attention to discontinuities caused by concatenation, we also present a waveform-based smoothing algorithm. It is experimentally shown that the sorting criterion and human perception match in most cases. Furthermore, the results indicate that similarity between natural and synthetic speech is better if phoneme-based units are used, but naturalness increases with the concatenation of larger units.
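The sorting step described above can be sketched as ranking candidate units by their distance to a reference (natural) unit. This is a minimal sketch only: the abstract does not specify the distance measure or the features, so plain Euclidean distance over short, made-up feature vectors is used here as a stand-in, and the names `rank_units`, `unit_a`, `unit_b`, and `unit_c` are hypothetical.

```python
import math


def rank_units(reference, candidates):
    """Return candidate names sorted by distance to the reference unit.

    reference:  feature vector of the natural unit (list of floats).
    candidates: dict mapping a unit name to its feature vector.
    Euclidean distance is an assumption standing in for the paper's
    objective distance measure, which the abstract does not define.
    """
    return sorted(
        candidates,
        key=lambda name: math.dist(reference, candidates[name]),
    )


# Made-up feature vectors for illustration only.
reference = [1.0, 2.0, 3.0]
candidates = {
    "unit_a": [1.0, 2.0, 3.5],
    "unit_b": [4.0, 2.0, 3.0],
    "unit_c": [1.0, 2.1, 3.0],
}
ranking = rank_units(reference, candidates)
```

A ranking of this kind is what would then be compared against listener judgements to check whether the objective measure agrees with perception.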