
    Utilizing Statistical Dialogue Act Processing in Verbmobil

    In this paper, we present a statistical approach for dialogue act processing in the dialogue component of the speech-to-speech translation system Verbmobil. Statistics are used in dialogue processing to predict follow-up dialogue acts. As an application example, we show how this prediction supports repair when unexpected dialogue states occur. Comment: 6 pages; compressed and uuencoded PostScript file; to appear in ACL-9
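The abstract does not specify which statistical model predicts follow-up dialogue acts. As a rough illustration only, the sketch below assumes a simple bigram (relative-frequency) model over dialogue-act labels and flags an observed act as "unexpected" when its predicted probability falls below a threshold. The labels (GREET, SUGGEST, ACCEPT, ...), the function names, and the 0.05 threshold are hypothetical and are not taken from Verbmobil.

```python
from collections import Counter, defaultdict


def train_bigrams(dialogues):
    """Count dialogue-act bigrams over a list of dialogue-act sequences."""
    counts = defaultdict(Counter)
    for acts in dialogues:
        for prev, nxt in zip(acts, acts[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev, k=3):
    """Return up to k likely follow-up acts with relative-frequency estimates."""
    followers = counts.get(prev)  # .get does not insert into the defaultdict
    if not followers:
        return []
    total = sum(followers.values())
    return [(act, n / total) for act, n in followers.most_common(k)]


def is_unexpected(counts, prev, observed, threshold=0.05):
    """Hypothetical repair trigger: the observed act was predicted as unlikely."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    p = followers[observed] / total if total else 0.0
    return p < threshold
```

For example, after training on a few annotated dialogues, `predict_next(model, "SUGGEST")` ranks the acts that most often followed a suggestion, and `is_unexpected` could be used to decide when a repair strategy should take over.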

    A Robust and Efficient Three-Layered Dialogue Component for a Speech-to-Speech Translation System

    We present the dialogue component of the speech-to-speech translation system VERBMOBIL. In contrast to conventional dialogue systems, it mediates the dialogue while processing at most 50% of the dialogue in depth. Special requirements such as robustness and efficiency lead to a three-layered hybrid architecture for the dialogue module, using statistics, an automaton, and a planner. A dialogue memory is constructed incrementally. Comment: PostScript file, compressed and uuencoded, 15 pages, to appear in Proceedings of EACL-95, Dublin

    On the use of voice descriptors for glottal source shape parameter estimation

    This paper summarizes the results of our investigations into estimating the shape of the glottal excitation source from speech signals. We employ the Liljencrants-Fant (LF) model, which describes the glottal flow and its derivative. The one-dimensional glottal source shape parameter Rd describes the transition in voice quality from a tense to a breathy voice. The parameter Rd has been derived from a statistical regression of the R waveshape parameters that parameterize the LF model. First, we introduce a variant of our recently proposed adaptation and range extension of the Rd parameter regression. Second, we discuss in detail the aspects of estimating the glottal source shape parameter Rd using the phase minimization paradigm. Based on the analysis of a large number of speech signals, we describe the major conditions that are likely to result in erroneous Rd estimates. Based on these findings, we investigate means to increase the robustness of Rd parameter estimation. We use Viterbi smoothing to suppress unnatural jumps of the estimated Rd parameter contours within short time segments. Additionally, we propose to steer the Viterbi algorithm by exploiting the covariation of other voice descriptors to improve Viterbi smoothing. The novel Viterbi steering is based on a Gaussian Mixture Model (GMM) that represents the joint density of the voice descriptors and the Open Quotient (OQ) estimated from corresponding electroglottographic (EGG) signals. A conversion function derived from the mixture model predicts OQ from the voice descriptors. Converted to Rd, the predicted OQ defines an additional prior probability that adapts the partial probabilities of the Viterbi algorithm accordingly.
Finally, we evaluate the performance of the phase-minimization-based methods, using both variants to adapt and extend the Rd regression, on one synthetic test set, as well as in combination with Viterbi smoothing and each variant of the novel Viterbi steering on one test set of natural speech. The experimental findings show improvements for both Viterbi approaches.
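The Viterbi smoothing step can be illustrated with a dynamic-programming sketch over a grid of candidate Rd values. Everything here is an assumption for illustration: the per-frame cost matrix stands in for whatever phase-minimization error the authors use, the transition penalty is a hypothetical linear cost on the jump size |Rd_t - Rd_(t-1)|, and the optional `rd_prior` term is only a crude stand-in for the GMM-based steering (a squared-distance penalty to a per-frame Rd prediction), not the paper's probabilistic formulation.

```python
import numpy as np


def viterbi_smooth(rd_grid, local_cost, jump_weight=1.0, rd_prior=None, prior_weight=0.0):
    """Minimum-cost Rd contour over a candidate grid (hypothetical cost model).

    rd_grid:     (K,) candidate Rd values.
    local_cost:  (T, K) per-frame cost of each candidate (stand-in for an estimation error).
    jump_weight: penalty per unit of |Rd_t - Rd_(t-1)|; 0 disables smoothing.
    rd_prior:    optional (T,) per-frame Rd prediction used as an extra penalty term.
    """
    rd_grid = np.asarray(rd_grid, dtype=float)
    cost = np.asarray(local_cost, dtype=float).copy()
    T, K = cost.shape
    if rd_prior is not None:
        # Squared distance of every candidate to the per-frame prior, shape (T, K).
        prior = np.asarray(rd_prior, dtype=float)
        cost += prior_weight * (rd_grid[None, :] - prior[:, None]) ** 2
    # trans[i, j]: cost of moving from candidate i to candidate j.
    trans = jump_weight * np.abs(rd_grid[:, None] - rd_grid[None, :])
    acc = cost[0].copy()                  # best accumulated cost ending in each candidate
    back = np.zeros((T, K), dtype=int)    # backpointers to the previous frame
    for t in range(1, T):
        total = acc[:, None] + trans      # (previous candidate, current candidate)
        back[t] = np.argmin(total, axis=0)
        acc = total[back[t], np.arange(K)] + cost[t]
    # Backtrack from the cheapest final candidate.
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmin(acc))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return rd_grid[path]
```

With `jump_weight=0` the result reduces to the frame-wise minimum; raising it trades fit to the local costs for a smoother contour, which is the effect the abstract attributes to Viterbi smoothing.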

    Modeling of Speech Parameter Sequence Considering Global Variance for HMM-Based Speech Synthesis

    Speech technologies such as speech recognition and speech synthesis have many potential applications, since speech is the main way in which most people communicate. Various linguistic sounds are produced by controlling the configuration of the oral cavities to convey a message in speech communication. The produced speech sounds temporally vary and ar…
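The title refers to the global variance (GV) of a speech parameter sequence, commonly defined per dimension as the variance of the parameter trajectory over the whole utterance. The sketch below only computes that quantity and applies a simple variance-scaling post-filter; this is a swapped-in simplification for illustration, not the likelihood-based parameter generation the paper presumably proposes (its abstract is truncated above). The function names are hypothetical.

```python
import numpy as np


def global_variance(c):
    """Per-dimension GV of a (T, D) parameter sequence: (1/T) * sum_t (c_t - mean)^2."""
    c = np.asarray(c, dtype=float)
    return c.var(axis=0)  # np.var uses ddof=0, i.e. division by T


def scale_to_gv(c, target_gv):
    """Rescale each dimension around its mean so that its GV equals target_gv.

    This simple post-filter only illustrates how an over-smoothed trajectory
    (GV too small) can be widened; constant dimensions stay constant.
    """
    c = np.asarray(c, dtype=float)
    mean = c.mean(axis=0)
    gv = global_variance(c)
    # Guard against division by zero for constant dimensions.
    scale = np.sqrt(np.asarray(target_gv, dtype=float) / np.maximum(gv, 1e-12))
    return mean + (c - mean) * scale
```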

    Imitation/self-imitation in computer-assisted prosody training for Chinese learners of L2 Italian.

    Recent studies on L2 acquisition, speech synthesis, and automatic identification of foreign accents argue for a major role of prosody in the perception of non-native speech. Research on the relationship between pronunciation improvement and the similarity of students' and teachers' voices has also shown that the better the match between the learners' and native speakers' voices in terms of f0 and articulation rate, the more positive the impact on pronunciation training. This study investigates the effects of imitation and self-imitation on the acquisition of L2 suprasegmental patterns. Degree of foreign accent, improvements in intelligibility, and effectiveness of communication were measured to determine the success of each technique. For this purpose, a prosodic transplantation technique and a computer-assisted learning methodology were used.

    Interactive speech understanding

    An Efficient Unit-Selection Method for Concatenative Text-to-Speech Synthesis Systems

    This paper presents a method for selecting speech units for polyphone concatenative speech synthesis, in which simplifying the procedures for search paths in a graph speeds up unit selection with minimal effect on speech quality. The selected speech units are still optimal; only the costs of merging the units, on which the selection is based, are determined less accurately. Owing to its low processing-power and memory-footprint requirements, the method is suitable for use in embedded speech synthesizers.
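For context, a conventional unit-selection search can be written as a Viterbi-style dynamic program over a lattice of candidate units, combining a target cost for each unit with a join (concatenation) cost between consecutive units. The sketch below shows that baseline only, not the paper's accelerated search-path simplification, which the abstract does not detail. The function name and cost callbacks are hypothetical, and every position is assumed to have at least one candidate.

```python
def select_units(candidates, target_cost, join_cost):
    """Baseline unit selection by dynamic programming.

    candidates:  list over target positions, each a non-empty list of unit ids.
    target_cost: callable (position, unit) -> float.
    join_cost:   callable (previous_unit, unit) -> float.
    Returns (total_cost, selected_units).
    """
    # Each lattice column holds (accumulated cost, backpointer into previous column).
    columns = [[(target_cost(0, u), -1) for u in candidates[0]]]
    for pos in range(1, len(candidates)):
        column = []
        for u in candidates[pos]:
            # Cheapest way to reach unit u from any unit at the previous position.
            best_cost, best_j = min(
                (columns[-1][j][0] + join_cost(prev_u, u), j)
                for j, prev_u in enumerate(candidates[pos - 1])
            )
            column.append((best_cost + target_cost(pos, u), best_j))
        columns.append(column)
    # Backtrack from the cheapest unit at the last position.
    last = columns[-1]
    i = min(range(len(last)), key=lambda k: last[k][0])
    total = last[i][0]
    path = []
    for pos in range(len(candidates) - 1, -1, -1):
        path.append(candidates[pos][i])
        i = columns[pos][i][1]
    return total, path[::-1]
```

The cost of this baseline grows with the product of candidate counts at adjacent positions, because every join cost is evaluated; that is the kind of expense a simplified search or cheaper join-cost estimate, as the abstract describes, would aim to reduce.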