5,117 research outputs found
Pauses and the temporal structure of speech
Natural-sounding speech synthesis requires close control over the temporal structure of the speech flow. This includes a full predictive scheme for the durational structure and in particuliar the prolongation of final syllables of lexemes as well as for the pausal structure in the utterance. In this chapter, a description of the temporal structure and the summary of the numerous factors that modify it are presented. In the second part, predictive schemes for the temporal structure of speech ("performance structures") are introduced, and their potential for characterising the overall prosodic structure of speech is demonstrated
Production and perception of speaker-specific phonetic detail at word boundaries
Experiments show that learning about familiar voices affects speech processing in many tasks. However, most studies focus on isolated phonemes or words and do not explore which phonetic properties are learned about or retained in memory. This work investigated inter-speaker phonetic variation involving word boundaries, and its perceptual consequences. A production experiment found significant variation in the extent to which speakers used a number of acoustic properties to distinguish junctural minimal pairs e.g. 'So he diced them'—'So he'd iced them'. A perception experiment then tested intelligibility in noise of the junctural minimal pairs before and after familiarisation with a particular voice. Subjects who heard the same voice during testing as during the familiarisation period showed significantly more improvement in identification of words and syllable constituents around word boundaries than those who heard different voices. These data support the view that perceptual learning about the particular pronunciations associated with individual speakers helps listeners to identify syllabic structure and the location of word boundaries
Improving parsing of spontaneous speech with the help of prosodic boundaries
Parsing can be improved in automatic speech understanding if prosodic boundary marking is taken into account, because syntactic boundaries are often marked by prosodic means. Because large databases are needed for the training of statistical models for prosodic boundaries, we developed a labeling scheme for syntactic-prosodic boundaries within the German VERBMOBIL project (automatic speech-to-speech translation). We compare the results of classifiers (multi-layer perceptrons and language models) trained on these syntactic-prosodic boundary labels with classifiers trained on perceptual-prosodic and purely syntactic labels. Recognition rates of up to 96% were achieved. The turns that we need to parse consist of 20 words on the average and frequently contain sequences of partial sentence equivalents due to restarts, ellipsis, etc. For this material, the boundary scores computed by our classifiers can successfully be integrated into the syntactic parsing of word graphs; currently, they improve the parse time by 92% and reduce the number of parse trees by 96%. This is achieved by introducing a special Prosodic Syntactic Clause Boundary symbol (PSCB) into our grammar and guiding the search for the best word chain with the prosodic boundary scores
From Monologue to Dialogue: Natural Language Generation in OVIS
This paper describes how a language generation system that was originally designed for monologue generation, has been adapted for use in the OVIS spoken dialogue system. To meet the requirement that in a dialogue, the system's utterances should make up a single, coherent dialogue turn, several modifications had to be made to the system. The paper also discusses the influence of dialogue context on information status, and its consequences for the generation of referring expressions and accentuation
Prosody-Based Automatic Segmentation of Speech into Sentences and Topics
A crucial step in processing speech audio data for information extraction,
topic detection, or browsing/playback is to segment the input into sentence and
topic units. Speech segmentation is challenging, since the cues typically
present for segmenting text (headers, paragraphs, punctuation) are absent in
spoken language. We investigate the use of prosody (information gleaned from
the timing and melody of speech) for these tasks. Using decision tree and
hidden Markov modeling techniques, we combine prosodic cues with word-based
approaches, and evaluate performance on two speech corpora, Broadcast News and
Switchboard. Results show that the prosodic model alone performs on par with,
or better than, word-based statistical language models -- for both true and
automatically recognized words in news speech. The prosodic model achieves
comparable performance with significantly less training data, and requires no
hand-labeling of prosodic events. Across tasks and corpora, we obtain a
significant improvement over word-only models using a probabilistic combination
of prosodic and lexical information. Inspection reveals that the prosodic
models capture language-independent boundary indicators described in the
literature. Finally, cue usage is task and corpus dependent. For example, pause
and pitch features are highly informative for segmenting news speech, whereas
pause, duration and word-based cues dominate for natural conversation.Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2),
Special Issue on Accessing Information in Spoken Audio, September 200
Improving parsing by incorporating "prosodic clause boundaries" into a grammar
In written language, punctuation is used to separate main and subordinate clause. In spoken language, ambiguities arise due to missing punctuation, but clause boundaries are often marked prosodically and can be used instead. We detect PCBs (Prosodically markedClauseBoundaries) by using prosodic features (duration, intonation, energy, and pause information) with a neural network, achieving a recognition rate of 82%. PCBs are integrated into our grammar using a special syntactic category "break" that can be used in the phrase-structure rules of the grammar in a similar way as punctuation is used in grammars for written language. Whereas punctuation in most cases is obligatory, PCBs are sometimes optional. Moreover, they can in principle occur everywhere in the sentence due e.g. to hesitations or misrecognition. To cope with these problems we tested two different approaches: A slightly modified parser for word chains containing PCBs and a word graph parser that takes the probabilities of PCBs into account. Tests were conducted on a subset of infinitive subordinate clauses from a large speech database containing sentences from the domain of train table inquiries. The average number of syntactic derivations could be reduced by about 70 % even when working on recognized word graphs
Prosodic modules for speech recognition and understanding in VERBMOBIL
Within VERBMOBIL, a large project on spoken language research in Germany, two modules for detecting and recognizing prosodic events have been developed. One module operates on speech signal parameters and the word hypothesis graph, whereas the other module, designed for a novel, highly interactive architecture, only uses speech signal parameters as its input. Phrase boundaries, sentence modality, and accents are detected. The recognition rates in spontaneous dialogs are for accents up to 82,5%, for phrase boundaries up to 91,7%
- …