
    Prosody-Based Automatic Segmentation of Speech into Sentences and Topics

    A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs on par with, or better than, word-based statistical language models -- for both true and automatically recognized words in news speech. The prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue usage is task and corpus dependent. For example, pause and pitch features are highly informative for segmenting news speech, whereas pause, duration, and word-based cues dominate for natural conversation.
    Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2), Special Issue on Accessing Information in Spoken Audio, September 200
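The "probabilistic combination of prosodic and lexical information" can be sketched as a log-linear interpolation of per-boundary posteriors from the two models. This is a simplified illustration, not the paper's actual HMM-based combination; the function name and the fixed interpolation weight are assumptions:

```python
import math

def combine_boundary_posteriors(p_prosody, p_lexical, weight=0.5):
    """Log-linearly interpolate two models' boundary posteriors.

    p_prosody: P(boundary | prosodic features) from, e.g., a decision tree.
    p_lexical: P(boundary | word context) from a statistical language model.
    weight:    balance between the two models (0 = lexical only, 1 = prosody only).

    The interpolated scores for {boundary, no-boundary} are renormalized
    so the result is again a probability.
    """
    def loglin(pa, pb):
        return math.exp(weight * math.log(pa) + (1 - weight) * math.log(pb))

    score_boundary = loglin(p_prosody, p_lexical)
    score_none = loglin(1 - p_prosody, 1 - p_lexical)
    return score_boundary / (score_boundary + score_none)
```

When both models agree, the combined posterior matches them; when they disagree, the weight controls which model dominates the segmentation decision.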

    Speaker emotion can affect ambiguity production

    Does speaker emotion affect the degree of ambiguity in referring expressions? We used referential communication tasks preceded by mood induction to examine whether positive emotional valence may be linked to ambiguity of referring expressions. In Experiment 1, participants had to identify sequences of objects with homophonic labels (e.g., the animal bat, a baseball bat) for hypothetical addressees. This required modification of the homophones. Happy speakers were less likely to modify the second homophone to repair a temporary ambiguity (i.e., they were less likely to say … first cover the bat, then cover the baseball bat …). In Experiment 2, participants had to identify one of two identical objects in an object array, which required a modifying relative clause (the shark that's underneath the shoe). Happy speakers omitted the modifying relative clause twice as often as neutral speakers (e.g., by saying Put the shark underneath the sheep), thereby rendering the entire utterance ambiguous in the context of two sharks. The findings suggest that one consequence of positive mood is increased ambiguity in speech. This effect is hypothesised to reflect a less effortful processing style favouring an egocentric bias, which impacts perspective taking or the monitoring of how well utterances align with an addressee's perspective.

    BridgeNets: Student-Teacher Transfer Learning Based on Recursive Neural Networks and its Application to Distant Speech Recognition

    Despite the remarkable progress achieved in automatic speech recognition, recognizing far-field speech mixed with various noise sources is still a challenging task. In this paper, we introduce BridgeNet, a novel student-teacher transfer learning framework that improves distant speech recognition. BridgeNet has two key features. First, it extends traditional student-teacher frameworks by providing multiple hints from a teacher network: hints are not limited to the teacher's soft labels, since the teacher's intermediate feature representations can better guide a student network to learn how to denoise or dereverberate noisy input. Second, the proposed recursive architecture in BridgeNet can iteratively improve denoising and recognition performance. Experimental results show significant improvements on the distant speech recognition problem, with up to a 13.24% relative WER reduction on the AMI corpus compared to a baseline neural network without the teacher's hints.
    Comment: Accepted to 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018)
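The multiple-hint idea can be sketched as a training loss combining a hard-label cross-entropy term, a temperature-softened distillation term toward the teacher's outputs, and an MSE "hint" term on intermediate features. This is a toy, framework-free illustration of the general technique, not BridgeNet's actual implementation; the function name, weights, and temperature are assumptions:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with optional temperature softening."""
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def multi_hint_loss(student_logits, teacher_logits,
                    student_feat, teacher_feat,
                    hard_label, temperature=2.0, alpha=0.5, beta=0.1):
    """Toy multi-hint distillation loss for one example.

    - cross-entropy against the hard label,
    - KL divergence to the teacher's temperature-softened outputs
      (scaled by T^2, the usual distillation convention),
    - MSE between student and teacher intermediate features (the 'hint').
    """
    p_student = softmax(student_logits)
    ce = -math.log(p_student[hard_label])

    p_teacher_soft = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher_soft, p_student_soft))

    mse = sum((a - b) ** 2
              for a, b in zip(student_feat, teacher_feat)) / len(student_feat)

    return ce + alpha * (temperature ** 2) * kl + beta * mse
```

When the student's outputs and features match the teacher's, the distillation and hint terms vanish and only the hard-label cross-entropy remains; a mismatched feature hint strictly increases the loss, pushing the student toward the teacher's denoised representations.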