17,893 research outputs found
Prosody-Based Automatic Segmentation of Speech into Sentences and Topics
A crucial step in processing speech audio data for information extraction,
topic detection, or browsing/playback is to segment the input into sentence and
topic units. Speech segmentation is challenging, since the cues typically
present for segmenting text (headers, paragraphs, punctuation) are absent in
spoken language. We investigate the use of prosody (information gleaned from
the timing and melody of speech) for these tasks. Using decision tree and
hidden Markov modeling techniques, we combine prosodic cues with word-based
approaches, and evaluate performance on two speech corpora, Broadcast News and
Switchboard. Results show that the prosodic model alone performs on par with,
or better than, word-based statistical language models -- for both true and
automatically recognized words in news speech. The prosodic model achieves
comparable performance with significantly less training data, and requires no
hand-labeling of prosodic events. Across tasks and corpora, we obtain a
significant improvement over word-only models using a probabilistic combination
of prosodic and lexical information. Inspection reveals that the prosodic
models capture language-independent boundary indicators described in the
literature. Finally, cue usage is task and corpus dependent. For example, pause
and pitch features are highly informative for segmenting news speech, whereas
pause, duration and word-based cues dominate for natural conversation.Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2),
Special Issue on Accessing Information in Spoken Audio, September 200
Lipreading with Long Short-Term Memory
Lipreading, i.e. speech recognition from visual-only recordings of a
speaker's face, can be achieved with a processing pipeline based solely on
neural networks, yielding significantly better accuracy than conventional
methods. Feed-forward and recurrent neural network layers (namely Long
Short-Term Memory; LSTM) are stacked to form a single structure which is
trained by back-propagating error gradients through all the layers. The
performance of such a stacked network was experimentally evaluated and compared
to a standard Support Vector Machine classifier using conventional computer
vision features (Eigenlips and Histograms of Oriented Gradients). The
evaluation was performed on data from 19 speakers of the publicly available
GRID corpus. With 51 different words to classify, we report a best word
accuracy on held-out evaluation speakers of 79.6% using the end-to-end neural
network-based solution (11.6% improvement over the best feature-based solution
evaluated).Comment: Accepted for publication at ICASSP 201
Detecting User Engagement in Everyday Conversations
This paper presents a novel application of speech emotion recognition:
estimation of the level of conversational engagement between users of a voice
communication system. We begin by using machine learning techniques, such as
the support vector machine (SVM), to classify users' emotions as expressed in
individual utterances. However, this alone fails to model the temporal and
interactive aspects of conversational engagement. We therefore propose the use
of a multilevel structure based on coupled hidden Markov models (HMM) to
estimate engagement levels in continuous natural speech. The first level is
comprised of SVM-based classifiers that recognize emotional states, which could
be (e.g.) discrete emotion types or arousal/valence levels. A high-level HMM
then uses these emotional states as input, estimating users' engagement in
conversation by decoding the internal states of the HMM. We report experimental
results obtained by applying our algorithms to the LDC Emotional Prosody and
CallFriend speech corpora.Comment: 4 pages (A4), 1 figure (EPS
Context-Dependent Acoustic Modeling without Explicit Phone Clustering
Phoneme-based acoustic modeling of large vocabulary automatic speech
recognition takes advantage of phoneme context. The large number of
context-dependent (CD) phonemes and their highly varying statistics require
tying or smoothing to enable robust training. Usually, Classification and
Regression Trees are used for phonetic clustering, which is standard in Hidden
Markov Model (HMM)-based systems. However, this solution introduces a secondary
training objective and does not allow for end-to-end training. In this work, we
address a direct phonetic context modeling for the hybrid Deep Neural Network
(DNN)/HMM, that does not build on any phone clustering algorithm for the
determination of the HMM state inventory. By performing different
decompositions of the joint probability of the center phoneme state and its
left and right contexts, we obtain a factorized network consisting of different
components, trained jointly. Moreover, the representation of the phonetic
context for the network relies on phoneme embeddings. The recognition accuracy
of our proposed models on the Switchboard task is comparable and outperforms
slightly the hybrid model using the standard state-tying decision trees.Comment: Submitted to Interspeech 202
- …