A Bayesian Network View on Acoustic Model-Based Techniques for Robust Speech Recognition
This article provides a unifying Bayesian network view on various approaches
for acoustic model adaptation, missing feature, and uncertainty decoding that
are well-known in the literature of robust automatic speech recognition. The
representatives of these classes can often be deduced from a Bayesian network
that extends the conventional hidden Markov models used in speech recognition.
These extensions, in turn, can in many cases be motivated from an underlying
observation model that relates clean and distorted feature vectors. By
converting the observation models into a Bayesian network representation, we
formulate the corresponding compensation rules leading to a unified view on
known derivations as well as to new formulations for certain approaches. The
generic Bayesian perspective provided in this contribution thus highlights
structural differences and similarities between the analyzed approaches.
Filler model based confidence measures for spoken dialogue systems: a case study for Turkish
Because of the inadequate performance of speech recognition systems, an accurate confidence scoring mechanism should be employed to interpret user requests correctly. To determine a confidence score for a hypothesis, several confidence features are combined. The performance of filler-model-based confidence features has been investigated. Five types of filler model networks were defined: triphone network, phone network, phone-class network, 5-state catch-all model, and 3-state catch-all model. First, all models were evaluated in a Turkish speech recognition task in terms of their ability to correctly tag recognition hypotheses (as correct or as recognition errors). The best performance was obtained from the triphone recognition network. Then, the performance of reliable combinations of these models was investigated, and it was observed that certain combinations of filler models could significantly improve the accuracy of the confidence annotation.
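The combination step described in the abstract can be sketched as a weighted combination of per-hypothesis confidence features followed by a threshold decision. This is a minimal illustrative sketch, not the paper's method: the feature names, weights, and threshold below are hypothetical.

```python
# Hypothetical sketch: combine filler-model confidence features into a single
# score via a weighted linear combination, then tag the hypothesis.
# Feature names, weights, and the 0.5 threshold are illustrative assumptions.

def combine_confidence(features, weights):
    """Weighted linear combination of per-hypothesis confidence features."""
    assert features.keys() == weights.keys()
    return sum(weights[k] * features[k] for k in features)

def tag_hypothesis(score, threshold=0.5):
    """Tag a recognition hypothesis as correct or as a recognition error."""
    return "correct" if score >= threshold else "recognition-error"

# Illustrative scores from three filler-model networks for one hypothesis.
feats = {"triphone": 0.82, "phone": 0.74, "phone_class": 0.61}
wts = {"triphone": 0.5, "phone": 0.3, "phone_class": 0.2}
score = combine_confidence(feats, wts)
print(round(score, 3), tag_hypothesis(score))  # → 0.754 correct
```

In practice such weights would be trained (e.g. by logistic regression) on held-out hypotheses labeled correct/incorrect, rather than set by hand.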
Blind Normalization of Speech From Different Channels
We show how to construct a channel-independent representation of speech that
has propagated through a noisy reverberant channel. This is done by blindly
rescaling the cepstral time series by a non-linear function, with the form of
this scale function being determined by previously encountered cepstra from
that channel. The rescaled form of the time series is an invariant property of
it in the following sense: it is unaffected if the time series is transformed
by any time-independent invertible distortion. Because a linear channel with
stationary noise and impulse response transforms cepstra in this way, the new
technique can be used to remove the channel dependence of a cepstral time
series. In experiments, the method achieved greater channel-independence than
cepstral mean normalization, and it was comparable to the combination of
cepstral mean normalization and spectral subtraction, despite the fact that no
measurements of channel noise or reverberations were required (unlike spectral
subtraction).
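The invariance property claimed above — that the rescaled series is unaffected by any time-independent invertible distortion — can be illustrated with a rank-based (histogram-equalization-style) rescaling. This is an illustrative sketch under that assumption, not the paper's exact algorithm: mapping each cepstral value to its empirical quantile among previously encountered frames cancels any monotone invertible channel map.

```python
import numpy as np

# Illustrative sketch (not the paper's algorithm): rescale a cepstral
# coefficient's time series by its empirical CDF, estimated from previously
# encountered frames of the same channel. Any monotone invertible,
# time-independent distortion leaves the result unchanged.

def blind_rescale(series, reference):
    """Map each value to its empirical quantile among `reference` frames."""
    ref = np.sort(np.asarray(reference))
    # Fraction of reference frames below each value: an empirical CDF.
    return np.searchsorted(ref, series) / len(ref)

rng = np.random.default_rng(0)
clean = rng.normal(size=1000)        # stand-in for one cepstral coefficient

def distort(x):
    return 2.0 * x + 0.5             # a monotone invertible channel map

noisy = distort(clean)

# Rescaling each series against its own history gives the same invariant
# representation, i.e. the channel dependence is removed.
a = blind_rescale(clean, clean)
b = blind_rescale(noisy, noisy)
print(np.allclose(a, b))             # → True
```

Cepstral mean normalization, by contrast, removes only an additive offset; the rank transform above is one simple way to see how a nonlinear, data-driven rescaling can absorb a broader class of distortions.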