Out-of-vocabulary spoken term detection
Spoken term detection (STD) is a fundamental task for multimedia information
retrieval. A major challenge faced by an STD system is the severe performance degradation
when detecting out-of-vocabulary (OOV) terms. The difficulties arise not only
from the absence of pronunciations for such terms in the system dictionaries, but also from
the intrinsic uncertainty of their pronunciations, the significant diversity in term properties,
and the relative weakness of acoustic and language modelling for such terms.
To tackle the OOV issue, we first apply the joint-multigram model to predict pronunciations
for OOV terms stochastically. Building on this, we propose a stochastic
pronunciation model that considers all possible pronunciations of an OOV term, so that
the high pronunciation uncertainty is compensated for.
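The idea of stochastic pronunciation prediction can be illustrated with a toy sketch. The multigram inventory and its probabilities below are purely illustrative (a real system would learn thousands of grapheme-phone multigrams from a dictionary); the sketch only shows how segmenting a word into known grapheme chunks yields multiple pronunciations, each with a probability.

```python
# Toy joint-multigram inventory: each entry pairs a grapheme chunk with
# possible phone strings and probabilities (values are illustrative only).
MULTIGRAMS = {
    "ph": [("f", 0.9), ("p h", 0.1)],
    "o":  [("ow", 0.7), ("aa", 0.3)],
    "n":  [("n", 1.0)],
    "e":  [("", 0.8), ("iy", 0.2)],   # silent or pronounced final "e"
}

def pronunciations(word, nbest=3):
    """Enumerate stochastic pronunciations of `word` by segmenting it into
    known grapheme chunks and multiplying the multigram probabilities."""
    results = []

    def expand(pos, phones, prob):
        if pos == len(word):
            results.append((" ".join(p for p in phones if p), prob))
            return
        # Try every grapheme chunk that matches at this position.
        for chunk, options in MULTIGRAMS.items():
            if word.startswith(chunk, pos):
                for phone, p in options:
                    expand(pos + len(chunk), phones + [phone], prob * p)

    expand(0, [], 1.0)
    results.sort(key=lambda x: -x[1])
    return results[:nbest]
```

Rather than committing to the single best pronunciation, the stochastic pronunciation model can search the detection lattice with each of these weighted variants, which is what compensates for the pronunciation uncertainty of OOV terms.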
Furthermore, to deal with the diversity in term properties, we propose a term-dependent
discriminative decision strategy that employs discriminative models to
integrate multiple informative factors and confidence measures into a single classification
probability, yielding a minimum-cost decision.
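A minimal sketch of such a decision strategy is shown below. The feature weights and cost values are illustrative placeholders, not the thesis's trained models; the point is that a discriminative model fuses several detection features into one probability, and a Bayes rule then accepts a detection only when the expected cost of accepting is lower than that of rejecting.

```python
import math

def classification_prob(features, weights, bias):
    """Logistic model that fuses multiple detection features (e.g. acoustic
    confidence, term length, occurrence rate) into one classification
    probability. Weights and bias here are illustrative, not trained."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def decide(p, c_miss=1.0, c_fa=0.1):
    """Minimum-expected-cost decision:
    expected cost of accepting = (1 - p) * c_fa  (possible false alarm),
    expected cost of rejecting = p * c_miss      (possible miss).
    Accept when accepting is the cheaper option."""
    return (1.0 - p) * c_fa < p * c_miss
```

Because the features can include term-dependent factors, the resulting decision threshold effectively varies from term to term, unlike a single global confidence threshold.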
In addition, to address the weakness of acoustic and language modelling, we propose
a direct posterior confidence measure, which replaces the generative models with
a discriminative model, such as a multi-layer perceptron (MLP), to obtain robust
confidence estimates for OOV term detection.
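The following sketch illustrates the shape of a direct posterior confidence measure, under the assumption (made for illustration only) that an MLP emits per-frame phone logits and the confidence is the geometric mean of the posteriors of the hypothesised phones along the detected term.

```python
import math

def frame_posteriors(logits):
    """Softmax over one frame of MLP outputs, giving phone posteriors."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def term_confidence(frames, path):
    """Direct posterior confidence for a detected term: geometric mean of
    the posterior of the hypothesised phone at each frame.
    `frames` is a list of per-frame logit vectors; `path` gives the
    hypothesised phone index for each frame."""
    log_post = 0.0
    for logits, phone in zip(frames, path):
        log_post += math.log(frame_posteriors(logits)[phone])
    return math.exp(log_post / len(path))
```

Because the posteriors come directly from a discriminative model rather than from generative acoustic and language model scores, the confidence is less sensitive to modelling weaknesses for rare and unseen terms.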
With these novel techniques, the STD performance on OOV terms improved
substantially and significantly in our experiments on meeting speech data.
Automatic determination of sub-word units for automatic speech recognition
Current automatic speech recognition (ASR) research is focused on recognition of continuous,
spontaneous speech. Spontaneous speech contains considerable variability in the
way words are pronounced, and the canonical pronunciation of each word does not reflect
the variation seen in real data.
Two of the components of an ASR system are acoustic models and pronunciation
models. The variation within spontaneous speech must be accounted for by these
components. Phones, or context-dependent phones, are typically used as the base sub-word
unit, and one acoustic model is trained for each sub-word unit. Pronunciation
modelling largely takes place in a dictionary, which relates words to sequences of phones.
Acoustic modelling and pronunciation modelling overlap, and the two are not clearly
separable in modelling pronunciation variation. Techniques that find pronunciation
variants in the data and then reflect them in the dictionary have not delivered the
expected gains in recognition.
An alternative approach to modelling pronunciations in terms of phones is to derive
units automatically: using data-driven methods to determine an inventory of sub-word
units, their acoustic models, and their relationship to words. This thesis presents a
method for the automatic derivation of a sub-word unit inventory, whose main components
are
1. automatic and simultaneous generation of a sub-word unit inventory and acoustic
model set, using an ergodic hidden Markov model whose complexity is controlled
by the Bayesian Information Criterion;
2. automatic generation of probabilistic dictionaries using joint multigrams.
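The complexity-control step can be sketched as follows. This is a minimal illustration of BIC-based model selection, assuming (for illustration) that each candidate HMM is summarised by its training log-likelihood and parameter count; the candidate values below are invented, not results from the thesis.

```python
import math

def bic(log_likelihood, n_params, n_frames):
    """Bayesian Information Criterion: log-likelihood penalised by model
    size, used here to control the complexity (number of units/states)
    of an ergodic HMM."""
    return log_likelihood - 0.5 * n_params * math.log(n_frames)

def select_model(candidates, n_frames):
    """Pick the (log_likelihood, n_params) candidate with the highest BIC."""
    return max(candidates, key=lambda c: bic(c[0], c[1], n_frames))
```

The penalty term grows with both the parameter count and the amount of data, so simply adding units stops paying off once the likelihood gain no longer justifies the extra parameters; this is what keeps the automatically grown inventory from expanding without bound.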
The prerequisites of this approach are fewer than in previous work on unit derivation;
notably, the timings of word boundaries are not required here. The approach is language-independent,
since it is entirely data-driven and requires no linguistic information.
The dictionary generation method outperforms a supervised method using phonetic
data. The automatically derived units and dictionary perform reasonably well on a small
spontaneous speech task, although they do not yet outperform phones.