Recognizing GSM Digital Speech
The Global System for Mobile Communications (GSM) environment poses three main problems for automatic speech recognition (ASR) systems: noisy scenarios, source coding distortion, and transmission errors. The first has already received much attention; source coding distortion and transmission errors, however, still need to be addressed explicitly. In this paper, we propose an alternative front-end for speech recognition over GSM networks, designed specifically to be effective against source coding distortion and transmission errors. We suggest extracting the recognition feature vectors directly from the encoded speech (i.e., the bitstream) instead of decoding it and then extracting the feature vectors. This approach offers two significant advantages. First, the recognition system is affected only by the quantization distortion of the spectral envelope, so we avoid the other sources of distortion introduced by the encoding-decoding process. Second, when transmission errors occur, our front-end becomes more effective, since it is not affected by errors in the bits allocated to the excitation signal. We have considered the half-rate and full-rate standard codecs and compared the proposed front-end with the conventional approach in two ASR tasks: speaker-independent isolated digit recognition and speaker-independent continuous speech recognition. In general, our approach outperforms the conventional procedure for a variety of simulated channel conditions. Furthermore, the disparity increases as the network conditions worsen.
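The idea of deriving features from the spectral-envelope parameters alone can be illustrated with a minimal sketch. It assumes the bitstream has already been unpacked into dequantized log-area ratios (LARs), as carried by the GSM full-rate codec, and that features are LPC cepstra. The abstract confirms neither choice, and the function names, the LAR definition LAR = log10((1 + r) / (1 - r)), and the sign convention A(z) = 1 + sum a_j z^-j are all assumptions made for illustration.

```python
def lar_to_reflection(lar):
    """Invert the assumed definition LAR = log10((1 + r) / (1 - r))."""
    return [(10.0 ** g - 1.0) / (10.0 ** g + 1.0) for g in lar]


def reflection_to_lpc(refl):
    """Step-up recursion: reflection coefficients -> a_1..a_p,
    for A(z) = 1 + sum_j a_j z^-j (assumed sign convention)."""
    a = []
    for k in refl:
        m = len(a)  # order reached so far
        a = [a[j] + k * a[m - 1 - j] for j in range(m)] + [k]
    return a


def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c_1..c_n of the all-pole envelope 1/A(z):
    c_n = -a_n - (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k}, with a_n = 0 for n > p."""
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        acc = sum(k * c[k - 1] * a[n - k - 1] for k in range(max(1, n - p), n))
        a_n = a[n - 1] if n <= p else 0.0
        c.append(-a_n - acc / n)
    return c


def features_from_lar(lar, n_ceps=12):
    """Hypothetical front-end step: one frame of LARs -> one cepstral vector."""
    return lpc_to_cepstrum(reflection_to_lpc(lar_to_reflection(lar)), n_ceps)
```

Because only the envelope parameters enter the computation, bit errors in the excitation fields of a frame leave the resulting feature vector unchanged, which is the property the abstract highlights.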
Modelling the effects of spontaneous speech in speech recognition
The intrinsic variability of the speaker in spontaneous speech remains a challenge for state-of-the-art automatic speech recognition (ASR). While planned speech exhibits moderate variability, the significant variability of spontaneous speech is caused by situation, context, intention, emotion, and listeners. This conditioning of speech is observable in terms of speaking rate and in feature space.
We analysed broadcast news (BN) and broadcast conversational (BC) speech in terms of phoneme rate (PR) and feature space reduction (FSR), and contrasted both with the planned speech data. Strong, statistically significant differences were revealed. We cluster the speech segments with respect to their degree of PR and FSR, forming a set of variability classes, and incorporate the variability classes into the hidden Markov model (HMM) based acoustic model (AM).
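As a rough illustration of grouping segments into variability classes, the sketch below computes a phoneme rate per segment and runs a small k-means over (PR, FSR) pairs. The abstract does not say which clustering method was used or how FSR is computed, so k-means, the function names, and the FSR values passed in are assumptions. The two dimensions are also assumed to be on comparable scales.

```python
def phoneme_rate(n_phonemes, duration_s):
    """Phonemes per second for one speech segment."""
    return n_phonemes / duration_s


def _dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def _assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: _dist2(p, centers[c]))
            for p in points]


def cluster_segments(points, k, iters=20):
    """Plain k-means over (PR, FSR) pairs; returns (labels, centers).
    Initial centers are spread evenly over the points sorted by PR,
    which keeps the sketch deterministic."""
    pts = sorted(points)
    centers = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        labels = _assign(points, centers)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return _assign(points, centers), centers
```

Each resulting label could then serve as the variability class attached to a segment when training class-dependent acoustic models.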
In recognition we follow two approaches: the first treats the variability class as a context variable; the second relies on prior estimation of the variability class after the first pass of a multi-pass recognition system. Besides explicitly modelling the intrinsic speech variability of the speaker, we furthermore separate out the general speaker-specific characteristics by means of speaker adaptive training (SAT) into feature space transforms using Constrained Maximum Likelihood Linear Regression (CMLLR), and apply the adaptive approach in third-pass recognition.
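A CMLLR feature space transform is an affine map applied to every feature vector of a speaker, x' = A x + b. The sketch below shows only how such a transform is applied once it is available; estimating A and b by maximum likelihood is not shown, and the function name and data layout are assumptions made for illustration.

```python
def apply_cmllr(frames, A, b):
    """Apply x' = A x + b to each feature vector in `frames`.
    `A` is a list of rows (d x d) and `b` a list of length d;
    both are assumed to have been estimated for one speaker."""
    d = len(b)
    return [
        [sum(A[i][j] * x[j] for j in range(d)) + b[i] for i in range(d)]
        for x in frames
    ]


# Usage: a hypothetical 2-dimensional transform for one speaker.
A = [[2.0, 0.0], [0.0, 1.0]]
b = [1.0, -1.0]
adapted = apply_cmllr([[1.0, 2.0]], A, b)
```

In a multi-pass system of the kind described, the transformed frames would be fed to the speaker-adaptively trained acoustic model in the later pass.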
By modelling both within-speaker and between-speaker variation in spontaneous speech, we address two fundamental sources of speech variability that determine the performance of ASR systems.
Peer Reviewed. Postprint (published version)