421 research outputs found
Band-pass filtering of the time sequences of spectral parameters for robust wireless speech recognition
In this paper we address the problem of automatic speech recognition (ASR) when wireless speech communication systems are involved. In this context, three main sources of distortion should be considered: acoustic environment, speech coding and transmission errors. Whilst the first has already received a lot of attention, the last two deserve further investigation in our opinion. We have found that band-pass filtering of the recognition features improves ASR performance when distortions due to these particular communication systems are present. Furthermore, we have evaluated two alternative configurations at different bit error rates (BER) typical of these channels: band-pass filtering of the LP-MFCC parameters and a modification of RASTA-PLP that uses a sharper low-pass section. These perform consistently better than LP-MFCC and RASTA-PLP, respectively.
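The idea of band-pass filtering the time sequence of each spectral parameter can be sketched as follows. The abstract does not give the filter coefficients, so this sketch substitutes the commonly cited RASTA transfer function, H(z) = 0.1 (2 + z^-1 - z^-3 - 2 z^-4) / (1 - pole z^-1). The function names `rasta_like_filter` and `filter_features` and the default `pole=0.94` are assumptions made for illustration, not values taken from the paper.

```python
def rasta_like_filter(trajectory, pole=0.94):
    """Band-pass filter one feature trajectory (one coefficient over time).

    Assumed filter: the classic RASTA form. The FIR numerator sums to
    zero, so constant (DC) offsets are removed; the single pole
    provides the low-pass section. The paper's own filters may differ.
    """
    numer = [0.2, 0.1, 0.0, -0.1, -0.2]  # 0.1 * [2, 1, 0, -1, -2]
    out = []
    prev_y = 0.0
    for t in range(len(trajectory)):
        acc = 0.0
        for k, b in enumerate(numer):
            if t - k >= 0:  # samples before the start are treated as zero
                acc += b * trajectory[t - k]
        y = acc + pole * prev_y  # recursive (IIR) part
        out.append(y)
        prev_y = y
    return out


def filter_features(frames, pole=0.94):
    """Apply the filter along time to every coefficient of a feature matrix.

    frames: list of frames, each frame a list of coefficients.
    Returns a list of frames with the same shape.
    """
    n_coef = len(frames[0])
    columns = [
        rasta_like_filter([frame[c] for frame in frames], pole)
        for c in range(n_coef)
    ]
    return [[columns[c][t] for c in range(n_coef)] for t in range(len(frames))]
```

Because the filter runs along the time axis of each coefficient independently, a slowly varying or constant bias in a feature decays towards zero, while variations at intermediate modulation rates pass through.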
Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation
Automatic Speech Recognition (ASR) in conversational settings presents unique
challenges, including extracting relevant contextual information from previous
conversational turns. Due to irrelevant content, error propagation, and
redundancy, existing methods struggle to extract longer and more effective
contexts. To address this issue, we introduce a novel Conversational ASR
system, extending the Conformer encoder-decoder model with cross-modal
conversational representation. Our approach leverages a cross-modal extractor
that combines pre-trained speech and text models through a specialized encoder
and a modal-level mask input. This enables the extraction of richer historical
speech context without explicit error propagation. We also incorporate
conditional latent variational modules to learn conversational level attributes
such as role preference and topic coherence. By introducing both cross-modal
and conversational representations into the decoder, our model retains context
over longer sentences without information loss, achieving relative accuracy
improvements of 8.8% and 23% on Mandarin conversation datasets HKUST and
MagicData-RAMC, respectively, compared to the standard Conformer model.
Comment: Submitted to TASL
Recognizing GSM Digital Speech
The Global System for Mobile (GSM) environment encompasses three main problems for automatic speech recognition (ASR) systems: noisy scenarios, source coding distortion, and transmission errors. The first one has already received much attention; however, source coding distortion and transmission errors must be explicitly addressed. In this paper, we propose an alternative front-end for speech recognition over GSM networks. This front-end is specially conceived to be effective against source coding distortion and transmission errors. Specifically, we suggest extracting the recognition feature vectors directly from the encoded speech (i.e., the bitstream) instead of decoding it and subsequently extracting the feature vectors. This approach offers two significant advantages. First, the recognition system is only affected by the quantization distortion of the spectral envelope. Thus, we avoid the influence of other sources of distortion introduced by the encoding-decoding process. Second, when transmission errors occur, our front-end becomes more effective since it is not affected by errors in bits allocated to the excitation signal. We have considered the half-rate and full-rate standard codecs and compared the proposed front-end with the conventional approach in two ASR tasks, namely, speaker-independent isolated digit recognition and speaker-independent continuous speech recognition. In general, our approach outperforms the conventional procedure for a variety of simulated channel conditions. Furthermore, the disparity increases as the network conditions worsen.
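One plausible step in a bitstream-based front-end is converting spectral-envelope parameters into cepstral features without resynthesising the waveform. The abstract does not describe the exact procedure, so the sketch below only shows the standard recursion from linear-prediction coefficients to cepstral coefficients. It assumes the all-pole model H(z) = 1 / (1 - sum_k a_k z^-k) and assumes the predictor coefficients have already been recovered from the codec parameters; the function name `lpc_to_cepstrum` is hypothetical.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert predictor coefficients to cepstral coefficients.

    a: list [a_1, ..., a_p] for the all-pole model
       H(z) = 1 / (1 - sum_k a_k z^-k)   (assumed sign convention).
    n_ceps: number of cepstral coefficients c_1..c_n to return.
    The gain term c_0 is not computed here.
    """
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        # Direct term a_n exists only up to the prediction order p.
        acc = a[n - 1] if n <= p else 0.0
        # Recursive term: sum over k of (k/n) * c_k * a_{n-k}.
        for k in range(1, n):
            if 1 <= n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c
```

For a single-pole model with a_1 = a, the recursion reproduces the known closed form c_n = a^n / n, which gives a quick sanity check on the sign convention.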
- …