4 research outputs found
On the Effectiveness of Neural Text Generation based Data Augmentation for Recognition of Morphologically Rich Speech
Advanced neural network models have penetrated Automatic Speech Recognition
(ASR) in recent years; however, in language modeling many systems still rely on
traditional Back-off N-gram Language Models (BNLM) partly or entirely. The
reasons for this are the high cost and complexity of training and using neural
language models, whose use is mostly possible only by adding a second decoding
pass (rescoring).
In our recent work we have significantly improved the online performance of a
conversational speech transcription system by transferring knowledge from a
Recurrent Neural Network Language Model (RNNLM) to the single pass BNLM with
text generation based data augmentation. In the present paper we analyze the
amount of transferable knowledge and demonstrate that the neural augmented LM
(RNN-BNLM) can help to capture almost 50% of the knowledge of the RNNLM while
dropping the second decoding pass and making the system real-time capable. We
also systematically compare word and subword LMs and show that subword-based
neural text augmentation can be especially beneficial in under-resourced
conditions. In addition, we show that by using the RNN-BNLM in the first pass
followed by a neural second pass, offline ASR results can even be significantly
improved.
Comment: 8 pages, 2 figures, accepted for publication at TSD 202
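The text-generation-based augmentation described in this abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_lm` is a hand-written next-word table standing in for a trained RNNLM, the function names are hypothetical, and plain bigram counting stands in for estimating a smoothed back-off n-gram model with a language modeling toolkit. It is not the authors' implementation.

```python
import random
from collections import Counter


def sample_sentence(next_word_probs, rng, max_len=20):
    """Sample one sentence from a generative LM given as a next-word table."""
    words, prev = [], "<s>"
    for _ in range(max_len):
        candidates, weights = zip(*next_word_probs[prev].items())
        prev = rng.choices(candidates, weights=weights, k=1)[0]
        if prev == "</s>":
            break
        words.append(prev)
    return words


def bigram_counts(sentences):
    """Count bigrams, including sentence-start and sentence-end markers."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Hypothetical stand-in for a trained neural LM (NOT a real RNNLM).
toy_lm = {
    "<s>": {"hello": 0.6, "good": 0.4},
    "hello": {"there": 0.5, "</s>": 0.5},
    "good": {"morning": 1.0},
    "there": {"</s>": 1.0},
    "morning": {"</s>": 1.0},
}

rng = random.Random(0)
original_corpus = [["hello", "there"]]  # small in-domain training text
generated = [sample_sentence(toy_lm, rng) for _ in range(1000)]

# The n-gram statistics are collected from original plus generated text,
# so the single-pass model sees word sequences proposed by the neural LM.
augmented_counts = bigram_counts(original_corpus + generated)
```

In this sketch the original corpus never contains "good morning", yet the augmented counts do, which is the sense in which knowledge is transferred from the generative model to the n-gram statistics.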
Investigation on N-gram Approximated RNNLMs for Recognition of Morphologically Rich Speech
Recognition of Hungarian conversational telephone speech is challenging due
to the informal style and morphological richness of the language. A Recurrent
Neural Network Language Model (RNNLM) can provide a remedy for the high
perplexity of the task; however, two-pass decoding introduces a considerable
processing delay. In order to eliminate this delay, we investigate approaches
that reduce the complexity of the RNNLM while preserving its accuracy. We
compare the performance of conventional back-off n-gram language models (BNLM),
BNLM approximation of RNNLMs (RNN-BNLM) and RNN n-grams in terms of perplexity
and word error rate (WER). Morphological richness is often addressed by using
statistically derived subwords - morphs - in the language models, hence our
investigations are extended to morph-based models as well. We found that by
using RNN-BNLMs, 40% of the RNNLM perplexity reduction can be recovered, which is
roughly equal to the performance of an RNN 4-gram model. Combining morph-based
modeling and approximation of the RNNLM, we were able to achieve an 8% relative WER
reduction and preserve real-time operation of our conversational telephone
speech recognition system.
Comment: 12 pages, 2 figures, accepted for publication at SLSP 201
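The two percentages quoted in this abstract can be read as simple ratios. The sketch below shows one plausible reading, assuming that "recovered perplexity reduction" means the share of the BNLM-to-RNNLM perplexity gap closed by the approximated model; the abstract does not state the exact formula, the function names are hypothetical, and the input values are made up for illustration and are not results from the paper.

```python
def recovered_fraction(ppl_bnlm, ppl_rnn_bnlm, ppl_rnnlm):
    """Share of the BNLM-to-RNNLM perplexity gap closed by the RNN-BNLM."""
    return (ppl_bnlm - ppl_rnn_bnlm) / (ppl_bnlm - ppl_rnnlm)


def relative_wer_reduction(wer_baseline, wer_new):
    """Relative word error rate reduction with respect to the baseline."""
    return (wer_baseline - wer_new) / wer_baseline


# Made-up inputs chosen only to illustrate the arithmetic (NOT paper results).
example_recovered = recovered_fraction(200.0, 180.0, 150.0)  # 20 of 50 points
example_wer_gain = relative_wer_reduction(25.0, 23.0)        # 2 of 25 points
```

With these illustrative inputs, the first ratio corresponds to 40% of the gap and the second to an 8% relative reduction.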
Automatic Speech Recognition with Very Large Conversational Finnish and Estonian Vocabularies
Today, the vocabulary size for language models in large vocabulary speech recognition is typically several hundred thousand words. While this is already sufficient in some applications, out-of-vocabulary words still limit usability in others. In agglutinative languages, the vocabulary for conversational speech should include millions of word forms to cover the spelling variations due to colloquial pronunciations, in addition to word compounding and inflections. Very large vocabularies are also needed, for example, when the recognition of rare proper names is important.
Peer reviewed