Globally Normalising the Transducer for Streaming Speech Recognition
The Transducer (e.g. RNN-Transducer or Conformer-Transducer) generates an
output label sequence as it traverses the input sequence. It is straightforward
to use in streaming mode, where it generates partial hypotheses before the
complete input has been seen. This makes it popular in speech recognition.
However, in streaming mode the Transducer has a mathematical flaw which, simply
put, restricts the model's ability to change its mind. The fix is to replace
local normalisation (e.g. a softmax) with global normalisation, but then the
loss function becomes impossible to evaluate exactly. A recent paper proposes
to solve this by approximating the model, severely degrading performance.
Instead, this paper proposes to approximate the loss function, allowing global
normalisation to apply to a state-of-the-art streaming model. Global
normalisation reduces its word error rate by 9-11% relative, closing almost
half the gap between streaming and lookahead mode.

Comment: 9 pages plus references and appendices
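The contrast between local and global normalisation can be illustrated on a toy label-sequence model. This is a minimal sketch with made-up transition scores, not the paper's Transducer: locally normalised probabilities apply a softmax over the next label at every step, while globally normalised probabilities use a single partition function over all complete paths (exact here, intractable for real models).

```python
import math

# Hypothetical transition scores S[(prev_label, label)] for a 2-label
# alphabet over 2 steps; None marks the start of the sequence.
S = {(None, 0): 2.0, (None, 1): 0.5,
     (0, 0): 0.1, (0, 1): 1.5,
     (1, 0): 1.0, (1, 1): 0.2}

def path_score(path):
    """Sum of unnormalised transition scores along a label path."""
    prev, total = None, 0.0
    for k in path:
        total += S[(prev, k)]
        prev = k
    return total

def local_prob(path):
    """Locally normalised: softmax over the next label at each step."""
    prev, p = None, 1.0
    for k in path:
        z = sum(math.exp(S[(prev, j)]) for j in (0, 1))
        p *= math.exp(S[(prev, k)]) / z
        prev = k
    return p

def global_prob(path):
    """Globally normalised: one partition function over all full paths."""
    z = sum(math.exp(path_score((a, b))) for a in (0, 1) for b in (0, 1))
    return math.exp(path_score(path)) / z
```

Both functions define valid distributions (each sums to 1 over the four paths), but because the scores depend on the previous label, the two normalisation schemes assign different probabilities to the same path; in streaming decoding this gap is what lets a globally normalised model "change its mind" about a prefix.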
VQ-T: RNN Transducers using Vector-Quantized Prediction Network States
Beam search, which is the dominant ASR decoding algorithm for end-to-end
models, generates tree-structured hypotheses. However, recent studies have
shown that decoding with hypothesis merging can achieve a more efficient search
with comparable or better performance. However, the full context maintained by
recurrent networks is incompatible with hypothesis merging. We propose to use
vector-quantized long short-term memory units (VQ-LSTM) in the prediction
network of RNN transducers. By training the discrete representation jointly
with the ASR network, hypotheses can be actively merged for lattice generation.
Our experiments on the Switchboard corpus show that the proposed VQ RNN
transducers improve ASR performance over transducers with regular prediction
networks while also producing denser lattices with a very low oracle word error
rate (WER) for the same beam size. Additional language model rescoring
experiments also demonstrate the effectiveness of the proposed lattice
generation scheme.

Comment: Interspeech 2022 accepted paper
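The merging idea can be sketched in a few lines. This is an illustration with a hypothetical codebook and invented hypotheses, not the paper's VQ-LSTM: each continuous prediction-network state is snapped to its nearest codebook entry, and hypotheses that land on the same code index can be merged into one lattice node instead of growing separate tree branches.

```python
# Hypothetical 4-entry codebook in a 3-dimensional state space.
CODEBOOK = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
            (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

def quantize(state):
    """Map a continuous prediction-network state to its nearest code index."""
    def dist2(code):
        return sum((a - b) ** 2 for a, b in zip(state, code))
    return min(range(len(CODEBOOK)), key=lambda i: dist2(CODEBOOK[i]))

# Key the beam by quantized state rather than exact label history: two
# hypotheses with nearby continuous states collapse onto the same code,
# and only the higher-scoring one is kept (invented example hypotheses).
merged = {}
for hyp, state, score in [("a cat", (0.9, 0.1, 0.05), -1.2),
                          ("a cap", (1.1, -0.05, 0.0), -1.5)]:
    code = quantize(state)
    if code not in merged or score > merged[code][1]:
        merged[code] = (hyp, score)
```

Here both states quantize to the same code, so the beam holds one entry where a tree-structured search would hold two, which is how the quantized states yield denser lattices at the same beam size.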