As a popular modeling approach for end-to-end speech recognition,
attention-based encoder-decoder models are known to suffer from the length
bias and the corresponding beam problem. Various approaches have been applied
within simple beam search to ease the problem, most of which are
heuristic-based and require considerable tuning. We show that such heuristics
do not constitute a proper modeling refinement, which results in severe
performance degradation when the beam size is greatly increased. We propose a
novel beam search derived from reinterpreting the sequence posterior with
explicit length modeling.
By applying the reinterpreted probability together with beam pruning, the
obtained final probability leads to a robust model modification that allows
reliable comparison among output sequences of different lengths.
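As a minimal sketch of this reinterpretation (notation ours, not taken
verbatim from the paper), the sequence posterior can be factorized with an
explicit length model:
\[
P(a_1^N \mid x) = P(N \mid x)\, P(a_1^N \mid N, x),
\qquad
P(N \mid x) = \sum_{\tilde{a}_1^N} P(\tilde{a}_1^N \mid x),
\]
where the sum runs over all label sequences of length N; within beam search,
such a length posterior would be approximated using only the hypotheses that
survive pruning.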
Experimental verification on the LibriSpeech corpus shows that the proposed
approach solves the length bias problem without heuristics or additional
tuning effort. It
provides robust decision making and consistently good performance under both
small and very large beam sizes. Compared with the best results of the
heuristic baseline, the proposed approach achieves the same WER on the 'clean'
sets and a 4% relative improvement on the 'other' sets. We also show that it
is more efficient with the additionally derived early-stopping criterion.