Revisiting the Entropy Semiring for Neural Speech Recognition
In streaming settings, speech recognition models have to map sub-sequences of
speech to text before the full audio stream becomes available. However, since
alignment information between speech and text is rarely available during
training, models need to learn it in a completely self-supervised way. In
practice, the exponential number of possible alignments makes this extremely
challenging, with models often learning peaky or sub-optimal alignments. Prima
facie, the exponential nature of the alignment space makes it difficult to even
quantify the uncertainty of a model's alignment distribution. Fortunately, it
has been known for decades that the entropy of a probabilistic finite state
transducer can be computed in time linear in the size of the transducer via a
dynamic programming reduction based on semirings. In this work, we revisit the
entropy semiring for neural speech recognition models, and show how alignment
entropy can be used to supervise models through regularization or distillation.
We also contribute an open-source implementation of CTC and RNN-T in the
semiring framework that includes numerically stable and highly parallel
variants of the entropy semiring. Empirically, we observe that the addition of
alignment distillation improves the accuracy and latency of an already
well-optimized teacher-student distillation model, achieving state-of-the-art
performance on the Librispeech dataset in the streaming scenario.
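
The dynamic programming reduction mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's open-source implementation: it works in plain probability space rather than the numerically stable variants the abstract refers to, and it assumes a toy acyclic lattice whose states are numbered in topological order, with state 0 as the start and the last state as the only final state. The names `lift`, `path_entropy`, and `brute_force_entropy` are hypothetical. Each arc weight w is lifted to the pair (w, -w log w), and a single forward pass in the entropy semiring yields the entropy of the path distribution without enumerating paths.

```python
import math

# Entropy semiring over pairs (p, r): p accumulates path weight,
# r accumulates -p * log(p) summed over the paths seen so far.
ZERO = (0.0, 0.0)
ONE = (1.0, 0.0)

def s_add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def s_mul(a, b):
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

def lift(w):
    """Map an arc weight w > 0 to its entropy-semiring element."""
    return (w, -w * math.log(w)) if w > 0 else ZERO

def path_entropy(arcs, num_states):
    """Entropy of the normalized path distribution of an acyclic lattice.

    arcs: list of (src, dst, weight) with src < dst (assumed topological order).
    Runs in time linear in the number of arcs, apart from the sort.
    """
    alpha = [ZERO] * num_states
    alpha[0] = ONE
    for src, dst, w in sorted(arcs):  # arcs leaving earlier states first
        alpha[dst] = s_add(alpha[dst], s_mul(alpha[src], lift(w)))
    z, r = alpha[-1]
    return math.log(z) + r / z

def brute_force_entropy(arcs, num_states):
    """Reference value obtained by enumerating every path (exponential)."""
    weights = []
    def walk(state, p):
        if state == num_states - 1:
            weights.append(p)
            return
        for src, dst, w in arcs:
            if src == state:
                walk(dst, p * w)
    walk(0, 1.0)
    z = sum(weights)
    return -sum((p / z) * math.log(p / z) for p in weights)

# Toy lattice: two parallel arcs 0 -> 1, then two parallel arcs 1 -> 2.
example_arcs = [(0, 1, 0.5), (0, 1, 0.5), (1, 2, 0.25), (1, 2, 0.75)]
print(path_entropy(example_arcs, 3), brute_force_entropy(example_arcs, 3))
```

The two printed values should agree up to floating-point error. The forward pass visits each arc once, whereas the brute-force check grows with the number of paths. For CTC or RNN-T lattices, a log-space formulation would be needed to avoid underflow.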