Reducing the gap between streaming and non-streaming Transducer-based ASR by adaptive two-stage knowledge distillation
The transducer is one of the mainstream frameworks for streaming speech recognition. A performance gap exists between streaming and non-streaming transducer models because the streaming model has access to only limited context. An effective way to reduce this gap is to make the hidden and output distributions of the two models consistent, which can be achieved by hierarchical knowledge distillation. However, it is difficult to enforce consistency of both distributions simultaneously, because learning the output distribution depends on the hidden one. In this paper, we propose an adaptive two-stage knowledge distillation method consisting of hidden layer learning and output layer learning. In the former stage, the streaming model learns hidden representations with full context by applying a mean squared error loss. In the latter stage, we design a power-transformation-based adaptive smoothing method to learn a stable output distribution. The proposed method achieves a 19% relative reduction in word error rate and a faster response for the first token compared with the original streaming model on the LibriSpeech corpus.
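To make the two-stage recipe concrete, the following is a minimal PyTorch sketch of the two distillation losses. All function names, tensor shapes, and the fixed `alpha` exponent are illustrative assumptions, not the authors' implementation; in particular, the paper's adaptive scheme would choose the power-transform exponent dynamically, whereas this sketch uses a constant.

```python
# Sketch of the two distillation stages; shapes and hyperparameters are
# illustrative assumptions, not the paper's exact configuration.
import torch.nn.functional as F

def hidden_distillation_loss(student_hidden, teacher_hidden):
    """Stage 1: align the streaming (student) hidden states with the
    full-context non-streaming (teacher) hidden states via MSE."""
    return F.mse_loss(student_hidden, teacher_hidden.detach())

def power_smoothed_distillation_loss(student_logits, teacher_logits, alpha=0.5):
    """Stage 2: distill the output distribution after smoothing the teacher's
    probabilities with a power transform p**alpha / sum(p**alpha); alpha < 1
    flattens the distribution (an adaptive variant would tune alpha)."""
    teacher_probs = F.softmax(teacher_logits.detach(), dim=-1)
    smoothed = teacher_probs.pow(alpha)
    smoothed = smoothed / smoothed.sum(dim=-1, keepdim=True)
    log_student = F.log_softmax(student_logits, dim=-1)
    # KL(smoothed teacher || student): summed over vocabulary and frames,
    # averaged over the leading batch dimension by "batchmean".
    return F.kl_div(log_student, smoothed, reduction="batchmean")
```

In the first stage only the hidden loss would be optimized; once the hidden representations are aligned, training would switch to the smoothed output-distribution loss.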
Grammar-Supervised End-to-End Speech Recognition with Part-of-Speech Tagging and Dependency Parsing
For most automatic speech recognition systems, unacceptable hypothesis errors still make the recognition results absurd and difficult to understand. In this paper, we introduce grammar information to improve performance in terms of grammatical deviation distance and to increase the readability of hypotheses. We reinforce word embeddings with grammar embeddings to strengthen the expression of grammar. An auxiliary text-to-grammar task is added to improve recognition performance as measured by downstream-task evaluation. Furthermore, multiple grammar evaluation methods are used to explore an extensible paradigm for applying grammar knowledge. Experiments on the small open-source Mandarin speech corpus AISHELL-1 and the large private Mandarin speech corpus TRANS-M show that our method performs well with no additional data. It achieves relative character error rate reductions of 3.2% and 5.0%, and relative grammatical deviation distance reductions of 4.7% and 5.9%, on AISHELL-1 and TRANS-M, respectively. Moreover, the grammar-based mean opinion scores of our method are about 4.29 and 3.20, significantly higher than the baseline scores of 4.11 and 3.02.
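As a rough illustration of the grammar-reinforcement idea, the PyTorch sketch below fuses a part-of-speech (POS) embedding into the word embedding and attaches an auxiliary text-to-grammar head that predicts the POS sequence. The vocabulary sizes, dimensions, additive fusion, and all names are hypothetical; the paper's architecture, which also exploits dependency parsing, may differ.

```python
# Hypothetical sketch: grammar-reinforced embeddings plus an auxiliary
# text-to-grammar (POS prediction) head for multi-task training.
import torch
import torch.nn as nn

class GrammarReinforcedEmbedding(nn.Module):
    def __init__(self, vocab_size=4000, num_pos_tags=32, dim=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)   # token embedding
        self.pos_emb = nn.Embedding(num_pos_tags, dim)  # grammar embedding
        self.pos_head = nn.Linear(dim, num_pos_tags)    # text-to-grammar head

    def forward(self, token_ids, pos_ids):
        # Reinforce the word representation with grammar information.
        return self.word_emb(token_ids) + self.pos_emb(pos_ids)

    def auxiliary_loss(self, hidden, pos_ids):
        # Auxiliary task: hidden states should also predict the POS tags.
        logits = self.pos_head(hidden)
        return nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), pos_ids.view(-1)
        )

emb = GrammarReinforcedEmbedding()
tokens = torch.randint(0, 4000, (2, 10))  # hypothetical token ids [batch, time]
pos = torch.randint(0, 32, (2, 10))       # hypothetical POS-tag ids
fused = emb(tokens, pos)                  # grammar-aware inputs, [2, 10, 256]
```

In a multi-task setup, the auxiliary loss would be weighted and added to the main ASR objective, which is one plausible reading of the auxiliary text-to-grammar task described above.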