Neural Zero-Inflated Quality Estimation Model For Automatic Speech Recognition System
The performance of automatic speech recognition (ASR) systems is usually
evaluated with the word error rate (WER) metric, which requires manually
transcribed data that are expensive to obtain in real-world scenarios. In
addition, the empirical distribution of WER for most ASR systems tends to put
significant mass near zero, making it difficult to model with a single
continuous distribution. To address these two issues of ASR quality estimation
(QE), we propose a novel neural zero-inflated model that predicts the WER of
an ASR result without transcripts. We design a neural zero-inflated beta
regression on top of a bidirectional transformer language model conditioned on
speech features (speech-BERT). We also adopt the pre-training strategy of
token-level masked language modeling for speech-BERT, and further fine-tune
with our zero-inflated layer for the mixture of discrete and continuous
outputs. The experimental results show that our approach outperforms most
existing quality estimation algorithms for ASR or machine translation on WER
prediction, as measured by Pearson correlation and MAE.

Comment: InterSpeech 202
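The mixture described above can be sketched in a few lines: a point mass at WER = 0 combined with a beta density over (0, 1). This is a minimal illustration of a zero-inflated beta likelihood, not the paper's implementation; the parameter names `pi`, `a`, and `b` are assumptions chosen for clarity.

```python
import math

def beta_pdf(w, a, b):
    """Beta density Beta(w; a, b) for 0 < w < 1, via log-gamma for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(w) + (b - 1) * math.log(1 - w))

def zero_inflated_beta_likelihood(w, pi, a, b):
    """Likelihood of an observed WER w under a zero-inflated beta mixture:
    probability pi of the discrete outcome w == 0, and (1 - pi) times a
    continuous beta density for 0 < w < 1."""
    if w == 0.0:
        return pi
    return (1.0 - pi) * beta_pdf(w, a, b)
```

In the paper's setting, a neural network would predict `pi`, `a`, and `b` from the speech-BERT representation; here they are fixed inputs to keep the sketch self-contained.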
Word Error Rate Estimation Without ASR Output: e-WER2
Measuring the performance of automatic speech recognition (ASR) systems
requires manually transcribed data in order to compute the word error rate
(WER), which is often time-consuming and expensive. In this paper, we continue
our effort in estimating WER using acoustic, lexical and phonotactic features.
Our novel approach to estimate the WER uses a multistream end-to-end
architecture. We report results for systems using internal speech decoder
features (glass-box), systems without speech decoder features (black-box), and
for systems without having access to the ASR system (no-box). The no-box system
learns a joint acoustic-lexical representation from phoneme recognition results
along with MFCC acoustic features to estimate WER. Considering WER per
sentence, our no-box system achieves 0.56 Pearson correlation with the
reference evaluation and 0.24 root mean square error (RMSE) across 1,400
sentences. The estimated overall WER by e-WER2 is 30.9% on a three-hour test
set, while the WER computed using the reference transcriptions was 28.5%.
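Both abstracts hinge on the reference WER that their models try to estimate. As a point of comparison, this is the standard computation: word-level Levenshtein distance (substitutions, insertions, deletions) divided by the reference length. It is a generic sketch, not the evaluation code of either paper.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between ref[:i] and hyp[:j].
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = d[0]          # distance between ref[:i-1] and hyp[:j-1]
        d[0] = i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,         # deletion
                       d[j - 1] + 1,     # insertion
                       prev + (r != h))  # substitution (or match)
            prev = cur
    return d[-1] / max(len(ref), 1)
```

For example, `wer("the cat sat", "the hat sat")` gives one substitution over three reference words, i.e. 1/3; averaging such per-sentence values against a model's predictions is what the Pearson correlation and RMSE figures above summarize.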