Confidence Score Based Speaker Adaptation of Conformer Speech
  Recognition Systems

Cui, Mingyu; Deng, Jiajun; Hu, Shujie; Jin, Zengrui; Li, Guinan; Liu, Xunying; Wang, Tianzi; Xie, Xurong; Xue, Boyang

Authors: Mingyu Cui
Jiajun Deng
Shujie Hu
Zengrui Jin
Guinan Li
Xunying Liu
Tianzi Wang
Xurong Xie
Boyang Xue
Publication date: 15 February 2023

Abstract

Speaker adaptation techniques provide a powerful solution to customise automatic speech recognition (ASR) systems for individual users. Practical application of unsupervised model-based speaker adaptation techniques to data-intensive end-to-end ASR systems is hindered by the scarcity of speaker-level data and performance sensitivity to transcription errors. To address these issues, a set of compact and data-efficient speaker-dependent (SD) parameter representations is used to facilitate both speaker adaptive training and test-time unsupervised speaker adaptation of state-of-the-art Conformer ASR systems. The sensitivity to supervision quality is reduced using a confidence score-based selection of the less erroneous subset of speaker-level adaptation data. Two lightweight confidence score estimation modules are proposed to produce more reliable confidence scores. The data sparsity issue, which is exacerbated by data selection, is addressed by modelling the SD parameter uncertainty using Bayesian learning. Experiments on the benchmark 300-hour Switchboard and the 233-hour AMI datasets suggest that the proposed confidence score-based adaptation schemes consistently outperform the baseline speaker-independent (SI) Conformer model and conventional non-Bayesian, point estimate-based adaptation using no speaker data selection. Similar consistent performance improvements were retained after external Transformer and LSTM language model rescoring. In particular, on the 300-hour Switchboard corpus, statistically significant WER reductions of 1.0%, 1.3%, and 1.4% absolute (9.5%, 10.9%, and 11.3% relative) were obtained over the baseline SI Conformer on the NIST Hub5'00, RT02, and RT03 evaluation sets respectively. Similar WER reductions of 2.7% and 3.3% absolute (8.9% and 10.2% relative) were also obtained on the AMI development and evaluation sets.

Comment: IEEE/ACM Transactions on Audio, Speech, and Language Processing
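The confidence score-based selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Hypothesis` record, the utterance-level granularity, and the fixed threshold of 0.8 are all assumptions made for illustration, and the abstract does not specify how the proposed confidence estimation modules produce their scores. The idea shown is only that first-pass hypotheses with low confidence are dropped before they are used as supervision for speaker adaptation.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """Hypothetical record for one first-pass recognition result."""
    utt_id: str
    speaker: str
    text: str          # recognised text, used as adaptation supervision
    confidence: float  # assumed utterance-level confidence in [0, 1]


def select_adaptation_data(hyps, threshold=0.8):
    """Group hypotheses by speaker, keeping those at or above the threshold.

    The threshold value is an illustrative assumption, not a setting
    taken from the paper.
    """
    selected = {}
    for h in hyps:
        if h.confidence >= threshold:
            selected.setdefault(h.speaker, []).append(h)
    return selected


# Toy data: the confidence values are made up for the example.
hyps = [
    Hypothesis("u1", "spkA", "hello there", 0.93),
    Hypothesis("u2", "spkA", "uh the cat", 0.41),
    Hypothesis("u3", "spkB", "good morning", 0.85),
    Hypothesis("u4", "spkC", "see you", 0.30),
]
selected = select_adaptation_data(hyps, threshold=0.8)
```

In this toy run, speaker `spkC` ends up with no adaptation data at all. That is the data sparsity problem which the abstract says is exacerbated by selection and addressed through Bayesian modelling of SD parameter uncertainty.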

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2302.07521

Last time updated on 06/03/2023