The State of the Art in Speaker Adaptation for Automatic Speech Recognition (ASR)

Abstract

Automatic speech recognition (ASR) draws on linguistics, computer science, and electrical engineering to develop methodologies and algorithms that translate human speech into text. In ASR, speaker adaptation refers to techniques that adjust acoustic features to better model the variation among individual speakers; the goal is to reduce the mismatch between an individual speaker and the acoustic model and thereby lower the word error rate (WER). Adaptation strategies include long short-term memory recurrent neural networks (LSTM-RNN), maximum likelihood linear regression (MLLR) for hidden Markov models (HMM), and i-vectors. Recently, deep neural networks (DNN) have emerged as an alternative modeling approach; combined with earlier adaptation techniques, DNNs have improved ASR performance significantly. This research reviews adaptation techniques used with DNNs, examines existing experimental results, and investigates speaker differences in recognition using a virtual machine (VM) from the Speech Recognition Virtual Kitchen (SRVK). The SRVK toolkit comprises Linux-based VMs that allow users at teaching-focused institutions to participate in ASR research. The TIDIGITS corpus will be used as the training dataset, as it contains sufficient per-speaker data to partition for adaptation experiments. WER is the main indicator for performance evaluation. The work presented includes discussion and comparison of results for each strategy used with DNNs, an overview of the SRVK toolkit, recognition-performance results, and potential methods to improve adaptation within the toolkit.
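Since WER is the evaluation metric throughout, its standard computation can be sketched as the word-level Levenshtein edit distance normalized by the reference length. The sketch below is illustrative only; the function name and example strings are assumptions, not part of the SRVK toolkit.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for word-level Levenshtein distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("too") and one deletion ("four") over 4 reference words:
print(wer("one two three four", "one too three"))  # 0.5
```

Toolkits such as Kaldi report the same quantity (often as a percentage) from their scoring scripts; this standalone version is only meant to make the metric concrete.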
