End-to-end (E2E) automatic speech recognition (ASR) methods exhibit
remarkable performance. However, since the performance of such methods is
intrinsically linked to the context present in the training data, E2E-ASR
methods do not perform as desired for unseen user contexts (e.g., technical
terms, personal names, and playlists). Thus, E2E-ASR methods must be easily
contextualized by the user or developer. This paper proposes an attention-based
contextual biasing method that can be customized using an editable phrase list
(referred to as a bias list). The proposed method can be trained effectively by
combining a bias phrase index loss and special tokens to detect the bias
phrases in the input speech data. In addition, to further improve
contextualization performance during inference, we propose a bias phrase boosted (BPB)
beam search algorithm based on the bias phrase index probability. Experimental
results demonstrate that the proposed method consistently improves the word
error rate on Librispeech-960 (English) and the character error rate on our
in-house Japanese dataset for the target phrases in the bias list.

Comment: accepted by ICASSP 2022
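
As a rough illustration of the idea behind boosting beam search hypotheses with a bias phrase probability, the sketch below re-ranks candidate tokens by adding a weighted bias term to each ASR log-probability. This is not the BPB algorithm from the paper, whose details the abstract does not give. The function name `rank_candidates`, the additive scoring rule, the `bias_weight` parameter, and the toy probabilities are all assumptions made for illustration.

```python
import math


def rank_candidates(candidates, bias_weight=0.5, beam_size=2):
    """Re-rank beam candidates with a hypothetical bias boost.

    candidates: list of (token, asr_log_prob, bias_prob) tuples, where
    bias_prob is an assumed probability that the token continues a
    phrase in the bias list. The additive rule below is illustrative
    only and is not taken from the paper.
    """
    scored = [
        (token, log_prob + bias_weight * bias_prob)
        for token, log_prob, bias_prob in candidates
    ]
    # Keep the highest-scoring hypotheses, as a beam search step would.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:beam_size]


# Toy example: "smyth" is assumed to be in the bias list, "smith" is not.
candidates = [
    ("smith", math.log(0.30), 0.0),
    ("smyth", math.log(0.25), 0.9),
    ("smit", math.log(0.10), 0.0),
]

print(rank_candidates(candidates, bias_weight=0.0))  # no boosting
print(rank_candidates(candidates, bias_weight=1.0))  # with boosting
```

With `bias_weight=0.0` the ranking follows the ASR scores alone, so "smith" is ranked first. With `bias_weight=1.0` the assumed bias probability lifts "smyth" above it, which is the kind of effect a bias list is meant to produce for rare names and terms.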