Personalization in multi-turn dialogs has been a long-standing challenge for
end-to-end automatic speech recognition (E2E ASR) models. Recent work on
contextual adapters has tackled rare word recognition using user catalogs. This
adaptation, however, does not incorporate an important cue, the dialog act,
which is available in a multi-turn dialog scenario. In this work, we propose a
dialog act-guided contextual adapter network. Specifically, it leverages dialog
acts to select the most relevant user catalogs and creates queries based on
both the audio and the semantic relationship between the carrier phrase and the
user catalogs, to better guide contextual biasing. On industrial
voice assistant datasets, our model outperforms both baselines (a dialog act
encoder-only model and contextual adaptation), yielding the largest
improvement over the no-context model: a 58% average relative word error rate
reduction (WERR) in the multi-turn dialog scenario, compared with the 39% WERR
achieved by the prior-art contextual adapter over the no-context
model.
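
For illustration only, the following is a minimal sketch of the general idea described above: a biasing query formed from both the audio encoder state and a dialog-act embedding, cross-attending over user-catalog entry embeddings. All module names, dimensions, and the concatenation-based fusion are assumptions made for this sketch; it is not the authors' implementation.

```python
# Illustrative sketch (assumed PyTorch-style modules, not the paper's implementation).
import torch
import torch.nn as nn


class DialogActGuidedBiasing(nn.Module):
    def __init__(self, audio_dim=256, act_vocab=20, act_dim=64, catalog_dim=256, attn_dim=128):
        super().__init__()
        self.act_emb = nn.Embedding(act_vocab, act_dim)              # dialog-act embedding (assumed)
        self.query_proj = nn.Linear(audio_dim + act_dim, attn_dim)   # fuse audio state + dialog act into a query
        self.key_proj = nn.Linear(catalog_dim, attn_dim)             # project catalog entries to attention keys
        self.value_proj = nn.Linear(catalog_dim, audio_dim)          # values become a biasing vector

    def forward(self, audio_state, dialog_act_id, catalog_embs):
        # audio_state: (B, T, audio_dim); dialog_act_id: (B,); catalog_embs: (B, N, catalog_dim)
        act = self.act_emb(dialog_act_id).unsqueeze(1).expand(-1, audio_state.size(1), -1)
        query = self.query_proj(torch.cat([audio_state, act], dim=-1))       # (B, T, attn_dim)
        keys = self.key_proj(catalog_embs)                                    # (B, N, attn_dim)
        scores = torch.matmul(query, keys.transpose(1, 2)) / keys.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1)                                  # attend over catalog entries
        bias = torch.matmul(attn, self.value_proj(catalog_embs))              # (B, T, audio_dim)
        return audio_state + bias                                             # biased encoder output


# Toy usage with random tensors; shapes are assumptions for illustration.
if __name__ == "__main__":
    m = DialogActGuidedBiasing()
    out = m(torch.randn(2, 10, 256), torch.tensor([3, 7]), torch.randn(2, 5, 256))
    print(out.shape)  # torch.Size([2, 10, 256])
```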