We present dual-attention neural biasing, an architecture designed to boost
wake word (WW) recognition and reduce inference-time latency on speech
recognition tasks. This architecture dynamically switches between runtime
compute paths, using WW spotting to select which branch of its attention
network to execute for each input audio frame. With this approach, we
improve WW spotting accuracy while reducing runtime compute cost, as
measured in floating point operations (FLOPs). Using an in-house de-identified
dataset, we demonstrate that the proposed dual-attention network can reduce the
compute cost by 90% for WW audio frames, with only a 1% increase in the
number of parameters. This architecture improves the WW F1 score by 16% and
the generic rare word error rate by 3%, both relative to the
baselines.

Comment: Accepted to Proc. IEEE ICASSP 202
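As a rough illustration of the per-frame branch selection described in the abstract, the Python sketch below routes each audio-frame encoding to one of two attention branches based on a wake-word flag and reports an approximate FLOP count for the branch it ran. This is not the authors' implementation. The function names `attend` and `dual_attention_step`, the assumption that WW-flagged frames attend only over a small WW key set while other frames attend over a larger biasing catalog, and the simplified FLOP count (dot products and weighted sum only, softmax ignored) are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Branch = Tuple[Sequence[Sequence[float]], Sequence[Sequence[float]]]  # (keys, values)


def attend(query: Sequence[float], keys: Sequence[Sequence[float]],
           values: Sequence[Sequence[float]]) -> Tuple[Vector, int]:
    """Scaled dot-product attention for a single query vector.

    Returns the context vector and an approximate FLOP count:
    2*n*d for the query-key scores plus 2*n*dv for the weighted sum
    (softmax cost is ignored in this simplified count).
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dv = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dv)]
    flops = 2 * len(keys) * d + 2 * len(values) * dv
    return context, flops


def dual_attention_step(frame: Sequence[float], is_ww_frame: bool,
                        ww_branch: Branch, bias_branch: Branch) -> Tuple[Vector, int]:
    """Hypothetical routing: WW-flagged frames use the small WW branch,
    all other frames use the full biasing branch. Only one branch runs."""
    keys, values = ww_branch if is_ww_frame else bias_branch
    return attend(frame, keys, values)


if __name__ == "__main__":
    d = 4
    # Hypothetical WW branch with a single wake-word entry.
    ww_branch = ([[1.0, 0.0, 0.0, 0.0]], [[1.0, 0.0, 0.0, 0.0]])
    # Hypothetical biasing catalog of 50 rare-word entries (keys reused as values).
    bias_keys = [[float((i + j) % 3) for j in range(d)] for i in range(50)]
    bias_branch = (bias_keys, bias_keys)
    frame = [0.5, 0.1, 0.2, 0.3]
    _, ww_flops = dual_attention_step(frame, True, ww_branch, bias_branch)
    _, full_flops = dual_attention_step(frame, False, ww_branch, bias_branch)
    print(ww_flops, full_flops)  # prints "16 800"
```

In this toy setup the WW branch costs 16 FLOPs per frame versus 800 for the full biasing branch. These numbers only show why skipping the large branch on WW frames saves compute; they are not related to the 90% reduction reported above, which comes from the authors' own model and dataset.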