VE-KWS: Visual Modality Enhanced End-to-End Keyword Spotting
The performance of keyword spotting (KWS) systems based on the audio modality,
commonly measured in false alarms and false rejects, degrades significantly
under far-field and noisy conditions. Therefore, audio-visual keyword
spotting, which leverages complementary relationships over multiple modalities,
has recently gained much attention. However, current studies mainly focus on
combining the separately learned representations of different modalities,
rather than exploring inter-modal relationships during the modeling of each modality.
In this paper, we propose a novel visual modality enhanced end-to-end KWS
framework (VE-KWS), which fuses audio and visual modalities from two aspects.
The first is to use the speaker location information obtained from the
lip region in videos to assist the training of a multi-channel audio beamformer.
With the beamformer serving as an audio enhancement module, acoustic
distortions caused by far-field or noisy environments can be
significantly suppressed. The second is to conduct cross-attention between
different modalities to capture inter-modal relationships and aid the
representation learning of each modality. Experiments on the MISP challenge
corpus show that our proposed model achieves a 2.79% false rejection rate and
a 2.95% false alarm rate on the Eval set, establishing new state-of-the-art
performance compared with the top-ranking systems in the ICASSP 2022 MISP challenge.

Comment: 5 pages. Accepted at ICASSP202
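The cross-attention fusion mentioned in the abstract can be sketched as follows. This is a minimal single-head version in NumPy, assuming audio frames act as queries and visual frames as keys/values; the learned projection matrices and multi-head structure a real model would use are omitted for brevity, and all names here are illustrative rather than taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, key_feats):
    """One modality (queries) attends to another (keys/values).

    query_feats: (T_q, d) frames of the querying modality (e.g. audio)
    key_feats:   (T_k, d) frames of the other modality (e.g. visual)
    Returns a (T_q, d) fused representation.
    """
    d = query_feats.shape[-1]
    # Scaled dot-product attention scores: (T_q, T_k).
    scores = query_feats @ key_feats.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    # Each query frame becomes a weighted mix of the other modality's frames.
    return weights @ key_feats

# Toy example: 6 audio frames attend over 4 visual frames, feature dim 8.
rng = np.random.default_rng(0)
audio = rng.standard_normal((6, 8))
visual = rng.standard_normal((4, 8))
fused = cross_attention(audio, visual)
print(fused.shape)  # (6, 8)
```

In the symmetric direction, visual frames would attend over audio frames, so each modality's representation is informed by the other during modeling rather than only fused at the end.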