3 research outputs found
Data Augmentation for Robust Keyword Spotting under Playback Interference
Accurate on-device keyword spotting (KWS) with low false accept and false
reject rates is crucial to the customer experience for far-field voice control
of conversational agents. It is particularly challenging to maintain a low
false reject rate in real-world conditions where there is (a) ambient noise
from external sources such as TV, household appliances, or other speech that
is not directed at the device, and (b) imperfect cancellation of the device's
own audio playback by the Acoustic Echo Cancellation (AEC) system, resulting
in residual echo. In this paper, we propose a data augmentation
strategy to improve keyword spotting performance under these challenging
conditions. The training-set audio is artificially corrupted by mixing in
music and TV/movie audio at different signal-to-interference ratios. Our
results show a relative reduction of around 30-45% in false reject rates,
across a range of false alarm rates, under audio playback from such devices.
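As a rough illustration of this kind of augmentation, the sketch below mixes an interference signal into a speech signal at a target signal-to-interference ratio (SIR). The function name, the SIR range, and the random stand-in signals are illustrative assumptions, not the paper's actual pipeline.

import numpy as np

def mix_at_sir(speech, interference, sir_db):
    """Mix interference into speech at a target signal-to-interference ratio (dB)."""
    # Loop or trim the interference to match the speech length.
    reps = int(np.ceil(len(speech) / len(interference)))
    interference = np.tile(interference, reps)[: len(speech)]

    # Scale the interference so that 10*log10(P_speech / P_scaled_interf) == sir_db.
    p_speech = np.mean(speech ** 2)
    p_interf = np.mean(interference ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_interf * 10 ** (sir_db / 10)))
    return speech + scale * interference

# Example: corrupt a training utterance at a random SIR from an assumed range.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for 1 s of 16 kHz speech
tv_audio = rng.standard_normal(8000)  # stand-in for TV/movie interference
augmented = mix_at_sir(speech, tv_audio, sir_db=rng.uniform(0, 20))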
Improving noise robustness of automatic speech recognition via parallel data and teacher-student learning
For real-world speech recognition applications, noise robustness remains a
challenge. In this work, we adopt the teacher-student (T/S) learning technique,
using a parallel clean and noisy corpus to improve automatic speech recognition
(ASR) performance under multimedia noise. On top of that, we apply a logits
selection method that preserves only the k highest values, both to prevent
wrong emphasis of the teacher's knowledge and to reduce the bandwidth needed
for transferring data. We incorporate up to 8000 hours of untranscribed data
for training, and we present results for sequence-trained models as well as
cross-entropy trained ones. The best sequence-trained student model yields
relative word error rate (WER) reductions of approximately 10.1%, 28.7%, and
19.6% on our clean, simulated noisy, and real test sets, respectively, compared
to a sequence-trained teacher.
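A minimal PyTorch sketch of such a top-k T/S objective follows. Keeping only the teacher's k highest posteriors and renormalizing them is one plausible reading of the logits selection method; the function name, the value of k, and the class count are illustrative assumptions.

import torch
import torch.nn.functional as F

def topk_ts_loss(student_logits, teacher_logits, k=20):
    """T/S loss using only the teacher's top-k posteriors as soft targets."""
    # Teacher posteriors, restricted to the k most probable classes.
    teacher_post = F.softmax(teacher_logits, dim=-1)
    topk_vals, topk_idx = teacher_post.topk(k, dim=-1)
    topk_vals = topk_vals / topk_vals.sum(dim=-1, keepdim=True)  # renormalize

    # Student log-posteriors gathered at the same class indices.
    student_logp = F.log_softmax(student_logits, dim=-1).gather(-1, topk_idx)

    # Cross entropy between top-k teacher targets and student predictions.
    return -(topk_vals * student_logp).sum(dim=-1).mean()

# Example: teacher sees clean audio, student sees the parallel noisy version.
B, C = 32, 9000                     # batch of frames, class count (assumed)
teacher_logits = torch.randn(B, C)  # from the clean-trained teacher
student_logits = torch.randn(B, C, requires_grad=True)
loss = topk_ts_loss(student_logits, teacher_logits, k=20)
loss.backward()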
Training Keyword Spotters with Limited and Synthesized Speech Data
With the rise of low power speech-enabled devices, there is a growing demand
to quickly produce models for recognizing arbitrary sets of keywords. As with
many machine learning tasks, one of the most challenging parts in the model
creation process is obtaining a sufficient amount of training data. In this
paper, we explore the effectiveness of synthesized speech data in training
small spoken-term detection models of around 400k parameters. Instead of
training such models directly on audio or on low-level features such as MFCCs,
we use a pre-trained speech embedding model that extracts features useful for
keyword spotting. Using this speech embedding, we show that a model detecting
10 keywords, when trained only on synthetic speech, is equivalent to a model
trained on over 500 real examples. We also show that, without our speech
embeddings, a model would need to be trained on over 4000 real examples to
reach the same accuracy.
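A rough sketch of this embedding-based setup: a small keyword head trained on frozen, pre-computed embeddings rather than on raw audio. The embedding dimension, layer sizes, and toy training loop are illustrative assumptions and do not reproduce the paper's ~400k-parameter architecture.

import torch
import torch.nn as nn

EMB_DIM, NUM_KEYWORDS = 96, 10  # assumed embedding width; 10 target keywords

class KeywordHead(nn.Module):
    """Small detector trained on speech embeddings instead of audio/MFCCs."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EMB_DIM, 256), nn.ReLU(),
            nn.Linear(256, NUM_KEYWORDS),
        )

    def forward(self, emb):  # emb: (batch, EMB_DIM)
        return self.net(emb)

# Toy training loop: in practice the embeddings would come from a frozen
# pre-trained model applied to synthesized (TTS) keyword utterances.
head = KeywordHead()
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(10):
    emb = torch.randn(32, EMB_DIM)  # stand-in for frozen embeddings
    labels = torch.randint(0, NUM_KEYWORDS, (32,))
    loss = loss_fn(head(emb), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()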