3 research outputs found
Data Augmentation for Robust Keyword Spotting under Playback Interference
Accurate on-device keyword spotting (KWS) with low false accept and false
reject rates is crucial to the customer experience for far-field voice control
of conversational agents. It is particularly challenging to maintain a low
false reject rate in real-world conditions where there is (a) ambient noise
from external sources such as TV, household appliances, or other speech that
is not directed at the device, and (b) imperfect cancellation of the device's
own audio playback by the Acoustic Echo Cancellation (AEC) system, resulting
in residual echo. In this paper, we propose a data augmentation
strategy to improve keyword spotting performance under these challenging
conditions. The training-set audio is artificially corrupted by mixing in
music and TV/movie audio at different signal-to-interference ratios. Our
results show a relative reduction of around 30-45% in false reject rates,
across a range of false alarm rates, under audio playback from such devices.
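As a rough illustration of this kind of augmentation, the sketch below mixes an interference signal into a speech signal at a target signal-to-interference ratio (SIR). The function name, the SIR range, and the random stand-in signals are illustrative assumptions, not the paper's actual pipeline.

import numpy as np

def mix_at_sir(speech, interference, sir_db):
    """Mix interference into speech at a target signal-to-interference ratio (dB)."""
    # Loop or trim the interference to match the speech length.
    reps = int(np.ceil(len(speech) / len(interference)))
    interference = np.tile(interference, reps)[: len(speech)]

    # Scale the interference so that 10*log10(P_speech / P_scaled_interf) == sir_db.
    p_speech = np.mean(speech ** 2)
    p_interf = np.mean(interference ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_interf * 10 ** (sir_db / 10)))
    return speech + scale * interference

# Example: corrupt a training utterance at a random SIR from an assumed range.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for 1 s of 16 kHz speech
tv_audio = rng.standard_normal(8000)  # stand-in for TV/movie interference
augmented = mix_at_sir(speech, tv_audio, sir_db=rng.uniform(0, 20))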
Improving noise robustness of automatic speech recognition via parallel data and teacher-student learning
For real-world speech recognition applications, noise robustness remains a
challenge. In this work, we adopt the teacher-student (T/S) learning technique,
using a parallel clean and noisy corpus to improve automatic speech recognition
(ASR) performance under multimedia noise. On top of that, we apply a logits
selection method that preserves only the k highest values, both to prevent
wrong emphasis of the teacher's knowledge and to reduce the bandwidth needed
for transferring data. We incorporate up to 8000 hours of untranscribed data
for training, and we present results for sequence-trained models as well as
cross-entropy trained ones. The best sequence-trained student model yields
relative word error rate (WER) reductions of approximately 10.1%, 28.7%, and
19.6% on our clean, simulated noisy, and real test sets, respectively, compared
to a sequence-trained teacher.
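A minimal PyTorch sketch of such a top-k T/S objective follows. Keeping only the teacher's k highest posteriors and renormalizing them is one plausible reading of the logits selection method; the function name, the value of k, and the class count are illustrative assumptions.

import torch
import torch.nn.functional as F

def topk_ts_loss(student_logits, teacher_logits, k=20):
    """T/S loss using only the teacher's top-k posteriors as soft targets."""
    # Teacher posteriors, restricted to the k most probable classes.
    teacher_post = F.softmax(teacher_logits, dim=-1)
    topk_vals, topk_idx = teacher_post.topk(k, dim=-1)
    topk_vals = topk_vals / topk_vals.sum(dim=-1, keepdim=True)  # renormalize

    # Student log-posteriors gathered at the same class indices.
    student_logp = F.log_softmax(student_logits, dim=-1).gather(-1, topk_idx)

    # Cross entropy between top-k teacher targets and student predictions.
    return -(topk_vals * student_logp).sum(dim=-1).mean()

# Example: teacher sees clean audio, student sees the parallel noisy version.
B, C = 32, 9000                     # batch of frames, class count (assumed)
teacher_logits = torch.randn(B, C)  # from the clean-trained teacher
student_logits = torch.randn(B, C, requires_grad=True)
loss = topk_ts_loss(student_logits, teacher_logits, k=20)
loss.backward()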
Training Keyword Spotters with Limited and Synthesized Speech Data
With the rise of low power speech-enabled devices, there is a growing demand
to quickly produce models for recognizing arbitrary sets of keywords. As with
many machine learning tasks, one of the most challenging parts in the model
creation process is obtaining a sufficient amount of training data. In this
paper, we explore the effectiveness of synthesized speech data in training
small spoken-term detection models of around 400k parameters. Instead of
training such models directly on audio or on low-level features such as MFCCs,
we use a pre-trained speech embedding model that extracts features useful for
keyword spotting. Using this speech embedding, we show that a model detecting
10 keywords, when trained only on synthetic speech, is equivalent to a model
trained on over 500 real examples. We also show that, without our speech
embeddings, a model would need to be trained on over 4000 real examples to
reach the same accuracy.
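A rough sketch of this embedding-based setup: a small keyword head trained on frozen, pre-computed embeddings rather than on raw audio. The embedding dimension, layer sizes, and toy training loop are illustrative assumptions and do not reproduce the paper's ~400k-parameter architecture.

import torch
import torch.nn as nn

EMB_DIM, NUM_KEYWORDS = 96, 10  # assumed embedding width; 10 target keywords

class KeywordHead(nn.Module):
    """Small detector trained on speech embeddings instead of audio/MFCCs."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EMB_DIM, 256), nn.ReLU(),
            nn.Linear(256, NUM_KEYWORDS),
        )

    def forward(self, emb):  # emb: (batch, EMB_DIM)
        return self.net(emb)

# Toy training loop: in practice the embeddings would come from a frozen
# pre-trained model applied to synthesized (TTS) keyword utterances.
head = KeywordHead()
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(10):
    emb = torch.randn(32, EMB_DIM)  # stand-in for frozen embeddings
    labels = torch.randint(0, NUM_KEYWORDS, (32,))
    loss = loss_fn(head(emb), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()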