MP3 Compression To Diminish Adversarial Noise in End-to-End Speech Recognition
Audio Adversarial Examples (AAE) represent specially created inputs meant to
trick Automatic Speech Recognition (ASR) systems into misclassification. The
present work proposes MP3 compression as a means to decrease the impact of
Adversarial Noise (AN) in audio samples transcribed by ASR systems. To this
end, we generated AAEs with the Fast Gradient Sign Method for an end-to-end,
hybrid CTC-attention ASR system. Our method is then validated by two objective
indicators: (1) Character Error Rates (CER) that measure the speech decoding
performance of four ASR models trained on uncompressed, as well as
MP3-compressed data sets and (2) Signal-to-Noise Ratio (SNR) estimated for both
uncompressed and MP3-compressed AAEs that are reconstructed in the time domain
by feature inversion. We found that MP3 compression applied to AAEs indeed
reduces the CER when compared to uncompressed AAEs. Moreover, feature-inverted
(reconstructed) AAEs had significantly higher SNRs after MP3 compression,
indicating that AN was reduced. In contrast to AN, MP3 compression applied to
utterances augmented with regular noise resulted in more transcription errors,
giving further evidence that MP3 encoding is effective in diminishing only AN.
Comment: Submitted and accepted at SPECOM 2020 conference
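The FGSM perturbation and the SNR indicator named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `fgsm_perturb` and `snr_db`, the step size `epsilon`, and the random stand-in gradient are assumptions made for illustration. The sketch also measures noise directly as the difference between two waveforms, whereas the abstract describes estimating SNR on AAEs reconstructed by feature inversion.

```python
import numpy as np


def fgsm_perturb(x, grad, epsilon):
    """One FGSM step: shift every sample by epsilon along the sign of the loss gradient."""
    return x + epsilon * np.sign(grad)


def snr_db(clean, perturbed):
    """Signal-to-noise ratio in dB, treating (perturbed - clean) as the noise."""
    noise = perturbed - clean
    signal_power = np.sum(clean ** 2)
    noise_power = np.sum(noise ** 2)
    return 10.0 * np.log10(signal_power / noise_power)


# Hypothetical data: one second of 16 kHz "audio" and a random stand-in
# for the gradient that an ASR model's loss would provide.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
grad = rng.standard_normal(16000)

adversarial = fgsm_perturb(clean, grad, epsilon=0.01)
print(f"SNR of the perturbed signal: {snr_db(clean, adversarial):.1f} dB")
```

A higher value from `snr_db` means less noise relative to the signal, which is the direction of change the abstract reports for AAEs after MP3 compression.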
Audio Adversarial Examples for Robust Hybrid CTC/Attention Speech Recognition
Recent advances in Automatic Speech Recognition (ASR) demonstrated how
end-to-end systems are able to achieve state-of-the-art performance. There is a
trend towards deeper neural networks; however, those ASR models are also more
complex and more vulnerable to specially crafted noisy data. Those Audio Adversarial
Examples (AAE) were previously demonstrated on ASR systems that use
Connectionist Temporal Classification (CTC), as well as attention-based
encoder-decoder architectures. Following the idea of the hybrid CTC/attention
ASR system, this work proposes algorithms to generate AAEs that combine both
approaches into a joint CTC-attention gradient method. Evaluation is performed
using a hybrid CTC/attention end-to-end ASR model on two reference sentences as
a case study, as well as the TEDlium v2 speech recognition task. We then
demonstrate the application of this algorithm for adversarial training to
obtain a more robust ASR model.
Comment: To be published at SPECOM 202
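One way to picture a joint CTC-attention gradient method is sketched below. The abstract does not say how the two gradients are combined, so the weighted sum (mirroring the weighting commonly used in hybrid CTC/attention training losses), the FGSM-style sign step, and the names `joint_fgsm`, `ctc_weight`, and `epsilon` are all assumptions, not the authors' algorithm. The gradients are passed in as plain arrays instead of being computed from a real ASR model.

```python
import numpy as np


def joint_fgsm(x, grad_ctc, grad_att, ctc_weight=0.5, epsilon=0.01):
    """Perturb x along the sign of a weighted sum of two loss gradients.

    grad_ctc and grad_att stand in for the gradients of the CTC loss and
    the attention-decoder loss with respect to the input x (assumed given).
    """
    joint_grad = ctc_weight * grad_ctc + (1.0 - ctc_weight) * grad_att
    return x + epsilon * np.sign(joint_grad)


# Hypothetical toy input with two samples.
x = np.zeros(2)
grad_ctc = np.array([1.0, -1.0])
grad_att = np.array([-3.0, 0.5])

x_adv = joint_fgsm(x, grad_ctc, grad_att, ctc_weight=0.5, epsilon=0.1)
print(x_adv)
```

Setting `ctc_weight` to 1.0 or 0.0 in this sketch reduces it to a perturbation driven by only the CTC gradient or only the attention gradient, respectively.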