6 research outputs found
Robust Automatic Speech Recognition via WavAugment Guided Phoneme Adversarial Training
Developing a practically robust automatic speech recognition (ASR) model is
challenging, since the model must not only maintain its original performance
on clean samples but also achieve consistent efficacy under small volume
perturbations and large domain shifts. To address this problem, we propose a
novel WavAugment Guided Phoneme Adversarial Training (wapat). wapat uses
adversarial examples in phoneme space as augmentation to make the model
invariant to minor fluctuations in phoneme representation and preserve the
performance on clean samples. In addition, wapat utilizes the phoneme
representation of augmented samples to guide the generation of adversaries,
which helps to find more stable and diverse gradient-directions, resulting in
improved generalization. Extensive experiments demonstrate the effectiveness of
wapat on the End-to-end Speech Challenge Benchmark (ESB). Notably,
SpeechLM-wapat outperforms the original model by a 6.28% WER reduction on ESB,
achieving a new state-of-the-art.
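The core move in wapat, generating adversarial examples in a continuous representation space rather than on the raw waveform, can be sketched with a toy model. The sketch below assumes a binary logistic classifier over a single embedding vector; the function name, model, and epsilon are illustrative stand-ins, not the paper's actual setup.

```python
import numpy as np

def fgsm_perturb_embedding(emb, w, b, label, eps=0.1):
    """FGSM-style adversarial perturbation applied to a continuous
    (e.g. phoneme) embedding rather than to the raw input.

    Toy binary logistic model: p = sigmoid(w @ emb + b), label in {0, 1}.
    For the logistic loss, dL/d_emb = (p - label) * w, so the attack
    shifts the embedding by eps along the sign of that gradient.
    """
    logit = w @ emb + b
    p = 1.0 / (1.0 + np.exp(-logit))
    grad = (p - label) * w          # gradient of the loss w.r.t. the embedding
    return emb + eps * np.sign(grad)
```

Training on such perturbed embeddings (instead of, or alongside, waveform-level noise) is what makes the model invariant to small fluctuations in the representation.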
Revisiting and Exploring Efficient Fast Adversarial Training via LAW: Lipschitz Regularization and Auto Weight Averaging
Fast Adversarial Training (FAT) not only improves the model robustness but
also reduces the training cost of standard adversarial training. However, fast
adversarial training often suffers from Catastrophic Overfitting (CO), which
results in poor robustness performance. Catastrophic Overfitting describes a
sudden and significant drop in robust accuracy during fast adversarial
training. Many effective techniques have been
developed to prevent Catastrophic Overfitting and improve the model robustness
from different perspectives. However, these techniques adopt inconsistent
training settings and require different training costs, i.e., training time and
memory costs, leading to unfair comparisons. In this paper, we conduct a
comprehensive study of over 10 fast adversarial training methods in terms of
adversarial robustness and training costs. We revisit the effectiveness and
efficiency of fast adversarial training techniques in preventing Catastrophic
Overfitting from the perspective of model local nonlinearity and propose an
effective Lipschitz regularization method for fast adversarial training.
Furthermore, we explore the effect of data augmentation and weight averaging in
fast adversarial training and propose a simple yet effective auto weight
averaging method to improve robustness further. By assembling these techniques,
we propose a FGSM-based fast adversarial training method equipped with
Lipschitz regularization and Auto Weight averaging, abbreviated as FGSM-LAW.
Experimental evaluations on four benchmark databases demonstrate the
superiority of the proposed method over state-of-the-art fast adversarial
training methods and advanced standard adversarial training methods.
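As a rough illustration of the two ingredients named in the title, the sketch below runs one FGSM adversarial-training step on a toy logistic model and then applies a fixed-decay exponential moving average over the weights. The fixed decay is a stand-in: the paper's auto weight averaging adjusts the averaging automatically, and its Lipschitz regularizer is omitted here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_at_step_with_ema(w, w_ema, x, y, eps=0.1, lr=0.05, decay=0.9):
    """One FGSM adversarial-training step on a logistic model, followed
    by an exponential-moving-average (EMA) update of the weights.
    """
    # Craft the FGSM example: perturb x along the sign of dL/dx.
    p = sigmoid(w @ x)
    grad_x = (p - y) * w
    x_adv = x + eps * np.sign(grad_x)
    # Gradient step on the adversarial example.
    p_adv = sigmoid(w @ x_adv)
    grad_w = (p_adv - y) * x_adv
    w_new = w - lr * grad_w
    # EMA weight averaging: the averaged weights are what get evaluated.
    w_ema_new = decay * w_ema + (1.0 - decay) * w_new
    return w_new, w_ema_new
```

A single-step (FGSM) inner maximization is what makes FAT cheap relative to multi-step PGD training, and weight averaging is one of the techniques the paper finds effective at stabilizing it.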
Enhance the Visual Representation via Discrete Adversarial Training
Adversarial Training (AT), commonly accepted as one of the most effective
defenses against adversarial examples, can largely harm standard performance,
and thus has limited usefulness in industrial-scale production and
applications. Surprisingly, the opposite holds in Natural Language Processing
(NLP) tasks, where AT can even benefit generalization. We note that the merit
of AT in NLP tasks may derive from the discrete and symbolic input space. To
borrow this advantage from NLP-style AT, we propose Discrete Adversarial
Training (DAT). DAT leverages VQGAN to transform image data into discrete,
text-like inputs, i.e., visual words. Then it
minimizes the maximal risk on such discrete images with symbolic adversarial
perturbations. We further give an explanation from the perspective of
distribution to demonstrate the effectiveness of DAT. As a plug-and-play
technique for enhancing the visual representation, DAT achieves significant
improvement on multiple tasks including image classification, object detection
and self-supervised learning. In particular, the model pre-trained with Masked
Auto-Encoding (MAE) and fine-tuned by our DAT, without extra data, achieves
31.40 mCE on ImageNet-C and 32.77% top-1 accuracy on Stylized-ImageNet,
setting a new state-of-the-art. The code will be available at
https://github.com/alibaba/easyrobust. Comment: Accepted to NeurIPS 2022.
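The discretization step that DAT borrows from VQGAN, mapping continuous patch features to the nearest entries of a learned codebook ("visual words"), can be reduced to a nearest-neighbour lookup. The sketch below assumes a plain numpy codebook; the real VQGAN encoder and codebook training are omitted.

```python
import numpy as np

def quantize_to_visual_words(features, codebook):
    """Map continuous features to discrete "visual words" by nearest-
    neighbour lookup in a codebook -- the discretization role VQGAN
    plays in DAT, reduced to its simplest form.

    features: (n, d) array of patch features.
    codebook: (k, d) array of codebook entries.
    Returns (indices, quantized) with quantized[i] = codebook[indices[i]].
    """
    # Squared distances between every feature and every codebook entry.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]
```

Because the adversary must move inputs between discrete codebook entries rather than along arbitrary continuous directions, the resulting perturbations are symbolic, which is the property DAT exploits.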
Adversarial camouflage: Hiding physical-world attacks with natural styles
Deep neural networks (DNNs) are known to be vulnerable to adversarial
examples. Existing works have mostly focused on either digital adversarial
examples created via small and imperceptible perturbations, or physical-world
adversarial examples created with large and less realistic distortions that are
easily identified by human observers. In this paper, we propose a novel
approach, called Adversarial Camouflage (\emph{AdvCam}), to craft and
camouflage physical-world adversarial examples into natural styles that appear
legitimate to human observers. Specifically, \emph{AdvCam} transfers large
adversarial perturbations into customized styles, which are then "hidden" on
the target object or in the off-target background. Experimental evaluation
shows that,
in both digital and physical-world scenarios, adversarial examples crafted by
\emph{AdvCam} are well camouflaged and highly stealthy, while remaining
effective in fooling state-of-the-art DNN image classifiers. Hence,
\emph{AdvCam} is a flexible approach that can help craft stealthy attacks to
evaluate the robustness of DNNs. \emph{AdvCam} can also be used to protect
private information from being detected by deep learning systems. Comment:
Accepted to CVPR 2020.
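At its core, AdvCam optimizes a perturbation under a combined objective: stay close to a reference style while pushing the classifier toward an attack goal. The sketch below shows one plausible form of such an objective using Gram-matrix style statistics; the loss weight, feature maps, and function names are illustrative stand-ins, not the paper's exact formulation.

```python
import numpy as np

def gram(f):
    """Gram matrix of a (c, n) feature map -- captures style statistics."""
    return f @ f.T / f.shape[1]

def advcam_style_objective(feat_adv, feat_style, logit_target, adv_weight=1e-2):
    """Illustrative combined objective in the spirit of AdvCam: a style
    loss (Gram-matrix distance to a reference style) plus an adversarial
    term that rewards a high target-class logit.  Minimizing this keeps
    the perturbation looking like the chosen style while still attacking.
    """
    style_loss = ((gram(feat_adv) - gram(feat_style)) ** 2).mean()
    adv_loss = -logit_target  # lower objective when the target logit is higher
    return style_loss + adv_weight * adv_loss
```

Optimizing the perturbed image against an objective of this shape is what lets large perturbations pass as deliberate styling rather than visible noise.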
Towards Robust Vision Transformer
Recent advances in Vision Transformers (ViTs) and their improved variants have
shown that self-attention-based networks surpass traditional Convolutional
Neural Networks (CNNs) in most vision tasks. However, existing ViTs focus on
standard accuracy and computation cost, lacking investigation of their
components' intrinsic influence on model robustness and generalization. In this
work, we conduct a systematic evaluation of ViT components in terms of their
impact on robustness to adversarial examples, common corruptions, and
distribution shifts. We find that some components can be harmful to robustness.
By using and combining robust components as building blocks of ViTs, we propose
the Robust Vision Transformer (RVT), a new vision transformer with superior
performance and strong robustness. We further propose two new plug-and-play
techniques called position-aware attention scaling and patch-wise augmentation
to augment our RVT, which we abbreviate as RVT*. The experimental results on
ImageNet and six robustness benchmarks show the advanced robustness and
generalization ability of RVT compared with previous ViTs and state-of-the-art
CNNs. Furthermore, RVT-S* also achieves Top-1 rank on multiple robustness
leaderboards including ImageNet-C and ImageNet-Sketch. The code will be
available at \url{https://github.com/alibaba/easyrobust}. Comment: Accepted to
CVPR 2022.
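One way to read position-aware attention scaling is as a learnable, position-dependent term added to the attention logits before the softmax. The sketch below implements that reading in numpy for a single head; RVT's exact parameterization may differ.

```python
import numpy as np

def position_scaled_attention(q, k, v, pos_scale):
    """Scaled dot-product attention with an extra learnable per-position
    term added to the logits -- a minimal sketch of position-aware
    attention scaling.

    q, k, v: (n, d) query/key/value matrices for one head.
    pos_scale: (n, n) learnable position-dependent logit offsets.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + pos_scale
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With `pos_scale` set to zeros this reduces to standard scaled dot-product attention, so the extra term can only add position-dependent reweighting on top of the content-based scores.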