Achieving Verified Robustness to Symbol Substitutions via Interval Bound Propagation
Neural networks are part of many contemporary NLP systems, yet their
empirical successes come at the price of vulnerability to adversarial attacks.
Previous work has used adversarial training and data augmentation to partially
mitigate such brittleness, but these are unlikely to find worst-case
adversaries due to the complexity of the search space arising from discrete
text perturbations. In this work, we approach the problem from the opposite
direction: to formally verify a system's robustness against a predefined class
of adversarial attacks. We study text classification under synonym replacements
or character flip perturbations. We propose modeling these input perturbations
as a simplex and then using Interval Bound Propagation -- a formal model
verification method. We modify the conventional log-likelihood training
objective to train models that can be efficiently verified, which would
otherwise come with exponential search complexity. The resulting models show
little difference in nominal accuracy, but have much improved verified
accuracy under perturbations and come with an efficiently computable formal
guarantee on worst-case adversaries.
Comment: EMNLP 2019
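The pipeline the abstract describes has two parts: over-approximate the discrete perturbation set in embedding space, then propagate bounds through the network layer by layer. Below is a minimal NumPy sketch of that interval arithmetic; the function names, the axis-aligned box over-approximation of the simplex, and the certification check are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of Interval Bound Propagation (IBP) over an axis-aligned
# box that over-approximates the simplex of perturbed inputs.
# All names and the setup are illustrative, not taken from the paper.
import numpy as np

def simplex_to_box(perturbed_embeddings):
    """Over-approximate the simplex spanned by the original and perturbed
    input embeddings with an axis-aligned box (element-wise min/max)."""
    E = np.stack(perturbed_embeddings)   # shape: (num_perturbations, dim)
    return E.min(axis=0), E.max(axis=0)  # lower, upper bounds

def propagate_linear(lower, upper, W, b):
    """IBP rule for an affine layer y = W x + b: the interval centre maps
    through exactly, and the radius is scaled by |W|."""
    center = (upper + lower) / 2.0
    radius = (upper - lower) / 2.0
    new_center = W @ center + b
    new_radius = np.abs(W) @ radius
    return new_center - new_radius, new_center + new_radius

def propagate_relu(lower, upper):
    """ReLU is monotone, so interval bounds propagate element-wise."""
    return np.maximum(lower, 0.0), np.maximum(upper, 0.0)

def is_verified(logit_lower, logit_upper, true_class):
    """Certified if the worst-case logit of the true class beats the
    best-case logit of every other class."""
    others = np.delete(logit_upper, true_class)
    return logit_lower[true_class] > others.max()
```

After the box has been pushed through every layer with these rules, the prediction is verifiably robust exactly when `is_verified` returns true, since no point inside the perturbation set can then flip the argmax.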
Policy Smoothing for Provably Robust Reinforcement Learning
The study of provable adversarial robustness for deep neural networks (DNNs)
has mainly focused on static supervised learning tasks such as image
classification. However, DNNs have been used extensively in real-world adaptive
tasks such as reinforcement learning (RL), making such systems vulnerable to
adversarial attacks as well. Prior work on provable robustness in RL seeks to
certify the behaviour of the victim policy at every time-step against a
non-adaptive adversary, using methods developed for the static setting. But in
the real world, an RL adversary can infer the defense strategy used by the
victim agent by observing the states, actions, etc. from previous time-steps
and adapt itself to produce stronger attacks in future steps. We present an
efficient procedure, designed specifically to defend against an adaptive RL
adversary, that can directly certify the total reward without requiring the
policy to be robust at each time-step. Our main theoretical contribution is to
prove an adaptive version of the Neyman-Pearson Lemma -- a key lemma for
smoothing-based certificates -- where the adversarial perturbation at a
particular time can be a stochastic function of current and previous
observations and states as well as previous actions. Building on this result,
we propose policy smoothing, where the agent adds Gaussian noise to its
observation at each time-step before passing it through the policy function.
Our robustness certificates guarantee that the final total reward obtained by
policy smoothing remains above a certain threshold, even though the actions at
intermediate time-steps may change under the attack. Our experiments on
environments such as Cartpole, Pong, Freeway, and Mountain Car show that our
method can yield meaningful robustness guarantees in practice.
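As a concrete illustration of the smoothing step itself, the sketch below rolls out one episode while adding isotropic Gaussian noise to each observation before it reaches the policy. It assumes a classic Gym-style reset()/step() interface; `env`, `policy`, and `sigma` are placeholders, and the certificate computation built on the adaptive Neyman-Pearson lemma is deliberately not shown.

```python
# Minimal sketch of the policy-smoothing rollout described above: Gaussian
# noise is added to every observation before it is passed to the policy.
# The Gym-style interface and all names are illustrative assumptions.
import numpy as np

def smoothed_episode_reward(env, policy, sigma, rng=None):
    """Run one episode with observation noise of scale `sigma` and return
    the total reward. `policy` maps a (noisy) observation to an action."""
    rng = rng or np.random.default_rng()
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        noisy_obs = obs + sigma * rng.standard_normal(np.shape(obs))
        action = policy(noisy_obs)
        obs, reward, done, _ = env.step(action)  # classic Gym step signature
        total_reward += reward
    return total_reward
```

Certificates of the form the abstract describes (total reward staying above a threshold under any bounded adversary) are then obtained by running many such smoothed rollouts and applying the adaptive Neyman-Pearson bound; that statistical machinery is beyond this sketch.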