Enhancing robustness in video recognition models: Sparse adversarial attacks and beyond
Recent years have witnessed increasing interest in adversarial attacks on images, while adversarial video attacks have seldom been explored. In this paper, we propose a sparse adversarial attack strategy on videos (DeepSAVA). Our model aims to add a small human-imperceptible perturbation to the key frame of the input video to fool the classifiers. To carry out an effective attack that mirrors real-world scenarios, our algorithm integrates spatial transformation perturbations into the frame. Instead of using an Lp norm to gauge the disparity between the perturbed frame and the original frame, we employ the structural similarity index (SSIM), which has been established as a more suitable metric for quantifying image alterations resulting from spatial perturbations. We employ a unified optimisation framework to combine spatial transformation with additive perturbation, thereby attaining a more potent attack. We design an effective and novel optimisation scheme that alternately uses Bayesian Optimisation (BO) to identify the most critical frame in a video and stochastic gradient descent (SGD) based optimisation to produce both additive and spatially transformed perturbations. Doing so enables DeepSAVA to perform a very sparse attack on videos, maintaining human imperceptibility while still achieving state-of-the-art performance in terms of both attack success rate and adversarial transferability. Furthermore, built upon the strong perturbations produced by DeepSAVA, we design a novel adversarial training framework to improve the robustness of video classification models. Our extensive experiments on various types of deep neural networks and video datasets confirm the superiority of DeepSAVA in terms of attacking performance and efficiency. When compared to the baseline techniques, DeepSAVA exhibits the highest level of performance in generating adversarial videos for three distinct video classifiers.
Remarkably, it achieves an impressive fooling rate ranging from 99.5% to 100% for the I3D model, with the perturbation of just a single frame. Additionally, DeepSAVA demonstrates favorable transferability across various time series models. The proposed adversarial training strategy is also empirically demonstrated to train more robust video classifiers than state-of-the-art adversarial training with a projected gradient descent (PGD) adversary.
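The alternating scheme the abstract describes can be illustrated with a toy sketch. The names and the linear "classifier" below are invented for illustration: a greedy gradient-magnitude score stands in for the paper's Bayesian Optimisation frame selector, and a sign-gradient ascent stands in for the full SGD optimisation over additive and spatial perturbations.

```python
import numpy as np

# Toy sketch of DeepSAVA's two alternating steps (all names hypothetical):
# 1) pick the single most critical frame, 2) optimise a perturbation on it,
# leaving every other frame untouched (the "sparse" part of the attack).

rng = np.random.default_rng(0)
T, D = 8, 16                       # frames, features per frame
video = rng.random((T, D))
w = rng.standard_normal((T, D))    # toy linear logit in place of a classifier

def logit(v):
    return float(np.sum(w * v))

# Step 1: critical-frame selection (greedy proxy for BO): the frame whose
# gradient has the largest norm moves the logit most per unit perturbation.
grads = w                          # d logit / d video is just w here
key = int(np.argmax(np.linalg.norm(grads, axis=1)))

# Step 2: SGD-style ascent on an additive perturbation of the key frame only,
# clipped to a small budget to keep the change imperceptible.
delta = np.zeros_like(video)
lr, eps = 0.1, 0.5
for _ in range(20):
    delta[key] += lr * np.sign(grads[key])        # sign-gradient step
    delta[key] = np.clip(delta[key], -eps, eps)   # per-frame budget

adv = video + delta
# Exactly one frame is perturbed, yet the logit shifts.
assert np.count_nonzero(np.any(delta != 0, axis=1)) == 1
```

A real implementation would additionally parameterise a spatial transformation (e.g. a flow field) and constrain the perturbed frame via SSIM rather than a norm ball, as the abstract notes.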
Efficient Robustness Assessment via Adversarial Spatial-Temporal Focus on Videos
Adversarial robustness assessment for video recognition models has raised concerns owing to their wide application to safety-critical tasks. Compared with images, videos have much higher dimensionality, which brings huge computational costs when generating adversarial videos. This is especially serious for query-based black-box attacks, where gradient estimation for the threat model is usually utilized, and high dimensionality leads to a large number of queries. To mitigate this issue, we propose to simultaneously eliminate the temporal and spatial redundancy within the video to achieve effective and efficient gradient estimation on the reduced search space, thus decreasing the number of queries. To implement this idea, we design the novel Adversarial spatial-temporal Focus (AstFocus) attack on videos, which performs attacks on the simultaneously focused key frames and key regions from the inter-frames and intra-frames of the video. The AstFocus attack is based on a cooperative Multi-Agent Reinforcement Learning (MARL) framework: one agent is responsible for selecting key frames, and another agent is responsible for selecting key regions. These two agents are jointly trained by the common rewards received from the black-box threat model to perform a cooperative prediction. Through continuous querying, the reduced search space composed of key frames and key regions becomes increasingly precise, and the total number of queries falls below that required on the original video. Extensive experiments on four mainstream video recognition models and three widely used action recognition datasets demonstrate that the proposed AstFocus attack outperforms the SOTA methods, being superior in fooling rate, query number, time, and perturbation magnitude at the same time.
Comment: accepted by TPAMI202
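The core idea, restricting black-box gradient estimation to a small key-frame x key-region sub-space, can be sketched in a few lines. The policies below are random stand-ins (the paper trains them with cooperative MARL on shared rewards), and all names and the stand-in loss are hypothetical.

```python
import numpy as np

# Hypothetical sketch of the AstFocus search-space reduction: two "agents"
# pick key frames and a key region, and a zeroth-order gradient estimate is
# computed only over that focused sub-space, so each query perturbs far
# fewer coordinates than the full video would require.

rng = np.random.default_rng(1)
T, H, W = 8, 16, 16
video = rng.random((T, H, W))
target = rng.random((T, H, W))

def black_box_loss(v):             # stand-in for the threat model's loss
    return float(np.mean((v - target) ** 2))

# "Agents" (random policies here): k key frames and one r x r key region.
k, r = 2, 4
frames = rng.choice(T, size=k, replace=False)
y, x = rng.integers(0, H - r), rng.integers(0, W - r)

# Zeroth-order (finite-difference) gradient estimate on the focused mask.
mask = np.zeros_like(video)
mask[frames, y:y + r, x:x + r] = 1.0
sigma, n_queries = 0.01, 32
base = black_box_loss(video)
grad = np.zeros_like(video)
for _ in range(n_queries):
    u = rng.standard_normal(video.shape) * mask   # probe only the sub-space
    grad += (black_box_loss(video + sigma * u) - base) / sigma * u
grad /= n_queries

# One sign-gradient step, confined to the key frames and key region.
adv = np.clip(video + 0.05 * np.sign(grad) * mask, 0.0, 1.0)
```

Because the probe directions are masked, the effective dimension drops from T*H*W to k*r*r here, which is exactly why fewer queries suffice for a usable gradient estimate.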