Investigating the effect of neural network representations on the transferability of adversarial attacks

Abstract

Deep neural networks have been widely applied across many fields, including medicine, security, and autonomous driving. They even surpass human performance on image recognition tasks; however, they have a worrying property. Neural networks are vulnerable to extremely small, human-imperceptible perturbations of their input images that cause them to produce wrong predictions with high confidence. Moreover, adversarial images crafted to fool one model can often fool another model, even one with a different architecture. Many studies have suggested that this transferability of adversarial samples stems from the similar features that different neural networks learn; however, this remains an assumption and a gap in our knowledge of adversarial attacks. Our research attempted to validate this assumption and provide better insight into the field of adversarial attacks. We hypothesize that if a neural network representation in one model is highly correlated with the representations learned by other models, an attack targeting that representation will yield better transferability. We tested this hypothesis through experiments across different network architectures and datasets. The results were consistent with the hypothesis in some settings and inconsistent in others.
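The abstract does not specify how the correlation between representations of different models was quantified; one common choice for this kind of comparison is linear Centered Kernel Alignment (CKA). The sketch below is only an illustration of that idea under this assumption, comparing intermediate-layer activations of two hypothetical models on the same batch of inputs; it is not the authors' method.

```python
import numpy as np

def linear_cka(x, y):
    """Linear Centered Kernel Alignment between two activation matrices
    of shape (n_samples, n_features). Values closer to 1 indicate more
    similar representations."""
    # Center each feature dimension across the batch
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 normalised by ||X^T X||_F * ||Y^T Y||_F
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)

# Hypothetical usage: feats_a and feats_b would be intermediate-layer
# activations of two different models on the same batch of images
# (random data here, purely for illustration).
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(256, 512))   # e.g., a layer of model A
feats_b = rng.normal(size=(256, 1024))  # e.g., a layer of model B
print(f"CKA similarity: {linear_cka(feats_a, feats_b):.3f}")
```

Under the paper's hypothesis, a layer whose activations score high CKA similarity against other models' layers would be a promising target for a feature-level attack, since perturbations disrupting that representation should transfer better.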
