An Information-Theoretic Explanation for the Adversarial Fragility of AI Classifiers
We present a simple hypothesis about a compression property of artificial
intelligence (AI) classifiers and give theoretical arguments showing that
this hypothesis accounts for the observed fragility of AI classifiers to
small adversarial perturbations. We also propose a new method
for detecting when small input perturbations cause classifier errors, and show
theoretical guarantees for the performance of this detection method. We present
experimental results with a voice recognition system to demonstrate this
method. The ideas in this paper are motivated by a simple analogy between AI
classifiers and the standard Shannon model of a communication system.
Comment: 5 pages
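To make the error-detection idea concrete, the following sketch probes a toy
linear classifier with small random perturbations and flags inputs whose
predicted label is unstable. This is an illustrative stability check under
our own assumptions, not the detection method from the paper; the classifier,
the probe count, and the flagging threshold are all invented for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    def predict(x, w):
        """Toy linear classifier: predict 1 iff <w, x> > 0."""
        return int(np.dot(w, x) > 0)

    def flag_unstable(x, w, n_probes=50, eps=0.05, threshold=0.1):
        """Flag an input whose label flips under small random probes.

        An illustrative stability check, not the paper's detector;
        n_probes, eps, and threshold are arbitrary choices for the sketch.
        """
        base = predict(x, w)
        flips = sum(
            predict(x + eps * rng.standard_normal(x.shape), w) != base
            for _ in range(n_probes)
        )
        return flips / n_probes > threshold

    w = np.ones(20) / np.sqrt(20)          # unit-norm weight vector
    clean = 1.0 * w                        # confidently classified input
    near_boundary = 0.01 * w               # input nudged toward the boundary
    print(flag_unstable(clean, w))         # False: label is stable
    print(flag_unstable(near_boundary, w)) # True: label flips under probes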
Derivation of Information-Theoretically Optimal Adversarial Attacks with Applications to Robust Machine Learning
We consider the theoretical problem of designing an optimal adversarial
attack on a decision system that maximally degrades the achievable performance
of the system as measured by the mutual information between the degraded signal
and the label of interest. This problem is motivated by the existence of
adversarial examples for machine learning classifiers. By adopting an
information-theoretic perspective, we seek to identify conditions under which
adversarial vulnerability is unavoidable, i.e., even optimally designed
classifiers will be vulnerable to small adversarial perturbations. We derive
the optimal adversarial attacks for discrete and continuous signals of
interest, i.e., the perturbation distributions that minimize the mutual
information between the degraded signal and a signal following a continuous
or discrete distribution. In addition, we show that mounting adversarial
attacks that minimize mutual information becomes much harder when multiple
redundant copies of the input signal are available. This provides additional
support for the recently proposed "feature compression" hypothesis as an
explanation for the adversarial vulnerability of deep learning classifiers.
We also report on results from computational experiments to
illustrate our theoretical results.
Comment: 16 pages, 5 theorems, 6 figures
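The effect of redundancy on an attacker's ability to suppress mutual
information can be illustrated exactly in a toy setting: a uniform bit
observed through k independent binary symmetric channels with flip
probability p. The sketch below is our own toy model, not one of the
paper's derivations; it computes I(Y; Z_1, ..., Z_k) by enumeration and
shows that, for a fixed perturbation level p, the mutual information grows
with the number of redundant copies k.

    import itertools
    import math

    def entropy(probs):
        """Shannon entropy in bits of a probability vector."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def mi_redundant(p, k):
        """Exact I(Y; Z_1..Z_k) for a uniform bit Y observed through k
        independent binary symmetric channels, each flipping Y with
        probability p. Uses I(Y;Z) = H(Z) - H(Z|Y), where H(Z|Y) = k*H(p)
        because the copies are conditionally independent given Y."""
        pz = {}
        for y in (0, 1):
            for z in itertools.product((0, 1), repeat=k):
                flips = sum(zi != y for zi in z)
                pz[z] = pz.get(z, 0.0) + 0.5 * p**flips * (1 - p)**(k - flips)
        return entropy(pz.values()) - k * entropy([p, 1 - p])

    # For a fixed attack strength p = 0.2, redundancy preserves information:
    for k in (1, 2, 3, 5):
        print(k, round(mi_redundant(0.2, k), 4))   # MI increases with k

In this toy model an attacker who perturbs each copy independently must
corrupt more total signal as k grows, consistent with the redundancy result
summarized in the abstract.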