Deep neural networks are vulnerable to backdoor attacks, where an adversary
maliciously manipulates model behavior by overlaying images with special
triggers. Existing backdoor defense methods often require access to a few
validation samples and to the model parameters, which is impractical in many
real-world applications, e.g., when the model is provided as a cloud service.
In this paper, we address the practical task of blind backdoor defense at test
time, in particular for black-box models. The true label of every test image
needs to be recovered on the fly from the hard-label predictions of a
suspicious model. Heuristic trigger search in image space, however, does not
scale to complex triggers or high image resolutions. We circumvent this
barrier by leveraging generic image generation models, and propose a framework
of Blind Defense with Masked AutoEncoder (BDMAE). It uses the image structural
similarity and label consistency between the test image and its MAE restorations
to detect possible triggers. The detection result is then refined by considering
the topology of the triggers, and a purified test image is obtained from the
restorations for making the final prediction. Our approach is blind to the model
architecture, trigger patterns, and image benignity. Extensive experiments on multiple datasets with
different backdoor attacks validate its effectiveness and generalizability.
Code is available at https://github.com/tsun/BDMAE
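
To make the idea concrete, the following Python sketch illustrates how structural similarity and label consistency between a test image and its MAE restorations could be combined to estimate and remove a trigger. It is a minimal illustration under stated assumptions, not the released implementation: `hard_label` and `mae_restore` are hypothetical placeholders for the black-box model query and a generic masked autoencoder, masking is done per pixel rather than per patch, thresholds are illustrative, and the topology-based refinement step is omitted.

```python
# Minimal sketch of the test-time detection/purification idea (not the authors' code).
# Assumptions (hypothetical placeholders):
#   - hard_label(img): queries the suspicious black-box model, returns an int label
#   - mae_restore(img, mask): fills the masked pixels with a generic masked autoencoder
#   - img is an HxWx3 float array in [0, 1]
import numpy as np
from skimage.metrics import structural_similarity

def detect_and_purify(img, hard_label, mae_restore, n_rounds=4, mask_ratio=0.5, seed=None):
    rng = np.random.default_rng(seed)
    h, w, _ = img.shape
    base_label = hard_label(img)
    sim_score = np.zeros((h, w))   # low similarity to the restoration -> pixel may belong to a trigger
    flip_score = np.zeros((h, w))  # restoring a region flips the hard label -> region is suspicious

    for _ in range(n_rounds):
        # Randomly mask a subset of pixels and let the MAE restore them
        # (a real MAE masks patches; pixel-level masking keeps the sketch simple).
        mask = rng.random((h, w)) < mask_ratio
        restored = mae_restore(img, mask)

        # Image structural similarity between the test image and its restoration.
        ssim_map = structural_similarity(
            img, restored, channel_axis=2, data_range=1.0, full=True
        )[1].mean(axis=2)
        sim_score += (1.0 - ssim_map) * mask

        # Label consistency with the original hard-label prediction.
        if hard_label(restored) != base_label:
            flip_score += mask

    # Fuse both cues into a binary trigger estimate (thresholds are illustrative only).
    trigger = (sim_score / n_rounds > 0.2) & (flip_score / n_rounds > 0.5)

    # Purify: restore only the estimated trigger region, then re-query the model.
    purified = mae_restore(img, trigger)
    return hard_label(purified), trigger
```

In this sketch the two cues play complementary roles: the similarity map localizes pixels that the generic MAE cannot reproduce from their surroundings, while the label-consistency check confirms that restoring those pixels actually changes the suspicious model's prediction, so benign images are left largely untouched.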