Machine learning has demonstrated remarkable performance on finite datasets,
yet whether scores on fixed benchmarks sufficiently indicate a model's
performance in the real world remains an open question. In reality, an
ideally robust model should behave similarly to an oracle (e.g., human
users), so a natural evaluation protocol is to compare a model's behavior
against that of the oracle. In this paper, we introduce a new robustness
measurement that directly compares an image classification model's
performance with that of a surrogate oracle (i.e., a foundation model). In
addition, we design a simple method that carries the evaluation beyond the
scope of fixed benchmarks. Our method extends image datasets with new
samples that are sufficiently perturbed to be distinct from those in the
original sets, yet still bounded within the same image-label structure that
the original test images represent, as constrained by a foundation model
pretrained on a large number of samples. As a result, our method offers a
new way to evaluate models' robustness, free of the limitations of fixed
benchmarks or constrained perturbations, although its scope is bounded by
the power of the oracle. In addition to reporting evaluation results, we
also leverage the generated data to study the behaviors of the models and
of our new evaluation strategies.
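
To make the oracle-constrained evaluation concrete, the following is a
minimal sketch (not the authors' exact pipeline) of the idea described
above: perturb each test image, keep only perturbations that a surrogate
oracle still assigns the original label, and score the model under test on
the retained samples. Here CLIP zero-shot classification stands in for the
foundation-model oracle, and `perturb` is a hypothetical placeholder for
any image perturbation; both are assumptions for illustration only.

    import torch
    from transformers import CLIPModel, CLIPProcessor

    # Surrogate oracle: CLIP used as a zero-shot classifier (an assumption,
    # standing in for whatever foundation model serves as the oracle).
    oracle = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def oracle_label(image, class_names):
        """Zero-shot label assigned by the surrogate oracle."""
        inputs = processor(text=[f"a photo of a {c}" for c in class_names],
                           images=image, return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = oracle(**inputs).logits_per_image  # (1, num_classes)
        return logits.argmax(dim=-1).item()

    def evaluate_robustness(model, dataset, perturb, class_names):
        """Accuracy of `model` on perturbed samples the oracle accepts.

        `dataset` yields (PIL image, integer label) pairs; `perturb` is any
        image -> image transformation (hypothetical here); `model` maps an
        image to a predicted class index.
        """
        kept, correct = 0, 0
        for image, label in dataset:
            perturbed = perturb(image)
            # Keep the sample only if the oracle agrees the label is
            # preserved, i.e., the perturbed image stays within the original
            # image-label structure.
            if oracle_label(perturbed, class_names) != label:
                continue
            kept += 1
            correct += int(model(perturbed) == label)
        return correct / max(kept, 1)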