Deep Neural Networks (DNNs) are becoming a crucial component of modern
software systems, but they are prone to fail under conditions that are
different from the ones observed during training (out-of-distribution inputs)
or on inputs that are truly ambiguous, i.e., inputs that admit multiple classes
with nonzero probability in their labels. Recent work proposed DNN supervisors
to detect high-uncertainty inputs before their possible misclassification leads
to any harm. To test and compare the capabilities of DNN supervisors,
researchers proposed test generation techniques to focus the testing effort on
high-uncertainty inputs that should be recognized as anomalous by supervisors.
However, existing test generators aim to produce out-of-distribution inputs. No
existing model- and supervisor-independent technique targets the generation of
truly ambiguous test inputs, i.e., inputs that admit multiple classes according
to expert human judgment.
In this paper, we propose a novel way to generate ambiguous inputs to test
DNN supervisors and use it to empirically compare several existing supervisor
techniques. In particular, we propose AmbiGuess to generate ambiguous samples
for image classification problems. AmbiGuess is based on gradient-guided
sampling in the latent space of a regularized adversarial autoencoder.
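The abstract only names the approach, so the following is a rough, minimal sketch of what gradient-guided sampling in an autoencoder's latent space could look like, not AmbiGuess's actual implementation. It assumes a hypothetical pre-trained `decoder` (the generative half of the autoencoder) and `classifier`, and optimizes a latent vector so that the decoded image receives near-equal predicted probability for two chosen classes.

```python
import torch
import torch.nn.functional as F

def sample_ambiguous(decoder, classifier, latent_dim, class_a, class_b,
                     steps=200, lr=0.05):
    """Optimize a latent vector so the decoded image is ambiguous
    between class_a and class_b (both predicted near 0.5)."""
    # Start from a random point in the latent space and optimize it directly.
    z = torch.randn(1, latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        image = decoder(z)  # decode the latent point into an image
        probs = F.softmax(classifier(image), dim=1)
        # Ambiguity objective: push both target classes toward probability 0.5.
        loss = (probs[0, class_a] - 0.5) ** 2 + (probs[0, class_b] - 0.5) ** 2
        loss.backward()
        optimizer.step()
    return decoder(z).detach()
```

Optimizing in the latent space rather than in pixel space keeps the generated samples on (or near) the learned data manifold, which is what distinguishes truly ambiguous inputs from mere adversarial noise.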
Moreover, we conducted what is, to the best of our knowledge, the most
extensive comparative study of DNN supervisors, assessing their capability
to detect four distinct types of high-uncertainty inputs, including truly
ambiguous ones. We find that the tested supervisors' capabilities are
complementary: Those best suited to detect true ambiguity perform worse on
invalid, out-of-distribution, and adversarial inputs, and vice versa.

Comment: Accepted for publication at Springer's "Empirical Software Engineering" (EMSE).