
    SINVAD: Search-based Image Space Navigation for DNN Image Classifier Test Input Generation

    The testing of Deep Neural Networks (DNNs) has become increasingly important as DNNs are widely adopted by safety-critical systems. While many test adequacy criteria have been suggested, automated test input generation for many types of DNNs remains a challenge because the raw input space is too large to randomly sample or to navigate and search for plausible inputs. Consequently, current testing techniques for DNNs depend on small local perturbations to existing inputs, based on the metamorphic testing principle. We propose new ways to search not over the entire image space, but rather over a plausible input space that resembles the true training distribution. This space is constructed using Variational Autoencoders (VAEs), and navigated through their latent vector space. We show that this space helps efficiently produce test inputs that can reveal information about the robustness of DNNs when dealing with realistic tests, opening the field to meaningful exploration through the space of highly structured images.
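
    The core idea described above, perturbing a VAE latent vector and decoding it back into image space for the classifier under test, can be sketched roughly as follows. This is a minimal illustration rather than the SINVAD implementation: the vae.encode/vae.decode interface, the plain random-walk search, and all names are assumptions.

```python
# Minimal sketch of search over a VAE latent space for test input generation.
# Not the SINVAD implementation: the vae.encode/vae.decode interface, the plain
# random-walk search, and the single-image batch are illustrative assumptions.
import torch

def latent_space_search(vae, classifier, seed_image, true_label,
                        steps=200, step_size=0.05):
    """Walk through the VAE latent space looking for a decoded image that the
    DNN under test misclassifies (a potential robustness-revealing input)."""
    with torch.no_grad():
        z = vae.encode(seed_image)                           # start from a plausible latent point
        for _ in range(steps):
            candidate = z + step_size * torch.randn_like(z)  # small latent perturbation
            image = vae.decode(candidate)                    # decode back into image space
            pred = classifier(image).argmax(dim=1)
            if pred.item() != true_label:                    # misclassified yet VAE-plausible input
                return image
            z = candidate                                    # continue the walk from here
    return None                                              # no failure-inducing input found
```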

    Input Prioritization for Testing Neural Networks

    Deep neural networks (DNNs) are increasingly being adopted for sensing and control functions in a variety of safety and mission-critical systems such as self-driving cars, autonomous air vehicles, medical diagnostics, and industrial robotics. Failures of such systems can lead to loss of life or property, which necessitates stringent verification and validation for providing high assurance. Though formal verification approaches are being investigated, testing remains the primary technique for assessing the dependability of such systems. Due to the nature of the tasks handled by DNNs, the cost of obtaining test oracle data---the expected output, a.k.a. label, for a given input---is high, which significantly impacts the amount and quality of testing that can be performed. Thus, prioritizing input data for testing DNNs in meaningful ways to reduce the cost of labeling can go a long way in increasing testing efficacy. This paper proposes using gauges of the DNN's sentiment derived from the computation performed by the model as a means to identify inputs that are likely to reveal weaknesses. We empirically assessed the efficacy of three such sentiment measures for prioritization---confidence, uncertainty, and surprise---and compared their effectiveness in terms of their fault-revealing capability and retraining effectiveness. The results indicate that sentiment measures can effectively flag inputs that expose unacceptable DNN behavior. For MNIST models, the average percentage of inputs correctly flagged ranged from 88% to 94.8%.
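
    As a rough illustration of the prioritization idea (not the paper's code), a confidence measure can be computed from the model's softmax outputs and used to rank inputs so that the least confident ones are labeled first; the function name and data below are hypothetical.

```python
# Illustrative sketch (not the paper's code) of prioritizing test inputs by the
# model's confidence: inputs with the lowest top-1 softmax probability are sent
# to human labelers first, since they are more likely to expose faults.
import numpy as np

def prioritize_by_confidence(softmax_outputs):
    """softmax_outputs: (n_inputs, n_classes) array of predicted class probabilities.
    Returns input indices ordered from least to most confident."""
    confidence = softmax_outputs.max(axis=1)   # top-1 probability per input
    return np.argsort(confidence)              # least confident first

probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25],
                  [0.60, 0.30, 0.10]])
print(prioritize_by_confidence(probs))         # -> [1 2 0]
```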

    TEASMA: A Practical Approach for the Test Assessment of Deep Neural Networks using Mutation Analysis

    Successful deployment of Deep Neural Networks (DNNs), particularly in safety-critical systems, requires their validation with an adequate test set to ensure a sufficient degree of confidence in test outcomes. Mutation analysis, one of the main techniques for measuring test adequacy in traditional software, has been adapted to DNNs in recent years. This technique is based on generating mutants that aim to be representative of actual faults and thus can be used for test adequacy assessment. In this paper, we investigate for the first time whether mutation operators that directly modify the trained DNN model (i.e., post-training) can be used for reliably assessing the test inputs of DNNs. We propose and evaluate TEASMA, an approach based on post-training mutation for assessing the adequacy of DNN test sets. In practice, TEASMA allows engineers to decide whether they will be able to trust test results and thus validate the DNN before its deployment. Based on a DNN model's training set, TEASMA provides a methodology to build accurate prediction models of the Fault Detection Rate (FDR) of a test set from its mutation score, thus enabling its assessment. Our large empirical evaluation, across multiple DNN models, shows that predicted FDR values have a strong linear correlation (R² ≥ 0.94) with actual values. Consequently, empirical evidence suggests that TEASMA provides a reliable basis for confidently deciding whether to trust test results or improve the test set.
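
    A hedged sketch of the central TEASMA step, building a prediction model of FDR from mutation scores and using it to assess a new test set, is shown below; the simple linear form and all numbers are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of the TEASMA idea: build a prediction model of a test set's
# Fault Detection Rate (FDR) from its mutation score, derived from the training
# set, then use it to assess new test sets. The linear form and all numbers
# below are illustrative assumptions, not values from the paper.
import numpy as np

mutation_scores = np.array([0.20, 0.35, 0.50, 0.65, 0.80])  # hypothetical sampled test sets
fdr_values      = np.array([0.18, 0.33, 0.52, 0.68, 0.79])  # their measured FDR

slope, intercept = np.polyfit(mutation_scores, fdr_values, 1)  # fit FDR ~ mutation score

def predict_fdr(mutation_score):
    """Predicted FDR for a new test set, given only its mutation score."""
    return slope * mutation_score + intercept

print(f"Predicted FDR at mutation score 0.55: {predict_fdr(0.55):.2f}")
```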

    DeepGauge: Multi-Granularity Testing Criteria for Deep Learning Systems

    Deep learning (DL) defines a new data-driven programming paradigm that constructs the internal system logic of a crafted neural network through a set of training data. We have seen wide adoption of DL in many safety-critical scenarios. However, a plethora of studies have shown that the state-of-the-art DL systems suffer from various vulnerabilities which can lead to severe consequences when applied to real-world applications. Currently, the testing adequacy of a DL system is usually measured by the accuracy of test data. Considering the limitation of accessible high quality test data, good accuracy performance on test data can hardly provide confidence to the testing adequacy and generality of DL systems. Unlike traditional software systems that have clear and controllable logic and functionality, the lack of interpretability in a DL system makes system analysis and defect detection difficult, which could potentially hinder its real-world deployment. In this paper, we propose DeepGauge, a set of multi-granularity testing criteria for DL systems, which aims at rendering a multi-faceted portrayal of the testbed. The in-depth evaluation of our proposed testing criteria is demonstrated on two well-known datasets, five DL systems, and with four state-of-the-art adversarial attack techniques against DL. The potential usefulness of DeepGauge sheds light on the construction of more generic and robust DL systems. Comment: The 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE 2018).
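
    As a simplified illustration of a neuron-level adequacy measure in the spirit of DeepGauge, plain activation-threshold neuron coverage over a test set could be computed as follows; the paper's own criteria (e.g. k-multisection neuron coverage) are finer grained, and the threshold and data here are assumptions.

```python
# Simplified illustration of a neuron-level adequacy measure in the spirit of
# DeepGauge. This is plain activation-threshold neuron coverage; DeepGauge's
# actual criteria (e.g. k-multisection neuron coverage) are finer grained.
# The threshold and the random activations are assumptions.
import numpy as np

def neuron_coverage(activations, threshold=0.5):
    """activations: (n_inputs, n_neurons) matrix of recorded neuron outputs.
    A neuron counts as covered if it exceeds the threshold for at least one input."""
    covered = (activations > threshold).any(axis=0)
    return covered.mean()                      # fraction of neurons covered by the test set

acts = np.random.rand(100, 512)                # hypothetical activations from 100 test inputs
print(f"Neuron coverage: {neuron_coverage(acts):.2%}")
```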

    Guiding the retraining of convolutional neural networks against adversarial inputs

    Background: When using deep learning models, one of the most critical vulnerabilities is their exposure to adversarial inputs, which can cause wrong decisions (e.g., incorrect classification of an image) with minor perturbations. To address this vulnerability, it becomes necessary to retrain the affected model against adversarial inputs as part of the software testing process. In order to make this process energy efficient, data scientists need support on which are the best guidance metrics for reducing the adversarial inputs to create and use during testing, as well as optimal dataset configurations. Aim: We examined six guidance metrics for retraining deep learning models, specifically with convolutional neural network architecture, and three retraining configurations. Our goal is to improve the robustness of convolutional neural networks against adversarial inputs with regard to accuracy, resource utilization, and execution time, from the point of view of a data scientist in the context of image classification. Method: We conducted an empirical study using five datasets for image classification. We explore: (a) the accuracy, resource utilization, and execution time of retraining convolutional neural networks with the guidance of six different guidance metrics (neuron coverage, likelihood-based surprise adequacy, distance-based surprise adequacy, DeepGini, softmax entropy and random), and (b) the accuracy and resource utilization of retraining convolutional neural networks with three different configurations (one-step adversarial retraining, adversarial retraining and adversarial fine-tuning). Results: We reveal that adversarial retraining from original model weights, ordering inputs with uncertainty metrics, gives the best model with respect to accuracy, resource utilization, and execution time. Conclusions: Although more studies are necessary, we recommend data scientists use the above configuration and metrics to deal with the vulnerability to adversarial inputs of deep learning models, as they can improve their models against adversarial inputs without using many inputs and without creating numerous adversarial inputs. We also show that dataset size has an important impact on the results. This work was supported by the GAISSA Spanish research project (ref. TED2021-130923B-I00; MCIN/AEI/10.13039/501100011033), the "UNAM-DGECI: Iniciación a la Investigación (verano otoño 2021)" scholarship provided by Universidad Nacional Autónoma de México (UNAM), the "Beatriz Galindo" Spanish Program BEAGAL18/00064, the Austrian Science Fund (FWF): I 4701-N, and the project Continuous Testing in Production (ConTest) funded by the Austrian Research Promotion Agency (FFG): 888127.
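
    One of the uncertainty metrics named above, DeepGini, ranks inputs by the Gini impurity of their predicted class distribution; a minimal sketch of using it to select adversarial inputs for retraining is given below, with the budget, data, and helper names being illustrative assumptions.

```python
# Minimal sketch of ordering inputs with one of the uncertainty metrics named
# above, DeepGini (Gini impurity of the predicted class distribution), to pick
# which adversarial inputs to use for retraining. The budget, data, and helper
# names are illustrative assumptions.
import numpy as np

def deepgini(softmax_outputs):
    """DeepGini score per input: 1 - sum(p_i^2). Higher means more uncertain."""
    return 1.0 - np.sum(softmax_outputs ** 2, axis=1)

def select_for_retraining(softmax_outputs, budget):
    """Return indices of the `budget` most uncertain inputs."""
    scores = deepgini(softmax_outputs)
    return np.argsort(-scores)[:budget]        # highest uncertainty first

probs = np.array([[0.95, 0.03, 0.02],
                  [0.34, 0.33, 0.33],
                  [0.70, 0.20, 0.10]])
print(select_for_retraining(probs, budget=2))  # -> [1 2]
```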