Testing deep neural networks across different computational configurations

Abstract

Deep Neural Networks (DNNs) typically consist of complex architectures and require enormous processing power. Consequently, developers and researchers use Deep Learning (DL) frameworks (e.g., Keras and PyTorch) to build them, apply compiler optimizations (e.g., constant folding and operator fusion) to improve their inference-time performance, and deploy them on hardware accelerators (e.g., GPUs and TPUs) to parallelize their computations. We concisely refer to these aspects as the computational environment of Deep Neural Networks. However, the extent to which the behavior of a DNN model (i.e., output label inference correctness and computation times) is affected when different configurations are selected across the computational environment is overlooked in the literature. For example, if a DNN model is deployed on two different GPU devices, will it give the same predictions, and how will its computation times deviate across the devices? Given that DNNs are deployed in safety-critical domains (e.g., autonomous driving), it is important to understand the extent to which DNNs are affected by these aspects.

For that purpose, we present DeltaNN, a tool that allows DNN model compilation and deployment under different configurations, as well as comparison of model behavior across them. Using DeltaNN, we conducted a set of experiments on widely used Convolutional Neural Network (CNN) models performing image classification. We built these models using different DL frameworks, converted them across different DL framework configurations, compiled them with a set of optimizations, and deployed them on GPU devices of varying capabilities. Our experiments with different configurations led to two main observations: (1) while DNNs typically generate the same predictions across different GPU devices and compiler optimization settings, this is not true when utilizing different DL frameworks, and especially when converting from one DL framework to another (e.g., from Keras to PyTorch), a common practice among developers to enable model portability and extensibility; and (2) optimizations are not a panacea for inference-time improvement across different devices, as the same optimization strategies that improve execution times on high-end GPUs were found to degrade them when applied to models deployed on low-end GPUs.

To mitigate the faults related to the conversion process, we implemented a framework called FetaFix. FetaFix performs automatic fault detection by comparing a number of aspects between the source and the converted target DNN model, such as model parameters, hyperparameters, and structure. It then applies a number of fault repair strategies related to these aspects and checks how the converted model performs in comparison to its source counterpart. FetaFix was able to repair 93% of the problematic cases identified by DeltaNN.

Finally, we explored the effects of faults present in the target hardware acceleration device code on DNN model correctness. Inspired by traditional mutation testing, we built MutateNN, a tool that generates DNN model mutants containing target device code faults. We then generated a number of faults in the target device code of numerous CNN models performing classification and evaluated how these models behaved across different hardware acceleration devices. We observed that faults related to conditional operations, as well as drastic changes in arithmetic types, considerably affected model correctness.
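As a minimal sketch of the kind of behavioral comparison described above (not DeltaNN's actual implementation; the array shapes and randomly generated outputs are hypothetical stand-ins for real model predictions), the following Python snippet measures how often two deployments of the same classifier, e.g., the source Keras model on one GPU and its PyTorch conversion on another, agree on their top-1 labels over a common set of inputs.

    import numpy as np

    def label_agreement(logits_a: np.ndarray, logits_b: np.ndarray) -> float:
        """Fraction of inputs for which two deployments of the same
        classifier predict the same top-1 label.

        logits_a, logits_b: arrays of shape (num_inputs, num_classes)
        holding the raw outputs of each deployment on identical inputs.
        """
        labels_a = logits_a.argmax(axis=1)
        labels_b = logits_b.argmax(axis=1)
        return float((labels_a == labels_b).mean())

    # Hypothetical outputs collected from two configurations
    # (e.g., source Keras model vs. converted PyTorch model).
    rng = np.random.default_rng(0)
    source_logits = rng.normal(size=(100, 1000))
    target_logits = source_logits + rng.normal(scale=0.01, size=(100, 1000))

    print(f"Top-1 label agreement: {label_agreement(source_logits, target_logits):.2%}")

An agreement rate below 100% on such a check is what flags a configuration pair for closer inspection.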
We conclude that different configurations of computational environment aspects can affect DNN model behavior. Our contributions can be summarized as: (1) an empirical study of how the computational environment affects DNN model behavior, performed with a tool (DeltaNN) implemented specifically for that purpose; (2) a framework (FetaFix) that automatically detects and repairs faults related to model input, structure, and parameters in DNN models converted across DL frameworks; and (3) a utility (MutateNN) that introduces faults in the target code of DNN models associated with deployment on different hardware acceleration devices and evaluates the effects of these faults on model correctness.
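To give a flavour of the parameter-level check that a FetaFix-style fault detection step relies on, the sketch below compares the weight tensors of a source model and its converted counterpart and reports layers whose parameters are missing, reshaped, or numerically divergent. This is a simplification under stated assumptions: the layer names, shapes, and tolerance are hypothetical and do not reflect FetaFix's actual interface.

    import numpy as np

    def diverging_layers(source_params: dict[str, np.ndarray],
                         target_params: dict[str, np.ndarray],
                         atol: float = 1e-5) -> list[str]:
        """Return layer names whose parameters are missing from the
        converted model or differ from the source beyond the tolerance."""
        suspects = []
        for name, src in source_params.items():
            tgt = target_params.get(name)
            if tgt is None or tgt.shape != src.shape:
                suspects.append(name)   # structural mismatch
            elif not np.allclose(src, tgt, atol=atol):
                suspects.append(name)   # parameter drift
        return suspects

    # Hypothetical parameter dictionaries extracted from the source
    # and converted models.
    source = {"conv1/kernel": np.ones((3, 3, 3, 64)),
              "fc/weights": np.ones((512, 1000))}
    converted = {"conv1/kernel": np.ones((3, 3, 3, 64)),
                 "fc/weights": np.ones((512, 1000)) * 1.01}

    print(diverging_layers(source, converted))  # -> ['fc/weights']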

This paper was published in the Edinburgh Research Archive.
