Detect and Repair Errors for DNN-based Software

Abstract

Nowadays, deep neural network (DNN) based software is widely deployed, including in safety-critical areas such as traffic control, medical diagnosis, and malware detection. However, the software engineering techniques needed to guarantee its functionality, safety, and fairness are not well studied. For example, several serious crashes of DNN-based autonomous cars have been reported; these crashes could have been avoided if the underlying software had been well tested. Traditional software testing, debugging, and repair techniques do not work well on DNN-based software because deep neural networks have no control flow, data flow, or AST (Abstract Syntax Tree). Software engineering techniques targeted at DNN-based software are therefore imperative.

In this thesis, we first introduce the development of the SE (Software Engineering) for AI (Artificial Intelligence) area and how our work has influenced the advancement of this new area. We then summarize related work and important concepts in the SE for AI area. Finally, we discuss four of our projects.

Our first project, DeepTest, is one of the first works to propose systematic software testing techniques for DNN-based software. We proposed neuron-coverage-guided image synthesis techniques for DNN-based autonomous cars and leveraged domain-specific metamorphic relations to generate oracles for the newly generated test cases, so that DNN-based software can be tested automatically. We applied DeepTest to three top-performing self-driving car models from the Udacity self-driving car challenge, and our tool identified thousands of erroneous behaviors that may lead to potentially fatal crashes.

In the DeepTest project, we found that natural variations such as spatial transformations or rain/fog effects lead to problematic corner cases for DNN-based self-driving cars. In the follow-up project DeepRobust, we studied the per-point robustness of deep neural networks under natural variation. We found that, for a given DNN model, some specific weak points are more likely than others to cause erroneous outputs under natural variation. We proposed a white-box approach and a black-box approach to identify these weak data points, and implemented and evaluated them on 9 DNN-based image classifiers and 3 DNN-based self-driving car models. Our approaches successfully detect weak points with good precision and recall for both DNN-based image classifiers and self-driving cars.

Most existing work in the SE for AI area, including our DeepTest and DeepRobust, focuses on instance-wise errors: single inputs that cause a DNN model to produce erroneous outputs. In contrast, group-level errors reflect a DNN model's weak performance in differentiating among certain classes or its inconsistent performance across classes. This type of error is very concerning because it has been linked to many notorious real-world failures that involve no malicious attacker. In our third project, DeepInspect, we first introduced group-level errors for DNN-based software and categorized them, based on real-world reports, into confusion errors and bias errors. We then proposed a neuron-coverage-based distance metric to detect group-level errors in DNN-based software without requiring labels. We applied DeepInspect to 8 pretrained DNN models trained on 6 popular image classification datasets, including three adversarially trained models, and showed that DeepInspect can detect group-level violations for both single-label and multi-label classification models with high precision.
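The intuition behind such a metric can be illustrated with a minimal sketch (illustrative only; the function names, the activation threshold, and the use of Euclidean distance are assumptions made here, not the exact DeepInspect formulation): estimate, from the model's own predictions, how often each neuron is activated for inputs assigned to each class, and flag class pairs whose activation profiles are unusually close as candidate confusion errors.

```python
import numpy as np

def class_activation_profiles(activations, predicted_labels, num_classes, threshold=0.5):
    """For each predicted class, estimate the fraction of inputs that activate
    each neuron above `threshold` (no ground-truth labels needed)."""
    profiles = np.zeros((num_classes, activations.shape[1]))
    for c in range(num_classes):
        mask = predicted_labels == c
        if mask.any():
            profiles[c] = (activations[mask] > threshold).mean(axis=0)
    return profiles

def suspicious_class_pairs(profiles, top_k=5):
    """Rank class pairs by the distance between their activation profiles;
    the closest pairs are candidates for confusion errors."""
    num_classes = profiles.shape[0]
    pairs = []
    for i in range(num_classes):
        for j in range(i + 1, num_classes):
            pairs.append(((i, j), float(np.linalg.norm(profiles[i] - profiles[j]))))
    pairs.sort(key=lambda p: p[1])
    return pairs[:top_k]

# Example usage with random data standing in for a hidden layer's activations
# and the model's predicted labels.
activations = np.random.rand(1000, 128)
predicted_labels = np.random.randint(0, 10, size=1000)
profiles = class_activation_profiles(activations, predicted_labels, num_classes=10)
print(suspicious_class_pairs(profiles))
```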
As a follow-up and more challenging research project, we proposed five WR (weighted regularization) techniques to repair group-level errors in DNN-based software. These five techniques act at different stages of DNN retraining or inference: the input phase, the layer phase, the loss phase, and the output phase. We compared and evaluated them on both single-label and multi-label classification, covering five combinations of four DNN architectures and four datasets. We showed that WR can effectively fix confusion and bias errors, and that each method has its own pros, cons, and applicable scenarios.

All four projects discussed in this thesis address important problems in ensuring the functionality, safety, and fairness of DNN-based software, and they have had a significant influence on the advancement of the SE for AI area.
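As a rough illustration of the loss-phase variant mentioned above (a minimal sketch under stated assumptions; the helper name, the fixed weight, and the use of plain cross-entropy are illustrative, not the exact formulation evaluated in the thesis), weighted regularization can up-weight the retraining loss of samples belonging to a detected confused or biased class pair:

```python
import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits, targets, confused_pair=(3, 5), weight=2.0):
    """Loss-phase weighted regularization sketch: samples whose label belongs
    to the detected confused class pair contribute `weight` times as much to
    the retraining loss as other samples."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.ones_like(per_sample)
    for cls in confused_pair:
        weights[targets == cls] = weight
    return (weights * per_sample).mean()

# Example: random logits/labels standing in for one retraining batch.
logits = torch.randn(32, 10, requires_grad=True)
targets = torch.randint(0, 10, (32,))
loss = weighted_cross_entropy(logits, targets)
loss.backward()  # would update the model when logits come from its forward pass
```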
