Testing Feedforward Neural Networks Training Programs
Nowadays, we are witnessing an increasing effort to improve the performance
and trustworthiness of Deep Neural Networks (DNNs), with the aim of enabling
their adoption in safety-critical systems such as self-driving cars. Multiple
testing techniques have been proposed to generate test cases that can expose
inconsistencies in the behavior of DNN models. These techniques implicitly
assume that the training program is bug-free and appropriately configured.
However, satisfying this assumption for a novel problem requires significant
engineering work to prepare the data, design the DNN, implement the training
program, and tune the hyperparameters in order to produce the model in which
current automated test data generators search for corner-case behaviors. All
of these model training steps can be error-prone. Therefore, it is crucial to
detect and correct errors throughout all the engineering steps of DNN-based
software systems, not only in the resulting DNN model. In this paper, we
gather a catalog of training issues and, based on their symptoms and their
effects on the behavior of the training program, we propose practical
verification routines that detect these issues automatically by continuously
validating that important properties of the learning dynamics hold during
training. Then, we design TheDeepChecker, an end-to-end
property-based debugging approach for DNN training programs. We assess the
effectiveness of TheDeepChecker on synthetic and real-world buggy DL programs
and compare it with Amazon SageMaker Debugger (SMD). Results show that
TheDeepChecker's on-execution validation of DNN program properties succeeds
in revealing several coding bugs and system misconfigurations, early on and at
a low cost. Moreover, TheDeepChecker outperforms SMD's offline verification of
rules over training logs in terms of detection accuracy and DL bug coverage.
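To make the property-based idea concrete, here is a minimal sketch of an on-execution check in the spirit described above, written as a Keras callback. The class name, the monitored properties (finite loss, non-stagnant weights), and the threshold are illustrative assumptions, not TheDeepChecker's actual implementation.

```python
import numpy as np
import tensorflow as tf

class LearningDynamicsChecker(tf.keras.callbacks.Callback):
    """Hypothetical on-execution property checks: (1) the loss stays
    finite, and (2) the weights keep being updated across epochs."""

    def __init__(self, min_update_ratio=1e-8):  # assumed threshold
        super().__init__()
        self.min_update_ratio = min_update_ratio
        self._prev_weights = None

    def on_epoch_end(self, epoch, logs=None):
        loss = (logs or {}).get("loss")
        # Property 1: numerical stability of the training loss.
        if loss is None or not np.isfinite(loss):
            raise RuntimeError(f"Epoch {epoch}: loss is NaN/Inf -- likely a bug.")
        # Property 2: parameters are actually changing.
        weights = [w.copy() for w in self.model.get_weights()]
        if self._prev_weights is not None:
            update = sum(np.abs(w - p).sum()
                         for w, p in zip(weights, self._prev_weights))
            scale = sum(np.abs(w).sum() for w in weights) + 1e-12
            if update / scale < self.min_update_ratio:
                print(f"Epoch {epoch}: weights barely changed -- "
                      "check the learning rate or the optimizer wiring.")
        self._prev_weights = weights

# Usage: model.fit(x, y, epochs=10, callbacks=[LearningDynamicsChecker()])
```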
TFCheck: A TensorFlow Library for Detecting Training Issues in Neural Network Programs
The increasing inclusion of Machine Learning (ML) models in safety-critical
systems like autonomous cars has led to the development of multiple
model-based ML testing techniques. One common denominator of these testing
techniques is their assumption that training programs are adequate and
bug-free. These techniques only focus on assessing the performance of the
constructed model using manually labeled data or automatically generated data.
However, this assumption about the training program does not always hold, as
training programs can contain inconsistencies and bugs. In this paper, we
examine training issues in ML programs and propose a catalog of verification
routines that can be used to detect the identified issues automatically. We
implemented the routines in a TensorFlow-based library named TFCheck. Using
TFCheck, practitioners can detect the aforementioned issues automatically. To
assess the effectiveness of TFCheck, we conducted a case study with
real-world, mutant, and synthetic training programs. Results show that TFCheck
can successfully detect training issues in ML code implementations.
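As a flavor of what such a verification routine can look like, the following sketch flags ReLU layers whose units are almost never active on a probe batch, a classic training-issue symptom. The function name and threshold are illustrative and not part of TFCheck's actual API.

```python
import numpy as np
import tensorflow as tf

def check_dead_relus(model, x_batch, dead_threshold=0.95):
    """Illustrative routine: report ReLU layers in which most units never
    fire on a probe batch (a symptom of bad initialization or learning rate)."""
    for layer in model.layers:
        if getattr(layer, "activation", None) is tf.keras.activations.relu:
            probe = tf.keras.Model(model.input, layer.output)
            acts = probe.predict(x_batch, verbose=0)
            dead = np.mean(np.all(acts <= 0.0, axis=0))  # units silent on every sample
            if dead >= dead_threshold:
                print(f"Layer '{layer.name}': {dead:.0%} dead units.")
```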
DeepEvolution: A Search-Based Testing Approach for Deep Neural Networks
The increasing inclusion of Deep Learning (DL) models in safety-critical
systems such as autonomous vehicles has led to the development of multiple
model-based DL testing techniques. One common denominator of these testing
techniques is the automated generation of test cases, e.g., new inputs
transformed from the original training data with the aim of optimizing some
test adequacy criteria. So far, the effectiveness of these approaches has been
hindered by their reliance on random fuzzing or transformations that do not
always produce test cases with good diversity. To overcome these limitations,
we propose DeepEvolution, a novel search-based approach for testing DL models
that relies on metaheuristics to maximize the diversity of generated test
cases. We assessed the effectiveness of DeepEvolution in testing
computer-vision DL models and found that it significantly increases the
neuronal coverage of generated test cases. Moreover, using DeepEvolution, we
successfully found several corner-case behaviors. Finally, DeepEvolution
outperformed TensorFuzz (a coverage-guided fuzzing tool developed at Google
Brain) in detecting latent defects introduced during the quantization of
models. These results suggest that search-based approaches can help build
effective testing tools for DL systems.
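To give a feel for the search-based mechanics, the sketch below runs a toy (1+λ) evolutionary loop that mutates an input to maximize a simple fitness (the model's loss on the true label). The fitness function and Gaussian mutation operator are deliberately simplistic placeholders for DeepEvolution's metaheuristics and diversity objectives.

```python
import numpy as np

def evolve_test_input(model, x, y_true, generations=50, offspring=8, sigma=0.05):
    """Toy search-based test generation: evolve a perturbed copy of `x`
    that maximizes the classifier's loss on the true label `y_true`."""
    rng = np.random.default_rng(0)
    best, best_fit = x.copy(), -np.inf
    for _ in range(generations):
        # Mutation: Gaussian noise, clipped to the valid pixel range [0, 1].
        kids = np.clip(best + rng.normal(0.0, sigma, (offspring,) + x.shape), 0.0, 1.0)
        probs = model.predict(kids, verbose=0)      # shape: (offspring, n_classes)
        fits = -np.log(probs[:, y_true] + 1e-12)    # cross-entropy as fitness
        i = int(np.argmax(fits))
        if fits[i] > best_fit:                      # keep the fittest offspring
            best, best_fit = kids[i], fits[i]
    return best  # candidate corner-case input for the DL model under test
```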
Towards Debugging and Testing Deep Learning Systems
Over the past few years, Deep Learning (DL) has made tremendous progress, achieving or even surpassing human-level performance for different tasks such as image classification and speech recognition. Thanks to these advances, we are witnessing a wide adoption of DL in safety-critical applications such as autonomous driving, crime prevention and detection, and medical treatment. However, despite their spectacular progress, DL systems, just like traditional software systems, often exhibit erroneous corner-case behaviors due to the existence of latent defects or inefficiencies, which can lead to catastrophic accidents. Thus, software quality assurance (SQA), including reliability and robustness, for DL systems becomes a major concern. Traditional testing for DL models consists of measuring their performance on manually collected data; it therefore heavily depends on the quality of the test data, which often fails to include rare inputs, as evidenced by recent autonomous-driving car accidents (e.g., Tesla/Uber). Advanced testing techniques are in high demand to improve the trustworthiness of DL systems. Nevertheless, DL testing poses significant challenges stemming from the non-deterministic nature of DL systems (since they follow a data-driven paradigm, the target task is learned statistically) and from their lack of an oracle (since they are designed principally to provide the answer). Recently, software engineering researchers have started adapting concepts from the software testing domain, such as test coverage and pseudo-oracles, to tackle these difficulties. Despite the promising results obtained from this adaptation of existing software testing methods, the field of DL system testing is still immature and the techniques proposed to date remain of limited effectiveness. In this thesis, we examine the existing solutions proposed for testing DL systems and propose some new techniques. We achieve this goal by following a systematic approach that consists of: (1) investigating DL software issues and testing challenges; (2) outlining the strengths and weaknesses of the software testing techniques adapted for DL systems; and (3) proposing novel testing solutions to fill some of the gaps identified in the literature, and potentially help improve the SQA of DL systems.
Physics-Guided Adversarial Machine Learning for Aircraft Systems Simulation
In the context of aircraft system performance assessment, deep learning
technologies make it possible to quickly infer models from experimental
measurements, with less detailed system knowledge than is usually required by
physics-based modeling. However, this inexpensive model development also comes
with new challenges regarding model trustworthiness. This work presents a
novel approach, physics-guided adversarial machine learning (ML), that
improves confidence in the physics consistency of the model. The approach
first performs a physics-guided adversarial testing phase to search for test
inputs revealing
behavioral system inconsistencies, while still falling within the range of
foreseeable operational conditions. Then, it proceeds with physics-informed
adversarial training to teach the model the system-related physics domain
foreknowledge by iteratively reducing the unwanted output deviations on the
previously uncovered counterexamples. An empirical evaluation on two aircraft
system performance models shows the effectiveness of our adversarial ML
approach in exposing physical inconsistencies in both models and in improving
their consistency with physics domain knowledge.
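The sketch below illustrates the two phases under a toy assumption: the physics foreknowledge is a single monotonicity constraint (the output should not decrease as one input variable grows). The constraint, step sizes, and penalty weight are assumptions of this sketch, not the paper's actual formulation.

```python
import tensorflow as tf

def physics_violation(model, x, dim=0, eps=1e-2):
    """Toy physics check: the output should be non-decreasing along input
    dimension `dim`; positive values indicate a violation."""
    x_plus = x + eps * tf.one_hot(dim, x.shape[-1])[tf.newaxis, :]
    return tf.nn.relu(model(x) - model(x_plus))

def adversarial_test_phase(model, x0, steps=100, lr=0.05, bounds=(0.0, 1.0)):
    """Phase 1: gradient ascent on the violation to uncover counterexamples
    while staying within the foreseeable operating range (`bounds`)."""
    x = tf.Variable(x0)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            v = tf.reduce_sum(physics_violation(model, x))
        g = tape.gradient(v, x)
        if g is None:
            break
        x.assign(tf.clip_by_value(x + lr * tf.sign(g), *bounds))
    return x.numpy()

def physics_informed_loss(model, x_batch, y_batch, x_counter, lam=1.0):
    """Phase 2: regression loss plus a penalty on the violations measured
    at the previously uncovered counterexamples."""
    mse = tf.reduce_mean(tf.square(model(x_batch) - y_batch))
    penalty = tf.reduce_mean(physics_violation(model, x_counter))
    return mse + lam * penalty
```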
Faults in Deep Reinforcement Learning Programs: A Taxonomy and A Detection Approach
Both industry and academia show a growing demand for employing Deep Learning
(DL) in various domains to solve real-world problems. Deep Reinforcement
Learning (DRL) is the application of DL in the domain of Reinforcement
Learning (RL). Like any software system, DRL applications can fail because of
faults in their programs. In this paper, we present the first
attempt to categorize faults occurring in DRL programs. We manually analyzed
761 artifacts of DRL programs (from Stack Overflow posts and GitHub issues)
developed using well-known DRL frameworks (OpenAI Gym, Dopamine, Keras-rl,
Tensorforce) and identified faults reported by developers/users. We labeled and
taxonomized the identified faults through several rounds of discussions. The
resulting taxonomy is validated using an online survey with 19
developers/researchers. To allow for the automatic detection of faults in DRL
programs, we have defined a meta-model of DRL programs and developed DRLinter,
a model-based fault detection approach that leverages static analysis and graph
transformations. The execution flow of DRLinter consists of parsing a DRL
program to generate a model conforming to our meta-model and then applying
detection rules on the model to identify fault occurrences. The effectiveness
of DRLinter is evaluated using 15 synthetic DRL programs in which we injected
faults observed in the analyzed artifacts of the taxonomy. The results show
that DRLinter successfully detects faults in all of the synthetic faulty
programs.
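As a flavor of rule-based fault detection on program models, the sketch below statically checks one illustrative rule on a program's AST: a Gym-style environment that is stepped but never reset between episodes. This single-rule checker is a toy stand-in for DRLinter's meta-model and graph transformations.

```python
import ast

def check_missing_reset(source: str):
    """Toy detection rule: flag programs that call env.step() but never
    env.reset(), a classic DRL training-loop fault."""
    called = set()
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "env"):
            called.add(node.func.attr)
    if "step" in called and "reset" not in called:
        print("Fault: the environment is stepped but never reset between episodes.")

check_missing_reset("""
for episode in range(100):
    done = False
    while not done:
        obs, reward, done, info = env.step(agent.act(obs))
""")
```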
Automatic Fault Detection for Deep Learning Programs Using Graph Transformations
Nowadays, we are witnessing an increasing demand in both industry and
academia for exploiting Deep Learning (DL) to solve complex real-world
problems. A DL program encodes the network structure of a desirable DL model
and the process by which the model learns from the training dataset. Like any
software, a DL program can be faulty, which implies substantial software
quality assurance challenges, especially in safety-critical domains. It is
therefore crucial to equip DL development teams with efficient fault detection
techniques and tools. In this paper, we propose NeuraLint, a model-based fault
detection approach for DL programs, using meta-modelling and graph
transformations. First, we design a meta-model for DL programs that includes
their base skeleton and fundamental properties. Then, we construct a
graph-based verification process that covers 23 rules defined on top of the
meta-model and implemented as graph transformations to detect faults and design
inefficiencies in the generated models (i.e., instances of the meta-model).
The proposed approach is first evaluated by finding faults and design
inefficiencies in 28 synthesized examples built from common problems reported
in the literature. NeuraLint then successfully finds 64 faults and design
inefficiencies in 34 real-world DL programs extracted from Stack Overflow
posts and GitHub repositories. The results show that NeuraLint effectively
detects faults and design issues in both synthesized and real-world examples,
with a recall of 70.5% and a precision of 100%. Although the proposed meta-model is
designed for feedforward neural networks, it can be extended to support other
neural network architectures such as recurrent neural networks. Researchers can
also expand our set of verification rules to cover more types of issues in DL
programs.
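To illustrate what a graph-based verification rule can look like, the sketch below encodes a DL program as a simple list of layer nodes (a stand-in for meta-model instances) and checks one illustrative rule: a probability-based cross-entropy loss requires a final softmax. The rule and the graph encoding are assumptions of this sketch, not NeuraLint's 23-rule implementation.

```python
# Toy model graph: (layer_type, attributes) nodes in forward order,
# standing in for instances of a DL-program meta-model.
model_graph = [
    ("Dense", {"units": 128, "activation": "relu"}),
    ("Dense", {"units": 10, "activation": None}),  # raw logits, no softmax
]
training_config = {"loss": "categorical_crossentropy", "from_logits": False}

def rule_softmax_vs_loss(graph, config):
    """Illustrative rule: probability-based cross-entropy requires either a
    softmax on the last layer or from_logits=True on the loss."""
    _, last_attrs = graph[-1]
    has_softmax = last_attrs.get("activation") == "softmax"
    if (config["loss"] == "categorical_crossentropy"
            and not config["from_logits"] and not has_softmax):
        return ["Fault: cross-entropy applied to raw logits; add a softmax "
                "layer or set from_logits=True."]
    return []

for finding in rule_softmax_vs_loss(model_graph, training_config):
    print(finding)
```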
SmOOD: Smoothness-based Out-of-Distribution Detection Approach for Surrogate Neural Networks in Aircraft Design
The aircraft industry is constantly striving for design optimization methods
that are more efficient in terms of human effort, computation time, and
resource consumption. Hybrid surrogate optimization maintains high-quality
results while providing rapid design assessments, as long as both the
surrogate model and the switch mechanism for eventually transitioning to the
high-fidelity (HF) model are calibrated properly.
Feedforward neural networks (FNNs) can capture highly nonlinear input-output
mappings, yielding efficient surrogates for aircraft performance factors.
However, FNNs often fail to generalize over the out-of-distribution (OOD)
samples, which hinders their adoption in critical aircraft design optimization.
With SmOOD, our smoothness-based out-of-distribution detection approach, we
propose to co-design a model-dependent OOD indicator with the optimized FNN
surrogate, in order to produce a trustworthy surrogate model with selective
but credible predictions. Unlike conventional uncertainty-grounded methods,
SmOOD exploits the inherent smoothness properties of the HF simulations to
effectively expose OODs by revealing their suspicious sensitivities, thereby
avoiding
over-confident uncertainty estimates on OOD samples. By using SmOOD, only
high-risk OOD inputs are forwarded to the HF model for re-evaluation, leading
to more accurate results at a low overhead cost. Three aircraft performance
models are investigated. Results show that FNN-based surrogates outperform
their Gaussian Process counterparts in terms of predictive performance.
Moreover, SmOOD covers, on average, 85% of the actual OODs across all study
cases. When FNN surrogates paired with SmOOD are deployed in hybrid surrogate
optimization settings, they reduce the error rate by 34.65% and speed up
computation by a factor of 58.36.
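A minimal sketch of a smoothness-style OOD indicator, under two simplifying assumptions: the sensitivity signal is the input-gradient norm of the FNN surrogate, and the cutoff is a high percentile of sensitivities on in-distribution training data. Both are illustrative simplifications of SmOOD's actual indicator.

```python
import numpy as np
import tensorflow as tf

def sensitivities(model, x):
    """Per-sample L2 norm of d(output)/d(input), used as a sensitivity signal."""
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = model(x)
    return tf.norm(tape.gradient(y, x), axis=-1).numpy()

def calibrate_threshold(model, x_train, percentile=99.0):
    """Fit the OOD cutoff on in-distribution data (illustrative choice)."""
    return np.percentile(sensitivities(model, x_train), percentile)

def route_queries(model, x_query, threshold):
    """Only suspiciously sensitive inputs are forwarded to the HF model."""
    is_ood = sensitivities(model, x_query) > threshold
    return x_query[is_ood], x_query[~is_ood]  # (re-evaluate with HF, trust surrogate)
```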
DiverGet: A Search-Based Software Testing Approach for Deep Neural Network Quantization Assessment
Quantization is one of the most widely applied Deep Neural Network (DNN)
compression strategies when deploying a trained DNN model on an embedded
system or a cell phone. This is owing to its simplicity and adaptability to a
wide range of applications and circumstances, as opposed to Artificial
Intelligence (AI) accelerators and compilers that are often designed only for
specific hardware (e.g., Google Coral Edge TPU). With the growing demand for
quantization, ensuring the reliability of this strategy is becoming a critical
challenge. Traditional testing methods, which gather more and more genuine data
for better assessment, are often not practical because of the large size of the
input space and the high similarity between the original DNN and its quantized
counterpart. As a result, advanced assessment strategies have become of
paramount importance. In this paper, we present DiverGet, a search-based
testing framework for quantization assessment. DiverGet defines a space of
metamorphic relations that simulate naturally-occurring distortions on the
inputs. Then, it optimally explores these relations to reveal the disagreements
among DNNs of different arithmetic precision. We evaluate the performance of
DiverGet on state-of-the-art DNNs applied to hyperspectral remote sensing
images. We chose remote sensing DNNs as they are increasingly deployed at the
edge (e.g., on high-lift drones) in critical domains like climate change
research and astronomy. Our results show that DiverGet successfully challenges
the robustness of established quantization techniques against naturally
occurring shifted data, and outperforms its most recent competitor,
DiffChaser, with a success rate that is (on average) four times higher.
Accepted for publication in the Empirical Software Engineering Journal (EMSE).
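To illustrate the core disagreement oracle, the sketch below applies one metamorphic transformation (a brightness shift standing in for naturally occurring distortions) and measures label disagreement between a float Keras model and its TFLite-quantized counterpart. The transformation choice and a float32-I/O quantized interpreter are assumptions of this sketch, far simpler than DiverGet's optimized exploration of metamorphic relations.

```python
import numpy as np
import tensorflow as tf

def tflite_predict(interpreter, x):
    """Run a batch through an already-allocated TFLite interpreter
    (assumes float32 I/O, as with dynamic-range quantization)."""
    in_d = interpreter.get_input_details()[0]
    out_d = interpreter.get_output_details()[0]
    preds = []
    for sample in x.astype(np.float32):
        interpreter.set_tensor(in_d["index"], sample[np.newaxis, ...])
        interpreter.invoke()
        preds.append(int(np.argmax(interpreter.get_tensor(out_d["index"]))))
    return np.array(preds)

def quantization_disagreement(float_model, interpreter, x, brightness=0.1):
    """Metamorphic check: apply a small brightness shift, then count how often
    the float and quantized models disagree on the predicted label."""
    x_t = np.clip(x + brightness, 0.0, 1.0)  # metamorphic follow-up inputs
    y_float = np.argmax(float_model.predict(x_t, verbose=0), axis=1)
    y_quant = tflite_predict(interpreter, x_t)
    return float(np.mean(y_float != y_quant))  # disagreement rate
```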