Interoperability in Deep Learning: A User Survey and Failure Analysis of ONNX Model Converters
Software engineers develop, fine-tune, and deploy deep learning (DL) models using a variety of development frameworks and runtime environments. DL model converters move models between frameworks and to runtime environments. Conversion errors compromise model quality and disrupt deployment. However, the failure characteristics of DL model converters are unknown, adding risk when using DL interoperability technologies. This paper analyzes failures in DL model converters. We survey software engineers about DL interoperability tools, use cases, and pain points (N=92). Then, we characterize failures in model converters associated with the main interoperability tool, ONNX (N=200 issues in PyTorch and TensorFlow). Finally, we formulate and test two hypotheses about structural causes for the failures we studied. We find that the node conversion stage of a model converter accounts for ~75% of the defects and that 33% of reported failures are related to semantically incorrect models. The cause of semantically incorrect models is elusive, but models with behaviour inconsistencies share operator sequences. Our results motivate future research on making DL interoperability software simpler to maintain, extend, and validate. Research into behavioural tolerances and architectural coverage metrics could be fruitful.
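As a concrete illustration of the behavioural checks this line of work motivates, the minimal sketch below (not taken from the paper) exports a small PyTorch model to ONNX and compares its outputs against ONNX Runtime on random inputs; the model, file name, and tolerances are illustrative assumptions.

```python
# Minimal sketch: export a toy PyTorch model to ONNX and check behavioural
# consistency against the original. The model, file name, and tolerances are
# assumptions for illustration, not artifacts from the paper.
import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4)).eval()
dummy = torch.randn(1, 8)

# Node conversion happens inside the exporter; this stage accounted for most
# converter defects in the study.
torch.onnx.export(model, dummy, "model.onnx", input_names=["x"], output_names=["y"])

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
for _ in range(10):
    x = torch.randn(1, 8)
    expected = model(x).detach().numpy()
    actual = session.run(None, {"x": x.numpy()})[0]
    # A mismatch here would indicate a semantically incorrect converted model.
    np.testing.assert_allclose(actual, expected, rtol=1e-4, atol=1e-5)
print("PyTorch and ONNX Runtime outputs agree on all sampled inputs.")
```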
DeepLocalize: Fault Localization for Deep Neural Networks
Deep neural networks (DNNs) are becoming an integral part of most software systems. Previous work has shown that DNNs have bugs. Unfortunately, existing debugging techniques do not support localizing DNN bugs because of the lack of understanding of model behaviors; the entire DNN model appears as a black box. To address these problems, we propose an approach that automatically determines whether the model is buggy and identifies the root causes. Our key insight is that historic trends in the values propagated between layers can be analyzed to identify and localize faults. To that end, we first enable dynamic analysis of deep learning applications, either by converting them into an imperative representation or, alternatively, by using a callback mechanism. Both mechanisms allow us to insert probes that enable dynamic analysis over the traces produced by the DNN while it is being trained on the training data. We then conduct dynamic analysis over the traces to identify the faulty layer that causes the error. We propose an algorithm for identifying root causes by capturing numerical errors, monitoring the model during training, and determining the relevance of every layer to the DNN outcome. We have collected a benchmark containing 40 buggy models and patches that contain real errors in deep learning applications from Stack Overflow and GitHub. Our benchmark can be used to evaluate automated debugging tools and repair techniques. We have evaluated our approach using this DNN bug-and-patch benchmark, and the results show that our approach is much more effective than the existing debugging approach used in the state-of-the-practice Keras library. For 34 out of 40 cases, our approach was able to detect faults, whereas the best debugging approach provided by Keras detected 32 out of 40 faults. Our approach was able to localize 21 out of 40 bugs, whereas Keras did not localize any faults.
Comment: Accepted at ICSE 202
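In the spirit of the callback-based probes described above, the following minimal sketch (not the authors' DeepLocalize implementation) uses a standard Keras callback to watch the training loss and layer weights for numerical errors and report the first suspect layer; the class name and reporting logic are assumptions.

```python
# A minimal sketch of a training-time probe, assuming a compiled Keras model.
# It only flags NaN/Inf values; it does not reproduce the paper's full
# root-cause analysis over propagated-value trends.
import numpy as np
import tensorflow as tf

class NumericalProbe(tf.keras.callbacks.Callback):
    """Probe inserted via the Keras callback mechanism: flag numerical errors."""

    def on_train_batch_end(self, batch, logs=None):
        logs = logs or {}
        loss = logs.get("loss")
        if loss is None or np.isfinite(loss):
            return
        print(f"Batch {batch}: loss became {loss}; scanning layers for the fault...")
        for layer in self.model.layers:
            for weights in layer.get_weights():
                if not np.all(np.isfinite(weights)):
                    # Report the first layer whose weights contain NaN/Inf as suspect.
                    print(f"  suspect layer: {layer.name}")
                    self.model.stop_training = True
                    return

# Usage with an existing compiled Keras model (illustrative):
# model.fit(x_train, y_train, epochs=5, callbacks=[NumericalProbe()])
```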
Automatic Fault Detection for Deep Learning Programs Using Graph Transformations
Nowadays, we are witnessing an increasing demand in both corporations and academia for exploiting Deep Learning (DL) to solve complex real-world problems. A DL program encodes the network structure of a desirable DL model and the process by which the model learns from the training dataset. Like any software, a DL program can be faulty, which poses substantial challenges for software quality assurance, especially in safety-critical domains. It is therefore crucial to equip DL development teams with efficient fault detection techniques and tools. In this paper, we propose NeuraLint, a model-based fault detection approach for DL programs that uses meta-modelling and graph transformations. First, we design a meta-model for DL programs that includes their base skeleton and fundamental properties. Then, we construct a graph-based verification process that covers 23 rules defined on top of the meta-model and implemented as graph transformations to detect faults and design inefficiencies in the generated models (i.e., instances of the meta-model). The proposed approach is first evaluated by finding faults and design inefficiencies in 28 synthesized examples built from common problems reported in the literature. NeuraLint then successfully finds 64 faults and design inefficiencies in 34 real-world DL programs extracted from Stack Overflow posts and GitHub repositories. The results show that NeuraLint effectively detects faults and design issues in both synthesized and real-world examples, with a recall of 70.5% and a precision of 100%. Although the proposed meta-model is designed for feedforward neural networks, it can be extended to support other neural network architectures such as recurrent neural networks. Researchers can also expand our set of verification rules to cover more types of issues in DL programs.
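As a rough illustration of rule-based checking over a layer graph, the sketch below is a simplified stand-in for NeuraLint's meta-model and graph transformations: it encodes a toy layer sequence as plain Python data and applies two illustrative rule-like checks. The node attributes and both rules are assumptions for illustration, not NeuraLint's actual rule set.

```python
# Simplified stand-in for rule-based verification of a DL program's layer graph.
# The layer encoding and the two checks below are illustrative assumptions.
LAYERS = [
    {"type": "Conv2D", "activation": "relu"},
    {"type": "MaxPooling2D"},
    {"type": "Dense", "units": 10, "activation": None},  # final classifier layer
]

def check_rules(layers):
    issues = []
    # Rule-like check 1: a convolutional block feeding a Dense layer with no Flatten between.
    for a, b in zip(layers, layers[1:]):
        if a["type"] in {"Conv2D", "MaxPooling2D"} and b["type"] == "Dense":
            issues.append("missing Flatten between convolutional block and Dense layer")
    # Rule-like check 2: the last layer of a classifier lacks a probability activation.
    last = layers[-1]
    if last["type"] == "Dense" and last.get("activation") not in {"softmax", "sigmoid"}:
        issues.append("final Dense layer has no softmax/sigmoid activation")
    return issues

for issue in check_rules(LAYERS):
    print("fault/design issue:", issue)
```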
An Empirical Study of Refactorings and Technical Debt in Machine Learning Systems
Machine Learning (ML) systems, including Deep Learning (DL) systems, i.e., those with ML capabilities, are pervasive in today’s data-driven society. Such systems are complex; they comprise ML models and many subsystems that support learning processes. As with other complex systems, ML systems are prone to classic technical debt issues, especially when such systems are long-lived, but they also exhibit debt specific to these systems. Unfortunately, there is a gap in knowledge about how ML systems actually evolve and are maintained. In this paper, we fill this gap by studying refactorings, i.e., source-to-source semantics-preserving program transformations, performed in real-world, open-source software, and the technical debt issues they alleviate. We analyzed 26 projects, consisting of 4.2 MLOC, along with 327 manually examined code patches. The results indicate that developers refactor these systems for a variety of reasons, both specific and tangential to ML; that some refactorings correspond to established technical debt categories while others do not; and that code duplication is a major cross-cutting theme, particularly involving ML configuration and model code, which was also the most refactored. We also introduce 14 new ML-specific refactorings and 7 new ML-specific technical debt categories, and we put forth several recommendations, best practices, and anti-patterns. The results can potentially assist practitioners, tool developers, and educators in facilitating long-term ML system usefulness.
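To make the duplication finding concrete, the sketch below (not drawn from the studied projects) shows an extract-function refactoring that removes copy-pasted ML configuration code, the kind of ML-specific refactoring the study reports; the helper name and hyperparameters are assumptions.

```python
# Illustrative sketch of removing duplicated ML configuration code via an
# extracted helper. The function name and hyperparameters are assumptions.
import tensorflow as tf

# Before: the same compile configuration copy-pasted across training scripts.
# model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
#               loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# After: the duplicated ML configuration lives in one shared helper.
def compile_classifier(model, learning_rate=1e-3):
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```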