Boosting Operational DNN Testing Efficiency through Conditioning
With the increasing adoption of Deep Neural Network (DNN) models as integral
parts of software systems, efficient operational testing of DNNs is much in
demand to ensure these models' actual performance in field conditions. A
challenge is that the testing often needs to produce precise results with a
very limited budget for labeling the data collected in the field.
Viewing software testing as a practice of reliability estimation through
statistical sampling, we re-interpret the idea behind conventional structural
coverage criteria as conditioning for variance reduction. With this insight, we propose
an efficient DNN testing method based on conditioning on the representation
learned by the DNN model under testing. The representation is defined by the
probability distribution of the outputs of the neurons in the last hidden layer of
the model. To sample from this high-dimensional distribution, in which the
operational data are sparsely distributed, we design an algorithm that leverages
cross-entropy minimization.
Experiments with various DNN models and datasets were conducted to evaluate
the general efficiency of the approach. The results show that, compared with
simple random sampling, this approach requires only about half as many labeled
inputs to achieve the same level of precision.
Comment: Published in the Proceedings of the 27th ACM Joint European Software
Engineering Conference and Symposium on the Foundations of Software
Engineering (ESEC/FSE 2019).
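As a concrete illustration of conditioning for variance reduction, the following Python sketch stratifies unlabeled operational data by clustering last-hidden-layer representations and spends the labeling budget per stratum. This is only a simplified stand-in: the paper's actual sampler is based on cross-entropy minimization, and the function names, budget, and cluster count here are hypothetical.

# Sketch of a conditioning-based (stratified) accuracy estimate over the
# learned representation. Illustrative only: the paper's sampler relies on
# cross-entropy minimization; plain k-means strata are used here instead.
import numpy as np
from sklearn.cluster import KMeans

def stratified_accuracy_estimate(reps, predict_fn, label_fn, budget=180, k=6, seed=0):
    """reps: (N, d) last-hidden-layer outputs of the unlabeled operational data.
    predict_fn(i) -> model's predicted class for input i.
    label_fn(i)   -> ground-truth label for input i (the costly oracle call)."""
    rng = np.random.default_rng(seed)
    strata = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(reps)
    n, estimate = len(reps), 0.0
    for s in range(k):
        idx = np.flatnonzero(strata == s)
        weight = len(idx) / n                              # P(stratum s)
        m = min(len(idx), max(1, round(budget * weight)))  # proportional allocation
        sample = rng.choice(idx, size=m, replace=False)
        acc_s = np.mean([predict_fn(i) == label_fn(i) for i in sample])
        estimate += weight * acc_s                         # law of total expectation
    return estimate

Conditioning helps because accuracy tends to be far more homogeneous within a stratum of similar representations than across the whole input space, so the weighted combination of per-stratum estimates has lower variance than simple random sampling with the same labeling budget.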
Operational Calibration: Debugging Confidence Errors for DNNs in the Field
Trained DNN models are increasingly adopted as integral parts of software
systems, but they often perform poorly in the field. A particularly
damaging problem is that DNN models often give false predictions with high
confidence, due to the unavoidable slight divergences between operation data
and training data. To minimize the loss caused by inaccurate confidence,
operational calibration, i.e., calibrating the confidence function of a DNN
classifier against its operation domain, becomes a necessary debugging step in
the engineering of the whole system.
Operational calibration is difficult given the limited budget for labeling
operation data and the weak interpretability of DNN models. We propose
a Bayesian approach to operational calibration that gradually corrects the
confidence given by the model under calibration, using a small number of labeled
operation data deliberately selected from a larger set of unlabeled operation
data. The approach is made effective and efficient by leveraging the locality
of the learned representation of the DNN model and modeling calibration as
Gaussian Process Regression. Comprehensive experiments with various practical
datasets and DNN models show that it significantly outperformed alternative
methods, and in some difficult tasks it eliminated about 71% to 97% of
high-confidence (>0.9) errors with only about 10% of the minimal amount of
labeled operation data needed for practical learning techniques to barely work.
Comment: Published in the Proceedings of the 28th ACM Joint European Software
Engineering Conference and Symposium on the Foundations of Software
Engineering (ESEC/FSE 2020).
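To make the idea concrete, here is a minimal Python sketch of confidence calibration with Gaussian Process Regression over the learned representation. It is a hypothetical simplification rather than the paper's Bayesian procedure; the residual-regression formulation, kernel choice, and function names are assumptions.

# Illustrative sketch: regress the residual between correctness (0/1) and the
# model's reported confidence on a few labeled operation inputs, then use the
# fitted GP to correct confidences on the remaining unlabeled inputs.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def calibrate_confidence(reps_labeled, conf_labeled, correct_labeled,
                         reps_unlabeled, conf_unlabeled):
    """reps_*: last-hidden-layer representations; conf_*: model confidences;
    correct_labeled: 1 if the prediction on a labeled input was right, else 0."""
    residual = correct_labeled - conf_labeled            # how far off the confidence is
    kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gpr.fit(reps_labeled, residual)
    correction, std = gpr.predict(reps_unlabeled, return_std=True)
    calibrated = np.clip(conf_unlabeled + correction, 0.0, 1.0)
    return calibrated, std    # the predictive std can guide which inputs to label next

The RBF kernel encodes the locality assumption: inputs with nearby representations receive similar confidence corrections, which is why a small, deliberately selected labeled set can already shift confidence in the right direction across the operation domain.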
Test & Evaluation Best Practices for Machine Learning-Enabled Systems
Machine learning (ML)-based software systems are rapidly gaining adoption
across various domains, making it increasingly essential to ensure they perform
as intended. This report presents best practices for the Test and Evaluation
(T&E) of ML-enabled software systems across their lifecycle. We categorize the
lifecycle of ML-enabled software systems into three stages: component,
integration and deployment, and post-deployment. At the component level, the
primary objective is to test and evaluate the ML model as a standalone
component. Next, in the integration and deployment stage, the goal is to
evaluate an integrated ML-enabled system consisting of both ML and non-ML
components. Finally, once the ML-enabled software system is deployed and
operationalized, the T&E objective is to ensure the system performs as
intended. Maintenance activities span the entire lifecycle and involve
maintaining the various assets of ML-enabled software systems.
Given their unique characteristics, the T&E of ML-enabled software systems is
challenging. While significant research has been reported on T&E at the
component level, limited work has been reported on T&E in the remaining two stages.
Furthermore, in many cases, there is a lack of systematic T&E strategies
throughout the ML-enabled system's lifecycle. This leads practitioners to
resort to ad-hoc T&E practices, which can undermine user confidence in the
reliability of ML-enabled software systems. New systematic testing approaches,
adequacy measurements, and metrics are required to address the T&E challenges
across all stages of the ML-enabled system lifecycle.
DeepGD: A Multi-Objective Black-Box Test Selection Approach for Deep Neural Networks
Deep neural networks (DNNs) are widely used in various application domains
such as image processing, speech recognition, and natural language processing.
However, testing DNN models may be challenging due to the complexity and size
of their input domain. In particular, testing DNN models often requires
generating or exploring large unlabeled datasets. In practice, DNN test
oracles, which identify the correct outputs for inputs, often require expensive
manual effort to label test data, possibly involving multiple experts to ensure
labeling correctness. In this paper, we propose DeepGD, a black-box
multi-objective test selection approach for DNN models. It reduces the cost of
labeling by prioritizing the selection of test inputs with high fault revealing
power from large unlabeled datasets. DeepGD not only selects test inputs with
high uncertainty scores to trigger as many mispredicted inputs as possible but
also maximizes the probability of revealing distinct faults in the DNN model by
selecting diverse mispredicted inputs. Experiments conducted on four widely
used datasets and five DNN models show that: (1) white-box, coverage-based
approaches fare poorly in terms of fault-revealing ability, (2) DeepGD
outperforms existing black-box test selection approaches in fault detection,
and (3) DeepGD also provides better guidance for DNN model retraining when the
selected inputs are used to augment the training set.
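The following Python sketch illustrates the kind of uncertainty-plus-diversity trade-off that DeepGD's objectives describe. It is a hypothetical greedy stand-in, not the paper's multi-objective search; the Gini uncertainty score, feature embeddings, and weighting parameter are assumptions made for illustration.

# Greedy illustration of black-box test selection balancing prediction
# uncertainty (Gini score of the softmax output) against diversity
# (distance to already-selected inputs in a feature space).
import numpy as np

def gini_uncertainty(probs):
    """probs: (N, C) softmax outputs; higher value = less confident prediction."""
    return 1.0 - np.sum(probs ** 2, axis=1)

def select_tests(probs, features, budget=100, alpha=0.5):
    """Select `budget` inputs; features: (N, d) embeddings used only to
    measure diversity, keeping the selection black-box w.r.t. the model."""
    unc = gini_uncertainty(probs)
    selected = [int(np.argmax(unc))]                 # seed with the most uncertain input
    remaining = set(range(len(probs))) - set(selected)
    while len(selected) < budget and remaining:
        cand = np.array(sorted(remaining))
        # diversity = distance to the nearest already-selected input
        dists = np.linalg.norm(
            features[cand][:, None, :] - features[selected][None, :, :], axis=2)
        diversity = dists.min(axis=1)
        score = alpha * unc[cand] + (1 - alpha) * diversity / (diversity.max() + 1e-12)
        pick = int(cand[np.argmax(score)])
        selected.append(pick)
        remaining.discard(pick)
    return selected

Selecting for both objectives matters: uncertainty alone tends to pick near-duplicate inputs that expose the same fault repeatedly, while the diversity term spreads the labeling budget across distinct regions of the input space and thus across distinct faults.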