Identifying Implementation Bugs in Machine Learning based Image Classifiers using Metamorphic Testing
We have recently witnessed tremendous success of Machine Learning (ML) in
practical applications. Computer vision, speech recognition, and language
translation have all reached near-human-level performance. We expect that, in
the near future, most business applications will include some form of ML.
However, testing such applications is extremely challenging and would be very
expensive if we followed today's methodologies. In this work, we present an
articulation of the challenges in testing ML-based applications. We then
present our solution approach, based on the concept of Metamorphic Testing,
which aims to identify implementation bugs in ML-based image classifiers. We
have developed metamorphic relations for an application based on Support
Vector Machines and for a Deep Learning based application. Empirical
validation showed that our approach was able to catch 71% of the
implementation bugs in the ML applications.

Comment: Published at the 27th ACM SIGSOFT International Symposium on Software
Testing and Analysis (ISSTA 2018).
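The abstract does not reproduce its metamorphic relations, but the idea can be sketched with one relation commonly used for distance-based classifiers: permuting feature columns consistently in both training and test data must not change predictions. The sketch below is illustrative only; for brevity a 1-nearest-neighbour classifier stands in for the paper's SVM (the relation also holds for an RBF-kernel SVM, since both depend only on Euclidean distances between samples).

```python
# Illustrative metamorphic relation (MR), not taken from the paper:
# permuting feature columns consistently in the training and test data
# must not change the predictions of a classifier that depends only on
# Euclidean distances (true of 1-NN, and of an RBF-kernel SVM).
import numpy as np

def predict_1nn(X_train, y_train, X_test):
    # Label of the nearest training sample by Euclidean distance.
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[np.argmin(d, axis=1)]

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 8))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(20, 8))

perm = rng.permutation(8)  # one fixed column permutation

source = predict_1nn(X_train, y_train, X_test)                    # source test
follow = predict_1nn(X_train[:, perm], y_train, X_test[:, perm])  # follow-up test

# A correct implementation satisfies the MR; a violation signals a bug.
assert np.array_equal(source, follow)
```

No test oracle is needed: the source and follow-up executions check each other, which is what makes MRs usable when the expected label is unknown.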
Fault Detection Effectiveness of Metamorphic Relations Developed for Testing Supervised Classifiers
In machine learning, supervised classifiers are used to obtain predictions
for unlabeled data by inferring prediction functions using labeled data.
Supervised classifiers are widely applied in domains such as computational
biology, computational physics and healthcare to make critical decisions.
However, it is often hard to test supervised classifiers since the expected
answers are unknown. This is commonly known as the \emph{oracle problem} and
metamorphic testing (MT) has been used to test such programs. In MT,
metamorphic relations (MRs) are developed from intrinsic characteristics of the
software under test (SUT). These MRs are used to generate test data and to
verify the correctness of the test results without the presence of a test
oracle. The effectiveness of MT heavily depends on the MRs used for testing.
In this paper, we have conducted an extensive empirical study to evaluate the
fault detection effectiveness of MRs that have been used in multiple previous
studies to test supervised classifiers. Our study uses a total of 709
reachable mutants generated by multiple mutation engines and uses data sets
with varying characteristics to test the SUT. Our results reveal that only
14.8\% of these mutants are detected using the MRs and that the fault
detection effectiveness of these MRs does not scale with the increased number
of mutants when compared to what was reported in previous studies.

Comment: 8 pages, AITesting 201
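The mutant-killing mechanism the study measures can be illustrated with a toy example that is not from the paper: a seeded fault (mutant) in a mean function is "killed" when it violates the metamorphic relation mean(x + c) = mean(x) + c.

```python
# Toy illustration (not from the paper) of how an MR kills a mutant.
def mean(xs):
    return sum(xs) / len(xs)

def mean_mutant(xs):
    # Seeded fault: off-by-one in the divisor.
    return sum(xs) / (len(xs) + 1)

def mr_violated(f, xs, c=5.0, tol=1e-9):
    """True if f breaks the shift relation f(x + c) == f(x) + c."""
    shifted = f([x + c for x in xs])
    return abs(shifted - (f(xs) + c)) > tol

data = [1.0, 2.0, 3.0, 4.0]
assert not mr_violated(mean, data)     # correct version satisfies the MR
assert mr_violated(mean_mutant, data)  # mutant is detected (killed)
```

A mutant that satisfies every MR in the suite survives; the paper's 14.8\% figure is the fraction of reachable mutants killed this way.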
Specifying and Testing k-Safety Properties for Machine-Learning Models
Machine-learning models are becoming increasingly prevalent in our lives, for instance assisting in image-classification or decision-making tasks. Consequently, the reliability of these models is of critical importance and has resulted in the development of numerous approaches for validating and verifying their robustness and fairness. However, beyond such specific properties, it is challenging to specify, let alone check, general functional-correctness expectations from models. In this paper, we take inspiration from specifications used in formal methods, expressing functional-correctness properties by reasoning about different executions, so-called k-safety properties. Considering a credit-screening model of a bank, the expected property that "if a person is denied a loan and their income decreases, they should still be denied the loan" is a 2-safety property. Here, we show the wide applicability of k-safety properties for machine-learning models and present the first specification language for expressing them. We also operationalize the language in a framework for automatically validating such properties using metamorphic testing. Our experiments show that our framework is effective in identifying property violations, and that detected bugs could be used to train better models.
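The quoted loan property reasons about a pair of executions, which is what makes it 2-safety: no single run can violate it. The sketch below checks the property by metamorphic testing over input pairs; the threshold "credit model" is entirely made up for illustration and is not the bank model from the paper.

```python
# Sketch of checking the quoted 2-safety property via metamorphic
# testing: "if a person is denied a loan and their income decreases,
# they should still be denied." The decision rule below is a
# hypothetical stand-in, not the paper's credit-screening model.
def approve_loan(income, debt):
    return income - 0.5 * debt >= 40_000  # made-up threshold rule

def check_2safety(income, debt, decrease):
    """Compare two executions of the model (hence '2-safety')."""
    denied_before = not approve_loan(income, debt)
    approved_after = approve_loan(income - decrease, debt)
    # Violated only if a denied applicant becomes approved after
    # their income decreases.
    return not (denied_before and approved_after)

# The rule is monotone in income, so the property holds on all pairs.
assert all(check_2safety(i, d, 5_000)
           for i in range(20_000, 80_000, 5_000)
           for d in (0, 10_000, 30_000))
```

A learned model with no such monotonicity guarantee can fail this check, and each failing pair is a concrete counterexample of the kind the framework reports.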
DeepSaucer: Unified Environment for Verifying Deep Neural Networks
In recent years, a number of methods for verifying DNNs have been developed.
Because the approaches of the methods differ and have their own limitations, we
think that a number of verification methods should be applied to a developed
DNN. To apply a number of methods to the DNN, it is necessary to translate
either the implementation of the DNN or the verification method so that one
runs in the same environment as the other. Since those translations are
time-consuming, we propose a utility tool, named DeepSaucer, that helps retain
and reuse implementations of DNNs, verification methods, and their
environments. In DeepSaucer, code snippets for loading DNNs, running
verification methods, and creating their environments are retained and reused
as software assets in order to reduce the cost of verifying DNNs. The
feasibility of DeepSaucer is confirmed by implementing it on the basis of
Anaconda, which provides a virtual environment for loading a DNN and running a
verification method. In addition, the effectiveness of DeepSaucer is
demonstrated by use-case examples.
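The asset-reuse idea can be sketched as a registry that pairs retained "load model" snippets with retained "verify" snippets; the class and method names below are hypothetical and do not reflect DeepSaucer's actual interface, which manages Anaconda environments rather than in-process callables.

```python
# Minimal sketch of the asset-reuse idea behind DeepSaucer: retain
# loading and verification snippets as named assets and combine them.
# The registry API is hypothetical, not DeepSaucer's real interface.
from typing import Callable, Dict

class AssetRegistry:
    def __init__(self):
        self.loaders: Dict[str, Callable[[], object]] = {}
        self.verifiers: Dict[str, Callable[[object], bool]] = {}

    def add_loader(self, name, fn):
        self.loaders[name] = fn

    def add_verifier(self, name, fn):
        self.verifiers[name] = fn

    def run(self, loader_name, verifier_name):
        model = self.loaders[loader_name]()          # reuse a load asset
        return self.verifiers[verifier_name](model)  # reuse a verify asset

reg = AssetRegistry()
reg.add_loader("toy_dnn", lambda: {"weights": [0.5, -0.2]})
reg.add_verifier("weights_bounded",
                 lambda m: all(abs(w) <= 1.0 for w in m["weights"]))

assert reg.run("toy_dnn", "weights_bounded")
```

Once both kinds of assets are retained, any verification method can be applied to any DNN without re-translating either side, which is the cost the tool targets.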
Test & Evaluation Best Practices for Machine Learning-Enabled Systems
Machine learning (ML)-based software systems are rapidly gaining adoption
across various domains, making it increasingly essential to ensure they
perform as intended. This report presents best practices for the Test and
Evaluation (T&E) of ML-enabled software systems across their lifecycle. We
categorize the
lifecycle of ML-enabled software systems into three stages: component,
integration and deployment, and post-deployment. At the component level, the
primary objective is to test and evaluate the ML model as a standalone
component. Next, in the integration and deployment stage, the goal is to
evaluate an integrated ML-enabled system consisting of both ML and non-ML
components. Finally, once the ML-enabled software system is deployed and
operationalized, the T&E objective is to ensure the system performs as
intended. Maintenance activities for ML-enabled software systems span the
lifecycle and involve maintaining various assets of ML-enabled software
systems.
Given its unique characteristics, the T&E of ML-enabled software systems is
challenging. While significant research has been reported on T&E at the
component level, limited work is reported on T&E in the remaining two stages.
Furthermore, in many cases, there is a lack of systematic T&E strategies
throughout the ML-enabled system's lifecycle. This leads practitioners to
resort to ad-hoc T&E practices, which can undermine user confidence in the
reliability of ML-enabled software systems. New systematic testing approaches,
adequacy measurements, and metrics are required to address the T&E challenges
across all stages of the ML-enabled system lifecycle.
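At the component stage described above, a typical T&E activity is gating the standalone model on a held-out quality threshold before it enters integration. The sketch below is illustrative only; the toy model, data, and threshold are invented, not taken from the report.

```python
# Illustrative component-stage T&E gate (not from the report): accept a
# standalone ML model only if its held-out accuracy meets a threshold.
def accuracy(predictions, labels):
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def component_gate(model_fn, test_set, threshold=0.9):
    """Return True if the model may proceed to the integration stage."""
    preds = [model_fn(x) for x, _ in test_set]
    labels = [y for _, y in test_set]
    return accuracy(preds, labels) >= threshold

# Toy "model": predicts 1 for positive inputs; the held-out set below
# is classified perfectly, so the gate passes.
held_out = [(-2, 0), (-1, 0), (1, 1), (3, 1)]
assert component_gate(lambda x: int(x > 0), held_out)
```

Integration- and post-deployment-stage checks would exercise the full system and monitor deployed behavior respectively, which is why component-level gates like this one are necessary but not sufficient.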