The Integration of Machine Learning into Automated Test Generation: A Systematic Mapping Study
Context: Machine learning (ML) may enable effective automated test
generation.
Objective: We characterize emerging research, examining testing practices,
researcher goals, ML techniques applied, evaluation, and challenges.
Methods: We perform a systematic mapping on a sample of 102 publications.
Results: ML generates input for system, GUI, unit, performance, and
combinatorial testing or improves the performance of existing generation
methods. ML is also used to generate test verdict, property-based, and
expected-output oracles. Supervised learning (often based on neural
networks) and reinforcement learning (often based on Q-learning) are
common, and some publications also employ unsupervised or semi-supervised
learning.
(Semi-/Un-)Supervised approaches are evaluated using both traditional testing
metrics and ML-related metrics (e.g., accuracy), while reinforcement learning
is often evaluated using testing metrics tied to the reward function.
Conclusion: Work to date shows great promise, but there are open challenges
regarding training data, retraining, scalability, evaluation complexity, the
ML algorithms employed (and how they are applied), benchmarks, and
replicability.
Our findings can serve as a roadmap and inspiration for researchers in this
field.
Comment: Under submission to Software Testing, Verification, and Reliability
journal. (arXiv admin note: text overlap with arXiv:2107.00906, an earlier
study that this study extends.)
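The mapping study notes that reinforcement learning for test generation is often based on Q-learning. A minimal sketch of that idea, using a hypothetical toy system under test (the states, actions, and coverage-based reward below are our illustration, not any surveyed tool's implementation):

```python
import random

# Tabular Q-learning that learns to pick input actions maximizing a
# coverage-based reward. The "system under test" is a hypothetical
# 4-state toy whose branches 0..3 we want to cover.
random.seed(0)

N_STATES, N_ACTIONS = 4, 3
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # learning rate, discount, exploration

def step(state, action):
    """Hypothetical system under test: an action advances the state and
    covers one branch; the branch id equals the resulting state."""
    next_state = (state + action) % N_STATES
    return next_state, next_state  # (next state, branch covered)

covered = set()
for episode in range(200):
    state = 0
    for _ in range(10):
        # Epsilon-greedy action selection
        if random.random() < EPSILON:
            action = random.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
        next_state, branch = step(state, action)
        reward = 1.0 if branch not in covered else 0.0  # reward new coverage
        covered.add(branch)
        # Standard Q-learning update rule
        Q[state][action] += ALPHA * (
            reward + GAMMA * max(Q[next_state]) - Q[state][action])
        state = next_state

print(f"branches covered: {sorted(covered)}")
```

Rewarding only previously uncovered branches pushes the agent away from inputs it has already exercised, which is the usual coupling between the reward function and the testing metric that the study observes in RL-based evaluations.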
Fairness Testing: A Comprehensive Survey and Analysis of Trends
Unfair behaviors of Machine Learning (ML) software have garnered increasing
attention and concern among software engineers. To tackle this issue, extensive
research has been dedicated to conducting fairness testing of ML software, and
this paper offers a comprehensive survey of existing studies in this field. We
collect 100 papers and organize them based on the testing workflow (i.e., how
to test) and testing components (i.e., what to test). Furthermore, we analyze
the research focus, trends, and promising directions in the realm of fairness
testing. We also identify widely-adopted datasets and open-source tools for
fairness testing.
Input Prioritization for Testing Neural Networks
Deep neural networks (DNNs) are increasingly being adopted for sensing and
control functions in a variety of safety and mission-critical systems such as
self-driving cars, autonomous air vehicles, medical diagnostics, and industrial
robotics. Failures of such systems can lead to loss of life or property, which
necessitates stringent verification and validation for providing high
assurance. Though formal verification approaches are being investigated,
testing remains the primary technique for assessing the dependability of such
systems. Due to the nature of the tasks handled by DNNs, the cost of obtaining
test oracle data---the expected output, a.k.a. label, for a given input---is
high, which significantly impacts the amount and quality of testing that can be
performed. Thus, prioritizing input data for testing DNNs in meaningful ways to
reduce the cost of labeling can go a long way in increasing testing efficacy.
This paper proposes using gauges of the DNN's sentiment derived from the
computation performed by the model, as a means to identify inputs that are
likely to reveal weaknesses. We empirically assessed three such sentiment
measures for prioritization---confidence, uncertainty, and
surprise---comparing them in terms of fault-revealing capability and
retraining effectiveness. The results indicate that sentiment measures can
effectively flag inputs that expose unacceptable DNN behavior. For MNIST
models, the average percentage of inputs correctly flagged ranged from 88%
to 94.8%.
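The confidence measure can be sketched as follows. This is our minimal illustration, not the paper's code: `logits_of` is a hypothetical stand-in for the DNN's forward pass, and the idea is simply to label low-confidence inputs first.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confidence(logits):
    """Top-1 softmax probability: high means the DNN is 'sure'."""
    return max(softmax(logits))

def prioritize_by_confidence(inputs, logits_of):
    """Order inputs from least to most confident prediction, so the
    cheapest labeling budget goes to likely fault-revealing inputs."""
    return sorted(inputs, key=lambda x: confidence(logits_of(x)))

# Usage with a fake 3-class model: input "b" gets near-uniform logits,
# so it is flagged for labeling before the clear-cut "a" and "c".
fake_logits = {"a": [9.0, 0.1, 0.2], "b": [1.0, 1.1, 0.9], "c": [0.0, 8.0, 0.5]}
order = prioritize_by_confidence(["a", "b", "c"], fake_logits.get)
print(order)  # "b" first
```

Uncertainty- and surprise-based prioritization follow the same pattern with a different scoring function in place of `confidence`.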
DeepEvolution: A Search-Based Testing Approach for Deep Neural Networks
The increasing inclusion of Deep Learning (DL) models in safety-critical
systems such as autonomous vehicles has led to the development of multiple
model-based DL testing techniques. One common denominator of these testing
techniques is the automated generation of test cases, e.g., new inputs
transformed from the original training data with the aim to optimize some test
adequacy criteria. So far, the effectiveness of these approaches has been
hindered by their reliance on random fuzzing or transformations that do not
always produce test cases with good diversity. To overcome these limitations,
we propose DeepEvolution, a novel search-based approach for testing DL models
that relies on metaheuristics to ensure maximum diversity in generated test
cases. We assessed the effectiveness of DeepEvolution in testing computer-vision
DL models and found that it significantly increases the neuronal coverage of
generated test cases. Moreover, using DeepEvolution, we could successfully find
several corner-case behaviors. Finally, DeepEvolution outperformed Tensorfuzz
(a coverage-guided fuzzing tool developed at Google Brain) in detecting latent
defects introduced during the quantization of the models. These results suggest
that search-based approaches can help build effective testing tools for DL
systems.
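The core search loop of such approaches can be sketched as a simple (1+λ) evolutionary search that mutates an input to maximize a coverage-style fitness. This is a generic sketch in the spirit of search-based DL testing, under toy assumptions (the `fitness` surrogate and Gaussian `mutate` below are our stand-ins, not DeepEvolution's actual pipeline):

```python
import random

random.seed(1)

def fitness(x):
    """Toy surrogate for neuronal coverage: fraction of 'neurons'
    (dimensions) activated above a threshold."""
    return sum(1 for v in x if v > 0.5) / len(x)

def mutate(x, sigma=0.3):
    """Gaussian perturbation of one candidate input, clipped to [0, 1]."""
    return [min(1.0, max(0.0, v + random.gauss(0.0, sigma))) for v in x]

def evolve(seed_input, generations=100, offspring=8):
    """(1+lambda) search: keep the single best candidate, spawn
    `offspring` mutants per generation, accept only improvements."""
    best, best_fit = seed_input, fitness(seed_input)
    for _ in range(generations):
        for _ in range(offspring):
            child = mutate(best)
            f = fitness(child)
            if f > best_fit:  # keep the child only if it improves coverage
                best, best_fit = child, f
    return best, best_fit

best, cov = evolve([0.1] * 8)
print(f"coverage reached: {cov:.2f}")
```

Real metaheuristic testing tools replace the surrogate with measured neuron activations of the model under test and use domain-preserving transformations (e.g., image rotations) instead of raw Gaussian noise, so that generated inputs remain valid.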