A Comparison of Reinforcement Learning Frameworks for Software Testing Tasks
Software testing activities scrutinize the artifacts and the behavior of a
software product to find possible defects and ensure that the product meets its
expected requirements. Recently, Deep Reinforcement Learning (DRL) has been
successfully employed in complex testing tasks such as game testing, regression
testing, and test case prioritization to automate the process and provide
continuous adaptation. Practitioners can employ DRL either by implementing a
DRL algorithm from scratch or by using a DRL framework. DRL frameworks offer
well-maintained implementations of state-of-the-art DRL algorithms to facilitate and
speed up the development of DRL applications. Developers have widely used these
frameworks to solve problems in various domains including software testing.
However, to the best of our knowledge, there is no study that empirically
evaluates the effectiveness and performance of implemented algorithms in DRL
frameworks. Moreover, the literature lacks guidelines that would help
practitioners choose one DRL framework over another. In this paper,
we empirically investigate the applications of carefully selected DRL
algorithms on two important software testing tasks: test case prioritization in
the context of Continuous Integration (CI) and game testing. For the game
testing task, we conduct experiments on a simple game and use DRL algorithms to
explore the game to detect bugs. Results show that some of the selected DRL
frameworks such as Tensorforce outperform recent approaches in the literature.
To prioritize test cases, we run experiments on a CI environment where DRL
algorithms from different frameworks are used to rank the test cases. Our
results show that the performance difference between implemented algorithms in
some cases is considerable, motivating further investigation.
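The prioritization task above can be framed as a sequential decision problem. As a rough illustration only, the sketch below defines a toy Gymnasium environment in which an agent selects the next test case to run and is rewarded for scheduling failing tests early; the environment name, feature layout, and reward shaping are hypothetical and are not taken from the paper, and any DRL framework's algorithm implementation (e.g., from Stable-Baselines3 or Tensorforce) could be plugged in on top of it.

    # Minimal sketch (not the paper's implementation): an agent picks the next test
    # case to run and is rewarded when failing tests are scheduled early. Failure
    # probabilities and features below are hypothetical placeholders.
    import numpy as np
    import gymnasium as gym
    from gymnasium import spaces

    class TestPrioritizationEnv(gym.Env):
        def __init__(self, n_tests=10, seed=0):
            super().__init__()
            self.n_tests = n_tests
            self.rng = np.random.default_rng(seed)
            # Observation: per-test features (recent failure rate, duration, scheduled flag).
            self.observation_space = spaces.Box(0.0, 1.0, shape=(n_tests, 3), dtype=np.float32)
            # Action: index of the next test case to schedule.
            self.action_space = spaces.Discrete(n_tests)

        def reset(self, *, seed=None, options=None):
            super().reset(seed=seed)
            self.features = self.rng.random((self.n_tests, 3)).astype(np.float32)
            self.features[:, 2] = 0.0                                # nothing scheduled yet
            self.fails = self.rng.random(self.n_tests) < self.features[:, 0]  # hidden ground truth
            self.position = 0
            return self.features.copy(), {}

        def step(self, action):
            already_scheduled = self.features[action, 2] > 0
            self.features[action, 2] = 1.0
            self.position += 1
            # Reward failing tests scheduled early; penalize re-selecting a test.
            if already_scheduled:
                reward = -1.0
            else:
                reward = float(self.fails[action]) * (self.n_tests - self.position + 1) / self.n_tests
            terminated = self.position >= self.n_tests
            return self.features.copy(), reward, terminated, False, {}

    # A framework-provided DRL algorithm would be trained on this environment;
    # a random policy is shown here only to exercise the interface.
    env = TestPrioritizationEnv()
    obs, _ = env.reset()
    done = False
    while not done:
        obs, reward, done, _, _ = env.step(env.action_space.sample())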
Faults in Deep Reinforcement Learning Programs: A Taxonomy and A Detection Approach
A growing demand is witnessed in both industry and academia for employing
Deep Learning (DL) in various domains to solve real-world problems. Deep
Reinforcement Learning (DRL) is the application of DL in the domain of
Reinforcement Learning (RL). Like any software system, DRL applications can
fail because of faults in their programs. In this paper, we present the first
attempt to categorize faults occurring in DRL programs. We manually analyzed
761 artifacts of DRL programs (from Stack Overflow posts and GitHub issues)
developed using well-known DRL frameworks (OpenAI Gym, Dopamine, Keras-rl,
Tensorforce) and identified faults reported by developers/users. We labeled and
taxonomized the identified faults through several rounds of discussions. The
resulting taxonomy is validated using an online survey with 19
developers/researchers. To allow for the automatic detection of faults in DRL
programs, we have defined a meta-model of DRL programs and developed DRLinter,
a model-based fault detection approach that leverages static analysis and graph
transformations. The execution flow of DRLinter consists of parsing a DRL
program to generate a model conforming to our meta-model and applying
detection rules on the model to identify fault occurrences. The effectiveness
of DRLinter is evaluated using 15 synthetic DRL programs in which we injected
faults observed in the analyzed artifacts of the taxonomy. The results show
that DRLinter can successfully detect faults in all synthetic faulty programs.
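To make the detection idea concrete, here is a deliberately simplified sketch. DRLinter itself parses programs into models conforming to a meta-model and applies graph-transformation rules; the toy checker below instead uses Python's ast module and encodes a single hypothetical rule (an environment that is stepped but never reset), purely to illustrate what a static detection rule over a DRL program can look like.

    # Illustrative only: a toy static check over a DRL program, not DRLinter's
    # meta-model or rule set. It flags programs that call env.step(...) but never
    # call env.reset(...), a hypothetical fault pattern.
    import ast

    RULE = "environment is stepped but never reset"

    def check_missing_reset(source: str) -> list[str]:
        tree = ast.parse(source)
        called = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
                called.add(node.func.attr)          # record method names that are called
        findings = []
        if "step" in called and "reset" not in called:
            findings.append(RULE)
        return findings

    faulty_program = """
    import gymnasium as gym
    env = gym.make("CartPole-v1")
    for _ in range(100):
        obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    """
    print(check_missing_reset(faulty_program))      # ['environment is stepped but never reset']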
Automatic Fault Detection for Deep Learning Programs Using Graph Transformations
Nowadays, we are witnessing an increasing demand in both industry and
academia for exploiting Deep Learning (DL) to solve complex real-world
problems. A DL program encodes the network structure of a desired DL model
and the process by which the model learns from the training dataset. Like any
software, a DL program can be faulty, which poses substantial challenges for
software quality assurance, especially in safety-critical domains. It is
therefore crucial to equip DL development teams with efficient fault detection
techniques and tools. In this paper, we propose NeuraLint, a model-based fault
detection approach for DL programs, using meta-modelling and graph
transformations. First, we design a meta-model for DL programs that includes
their base skeleton and fundamental properties. Then, we construct a
graph-based verification process that covers 23 rules defined on top of the
meta-model and implemented as graph transformations to detect faults and design
inefficiencies in the generated models (i.e., instances of the meta-model).
The proposed approach is first evaluated by finding faults and design
inefficiencies in 28 synthesized examples built from common problems reported
in the literature. Then NeuraLint successfully finds 64 faults and design
inefficiencies in 34 real-world DL programs extracted from Stack Overflow posts
and GitHub repositories. The results show that NeuraLint effectively detects
faults and design issues in both synthesized and real-world examples with a
recall of 70.5% and a precision of 100%. Although the proposed meta-model is
designed for feedforward neural networks, it can be extended to support other
neural network architectures such as recurrent neural networks. Researchers can
also expand our set of verification rules to cover more types of issues in DL
programs.
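The verification idea can be sketched with a toy example. The following is not NeuraLint's meta-model or one of its 23 rules: it hand-writes a tiny layer graph and checks one hypothetical rule, a classifier whose output layer lacks a softmax activation, just to show what a graph-based check over a DL program model can look like.

    # Toy sketch of graph-based verification, not NeuraLint itself. A DL program is
    # modelled as a list of layer nodes linked by "next"; one hypothetical rule flags
    # a classifier output layer with no softmax activation.
    layers = [
        {"id": 0, "type": "Dense", "units": 128, "activation": "relu", "next": 1},
        {"id": 1, "type": "Dense", "units": 10, "activation": None, "next": None},  # output layer
    ]

    def check_output_activation(graph):
        findings = []
        for layer in graph:
            is_output = layer["next"] is None
            if is_output and layer["type"] == "Dense" and layer["activation"] != "softmax":
                findings.append(f"layer {layer['id']}: classifier output layer without softmax")
        return findings

    print(check_output_activation(layers))
    # ['layer 1: classifier output layer without softmax']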
Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing
One of the critical phases in software development is software testing.
Testing helps with identifying potential bugs and reducing maintenance costs.
The goal of automated test generation tools is to ease the development of tests
by suggesting efficient bug-revealing tests. Recently, researchers have
leveraged Large Language Models (LLMs) of code to generate unit tests. While
the code coverage of generated tests was usually assessed, the literature has
acknowledged that the coverage is weakly correlated with the efficiency of
tests in bug detection. To address this limitation, in this paper we
introduce MuTAP, which improves the bug-revealing effectiveness of test cases
generated by LLMs by leveraging mutation testing. Our goal is achieved
by augmenting prompts with surviving mutants, as those mutants highlight the
limitations of test cases in detecting bugs. MuTAP is capable of generating
effective test cases in the absence of natural language descriptions of the
Programs Under Test (PUTs). We employ different LLMs within MuTAP and evaluate
their performance on different benchmarks. Our results show that our proposed
method is able to detect up to 28% more faulty human-written code snippets.
Among these, 17% remained undetected by both the current state-of-the-art fully
automated test generation tool (i.e., Pynguin) and zero-shot/few-shot learning
approaches on LLMs. Furthermore, MuTAP achieves a Mutation Score (MS) of 93.57%
on synthetic buggy code, outperforming all other approaches in our evaluation.
Our findings suggest that although LLMs can serve as a useful tool to generate
test cases, they require specific post-processing steps to enhance the
effectiveness of the generated test cases, which may suffer from syntactic or
functional errors and may be ineffective in detecting certain types of bugs
and in testing corner cases of PUTs.
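The core loop of augmenting prompts with surviving mutants can be illustrated with a toy example. The sketch below is not MuTAP itself: the mutation operator, the seed test, and the prompt text are made up, and the LLM call is replaced by a hand-written test. It only shows how surviving mutants can be identified, how a mutation score is computed as the fraction of killed mutants, and how survivors can be folded back into a prompt.

    # Rough sketch of mutation-testing feedback, with the LLM call stubbed out.
    # MuTAP's actual prompts, mutation operators, and workflow differ; everything
    # named here (mutate, initial_test) is hypothetical.
    put_source = "def max_of_two(a, b):\n    return a if a >= b else b\n"

    def mutate(source):
        """Yield simple mutants by swapping one comparison operator (illustrative only)."""
        for op, repl in [(">=", "<"), (">=", ">")]:
            if op in source:
                yield source.replace(op, repl, 1)

    def run_tests(source, test_fn):
        namespace = {}
        exec(source, namespace)          # load the (possibly mutated) PUT
        try:
            test_fn(namespace["max_of_two"])
            return True                  # tests pass -> mutant survives
        except AssertionError:
            return False                 # tests fail -> mutant is killed

    def initial_test(max_of_two):        # stand-in for an LLM-generated test
        assert max_of_two(3, 1) == 3

    mutants = list(mutate(put_source))
    surviving = [m for m in mutants if run_tests(m, initial_test)]
    mutation_score = 100 * (len(mutants) - len(surviving)) / len(mutants)
    print(f"Mutation score: {mutation_score:.1f}%")

    # Surviving mutants are fed back into the prompt so the LLM can strengthen the tests.
    augmented_prompt = (
        "Improve the following test so it also fails on these surviving mutants:\n"
        + put_source + "\n" + "\n---\n".join(surviving)
    )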
Quality Issues in Machine Learning Software Systems
Context: An increasing demand is observed in various domains to employ
Machine Learning (ML) for solving complex problems. ML models are implemented
as software components and deployed in Machine Learning Software Systems
(MLSSs). Problem: There is a strong need for ensuring the serving quality of
MLSSs. False or poor decisions of such systems can lead to malfunction of other
systems, significant financial losses, or even threats to human life. The
quality assurance of MLSSs is considered a challenging task and currently is a
hot research topic. Objective: This paper aims to investigate the
characteristics of real quality issues in MLSSs from the viewpoint of
practitioners. This empirical study aims to identify a catalog of quality
issues in MLSSs. Method: We conduct a set of interviews with
practitioners/experts, to gather insights about their experience and practices
when dealing with quality issues. We validate the identified quality issues via
a survey with ML practitioners. Results: Based on the content of 37 interviews,
we identified 18 recurring quality issues and 24 strategies to mitigate them.
For each identified issue, we describe the causes and consequences according to
the practitioners' experience. Conclusion: We believe the catalog of issues
developed in this study will allow the community to develop efficient quality
assurance tools for ML models and MLSSs. A replication package of our study is
available on our public GitHub repository.
How to Certify Machine Learning Based Safety-critical Systems? A Systematic Literature Review
Context: Machine Learning (ML) has been at the heart of many innovations over
the past years. However, including it in so-called 'safety-critical' systems
such as automotive or aeronautics has proven to be very challenging, since the
shift in paradigm that ML brings completely changes traditional certification
approaches.
Objective: This paper aims to elucidate challenges related to the
certification of ML-based safety-critical systems, as well as the solutions
that are proposed in the literature to tackle them, answering the question 'How
to Certify Machine Learning Based Safety-critical Systems?'.
Method: We conduct a Systematic Literature Review (SLR) of research papers
published between 2015 and 2020, covering topics related to the certification of
ML systems. In total, we identified 217 papers covering topics considered to be
the main pillars of ML certification: Robustness, Uncertainty, Explainability,
Verification, Safe Reinforcement Learning, and Direct Certification. We
analyzed the main trends and problems of each sub-field and provided summaries
of the papers extracted.
Results: The SLR results highlighted the enthusiasm of the community for this
subject, as well as the lack of diversity in terms of datasets and type of
models. It also emphasized the need to further develop connections between
academia and industries to deepen the domain study. Finally, it also
illustrated the necessity to build connections between the above-mentioned
main pillars, which are for now mainly studied separately.
Conclusion: We highlighted current efforts deployed to enable the
certification of ML-based software systems, and discussed some future
research directions.
Introducing v0.5 of the AI Safety Benchmark from MLCommons
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English) and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform and downloadable tool, called ModelBench, that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; and (7) a test specification for the benchmark.
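As a rough illustration of what grading a system against such a benchmark can involve, the following sketch aggregates per-hazard results into a simple safe-response rate. The category names, data, and scoring rule are placeholders; ModelBench's actual grading system is the one defined in the paper and tool.

    # Hypothetical sketch of benchmark-style grading, not ModelBench's actual scheme:
    # each test item belongs to a hazard category, a safety judgment is attached to
    # the SUT's response, and the per-category grade is the fraction judged safe.
    from collections import defaultdict

    # (hazard_category, response_was_safe) pairs for one system under test -- made-up data.
    results = [
        ("violent_crimes", True), ("violent_crimes", False),
        ("hate", True), ("hate", True),
        ("suicide_and_self_harm", True),
    ]

    per_hazard = defaultdict(lambda: [0, 0])   # category -> [safe_count, total_count]
    for hazard, safe in results:
        per_hazard[hazard][0] += int(safe)
        per_hazard[hazard][1] += 1

    for hazard, (safe_count, total) in sorted(per_hazard.items()):
        print(f"{hazard}: {safe_count / total:.0%} of responses judged safe ({total} items)")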