A Comparison of Reinforcement Learning Frameworks for Software Testing Tasks
Software testing activities scrutinize the artifacts and the behavior of a
software product to find possible defects and ensure that the product meets its
expected requirements. Recently, Deep Reinforcement Learning (DRL) has been
successfully employed in complex testing tasks such as game testing, regression
testing, and test case prioritization to automate the process and provide
continuous adaptation. Practitioners can employ DRL either by implementing a
DRL algorithm from scratch or by using a DRL framework. DRL frameworks offer
well-maintained implementations of state-of-the-art DRL algorithms to facilitate and
speed up the development of DRL applications. Developers have widely used these
frameworks to solve problems in various domains including software testing.
However, to the best of our knowledge, there is no study that empirically
evaluates the effectiveness and performance of implemented algorithms in DRL
frameworks. Moreover, the literature lacks guidelines that would help
practitioners choose one DRL framework over another. In this paper,
we empirically investigate the applications of carefully selected DRL
algorithms on two important software testing tasks: test case prioritization in
the context of Continuous Integration (CI) and game testing. For the game
testing task, we conduct experiments on a simple game and use DRL algorithms to
explore the game to detect bugs. Results show that some of the selected DRL
frameworks such as Tensorforce outperform recent approaches in the literature.
To prioritize test cases, we run experiments on a CI environment where DRL
algorithms from different frameworks are used to rank the test cases. Our
results show that the performance difference between implemented algorithms in
some cases is considerable, motivating further investigation.
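The prioritization task above can be framed as a sequential decision problem. As a rough illustration only, the sketch below defines a toy Gymnasium environment in which an agent selects the next test case to run and is rewarded for scheduling failing tests early; the environment name, feature layout, and reward shaping are hypothetical and are not taken from the paper, and any DRL framework's algorithm implementation (e.g., from Stable-Baselines3 or Tensorforce) could be plugged in on top of it.

    # Minimal sketch (not the paper's implementation): an agent picks the next test
    # case to run and is rewarded when failing tests are scheduled early. Failure
    # probabilities and features below are hypothetical placeholders.
    import numpy as np
    import gymnasium as gym
    from gymnasium import spaces

    class TestPrioritizationEnv(gym.Env):
        def __init__(self, n_tests=10, seed=0):
            super().__init__()
            self.n_tests = n_tests
            self.rng = np.random.default_rng(seed)
            # Observation: per-test features (recent failure rate, duration, scheduled flag).
            self.observation_space = spaces.Box(0.0, 1.0, shape=(n_tests, 3), dtype=np.float32)
            # Action: index of the next test case to schedule.
            self.action_space = spaces.Discrete(n_tests)

        def reset(self, *, seed=None, options=None):
            super().reset(seed=seed)
            self.features = self.rng.random((self.n_tests, 3)).astype(np.float32)
            self.features[:, 2] = 0.0                                # nothing scheduled yet
            self.fails = self.rng.random(self.n_tests) < self.features[:, 0]  # hidden ground truth
            self.position = 0
            return self.features.copy(), {}

        def step(self, action):
            already_scheduled = self.features[action, 2] > 0
            self.features[action, 2] = 1.0
            self.position += 1
            # Reward failing tests scheduled early; penalize re-selecting a test.
            if already_scheduled:
                reward = -1.0
            else:
                reward = float(self.fails[action]) * (self.n_tests - self.position + 1) / self.n_tests
            terminated = self.position >= self.n_tests
            return self.features.copy(), reward, terminated, False, {}

    # A framework-provided DRL algorithm would be trained on this environment;
    # a random policy is shown here only to exercise the interface.
    env = TestPrioritizationEnv()
    obs, _ = env.reset()
    done = False
    while not done:
        obs, reward, done, _, _ = env.step(env.action_space.sample())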
Faults in Deep Reinforcement Learning Programs: A Taxonomy and A Detection Approach
A growing demand is witnessed in both industry and academia for employing
Deep Learning (DL) in various domains to solve real-world problems. Deep
Reinforcement Learning (DRL) is the application of DL in the domain of
Reinforcement Learning (RL). Like any software system, DRL applications can
fail because of faults in their programs. In this paper, we present the first
attempt to categorize faults occurring in DRL programs. We manually analyzed
761 artifacts of DRL programs (from Stack Overflow posts and GitHub issues)
developed using well-known DRL frameworks (OpenAI Gym, Dopamine, Keras-rl,
Tensorforce) and identified faults reported by developers/users. We labeled and
taxonomized the identified faults through several rounds of discussions. The
resulting taxonomy is validated using an online survey with 19
developers/researchers. To allow for the automatic detection of faults in DRL
programs, we have defined a meta-model of DRL programs and developed DRLinter,
a model-based fault detection approach that leverages static analysis and graph
transformations. The execution flow of DRLinter consists of parsing a DRL
program to generate a model conforming to our meta-model and applying
detection rules on the model to identify fault occurrences. The effectiveness
of DRLinter is evaluated using 15 synthetic DRL programs in which we injected
faults observed in the analyzed artifacts of the taxonomy. The results show
that DRLinter can successfully detect faults in all synthetic faulty programs.
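To make the detection idea concrete, here is a deliberately simplified sketch. DRLinter itself parses programs into models conforming to a meta-model and applies graph-transformation rules; the toy checker below instead uses Python's ast module and encodes a single hypothetical rule (an environment that is stepped but never reset), purely to illustrate what a static detection rule over a DRL program can look like.

    # Illustrative only: a toy static check over a DRL program, not DRLinter's
    # meta-model or rule set. It flags programs that call env.step(...) but never
    # call env.reset(...), a hypothetical fault pattern.
    import ast

    RULE = "environment is stepped but never reset"

    def check_missing_reset(source: str) -> list[str]:
        tree = ast.parse(source)
        called = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
                called.add(node.func.attr)          # record method names that are called
        findings = []
        if "step" in called and "reset" not in called:
            findings.append(RULE)
        return findings

    faulty_program = """
    import gymnasium as gym
    env = gym.make("CartPole-v1")
    for _ in range(100):
        obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    """
    print(check_missing_reset(faulty_program))      # ['environment is stepped but never reset']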
Automatic Fault Detection for Deep Learning Programs Using Graph Transformations
Nowadays, we are witnessing an increasing demand in both industry and
academia for exploiting Deep Learning (DL) to solve complex real-world
problems. A DL program encodes the network structure of a desired DL model
and the process by which the model learns from the training dataset. Like any
software, a DL program can be faulty, which poses substantial challenges for
software quality assurance, especially in safety-critical domains. It is
therefore crucial to equip DL development teams with efficient fault detection
techniques and tools. In this paper, we propose NeuraLint, a model-based fault
detection approach for DL programs, using meta-modelling and graph
transformations. First, we design a meta-model for DL programs that includes
their base skeleton and fundamental properties. Then, we construct a
graph-based verification process that covers 23 rules defined on top of the
meta-model and implemented as graph transformations to detect faults and design
inefficiencies in the generated models (i.e., instances of the meta-model).
The proposed approach is first evaluated by finding faults and design
inefficiencies in 28 synthesized examples built from common problems reported
in the literature. Then NeuraLint successfully finds 64 faults and design
inefficiencies in 34 real-world DL programs extracted from Stack Overflow posts
and GitHub repositories. The results show that NeuraLint effectively detects
faults and design issues in both synthesized and real-world examples with a
recall of 70.5% and a precision of 100%. Although the proposed meta-model is
designed for feedforward neural networks, it can be extended to support other
neural network architectures such as recurrent neural networks. Researchers can
also expand our set of verification rules to cover more types of issues in DL
programs.
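The verification idea can be sketched with a toy example. The following is not NeuraLint's meta-model or one of its 23 rules: it hand-writes a tiny layer graph and checks one hypothetical rule, a classifier whose output layer lacks a softmax activation, just to show what a graph-based check over a DL program model can look like.

    # Toy sketch of graph-based verification, not NeuraLint itself. A DL program is
    # modelled as a list of layer nodes linked by "next"; one hypothetical rule flags
    # a classifier output layer with no softmax activation.
    layers = [
        {"id": 0, "type": "Dense", "units": 128, "activation": "relu", "next": 1},
        {"id": 1, "type": "Dense", "units": 10, "activation": None, "next": None},  # output layer
    ]

    def check_output_activation(graph):
        findings = []
        for layer in graph:
            is_output = layer["next"] is None
            if is_output and layer["type"] == "Dense" and layer["activation"] != "softmax":
                findings.append(f"layer {layer['id']}: classifier output layer without softmax")
        return findings

    print(check_output_activation(layers))
    # ['layer 1: classifier output layer without softmax']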
Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing
One of the critical phases in software development is software testing.
Testing helps with identifying potential bugs and reducing maintenance costs.
The goal of automated test generation tools is to ease the development of tests
by suggesting efficient bug-revealing tests. Recently, researchers have
leveraged Large Language Models (LLMs) of code to generate unit tests. While
the code coverage of generated tests was usually assessed, the literature has
acknowledged that the coverage is weakly correlated with the efficiency of
tests in bug detection. To address this limitation, in this paper we
introduce MuTAP, which improves the bug-revealing effectiveness of test cases
generated by LLMs by leveraging mutation testing. Our goal is achieved
by augmenting prompts with surviving mutants, as those mutants highlight the
limitations of test cases in detecting bugs. MuTAP is capable of generating
effective test cases in the absence of natural language descriptions of the
Programs Under Test (PUTs). We employ different LLMs within MuTAP and evaluate
their performance on different benchmarks. Our results show that our proposed
method is able to detect up to 28% more faulty human-written code snippets.
Among these, 17% remained undetected by both the current state-of-the-art fully
automated test generation tool (i.e., Pynguin) and zero-shot/few-shot learning
approaches on LLMs. Furthermore, MuTAP achieves a Mutation Score (MS) of 93.57%
on synthetic buggy code, outperforming all other approaches in our evaluation.
Our findings suggest that although LLMs can serve as a useful tool to generate
test cases, they require specific post-processing steps to enhance the
effectiveness of the generated test cases, which may suffer from syntactic or
functional errors and may be ineffective in detecting certain types of bugs
and in testing corner cases of PUTs.
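The core loop of augmenting prompts with surviving mutants can be illustrated with a toy example. The sketch below is not MuTAP itself: the mutation operator, the seed test, and the prompt text are made up, and the LLM call is replaced by a hand-written test. It only shows how surviving mutants can be identified, how a mutation score is computed as the fraction of killed mutants, and how survivors can be folded back into a prompt.

    # Rough sketch of mutation-testing feedback, with the LLM call stubbed out.
    # MuTAP's actual prompts, mutation operators, and workflow differ; everything
    # named here (mutate, initial_test) is hypothetical.
    put_source = "def max_of_two(a, b):\n    return a if a >= b else b\n"

    def mutate(source):
        """Yield simple mutants by swapping one comparison operator (illustrative only)."""
        for op, repl in [(">=", "<"), (">=", ">")]:
            if op in source:
                yield source.replace(op, repl, 1)

    def run_tests(source, test_fn):
        namespace = {}
        exec(source, namespace)          # load the (possibly mutated) PUT
        try:
            test_fn(namespace["max_of_two"])
            return True                  # tests pass -> mutant survives
        except AssertionError:
            return False                 # tests fail -> mutant is killed

    def initial_test(max_of_two):        # stand-in for an LLM-generated test
        assert max_of_two(3, 1) == 3

    mutants = list(mutate(put_source))
    surviving = [m for m in mutants if run_tests(m, initial_test)]
    mutation_score = 100 * (len(mutants) - len(surviving)) / len(mutants)
    print(f"Mutation score: {mutation_score:.1f}%")

    # Surviving mutants are fed back into the prompt so the LLM can strengthen the tests.
    augmented_prompt = (
        "Improve the following test so it also fails on these surviving mutants:\n"
        + put_source + "\n" + "\n---\n".join(surviving)
    )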
Quality Issues in Machine Learning Software Systems
Context: An increasing demand is observed in various domains to employ
Machine Learning (ML) for solving complex problems. ML models are implemented
as software components and deployed in Machine Learning Software Systems
(MLSSs). Problem: There is a strong need for ensuring the serving quality of
MLSSs. False or poor decisions of such systems can lead to malfunction of other
systems, significant financial losses, or even threats to human life. The
quality assurance of MLSSs is considered a challenging task and currently is a
hot research topic. Objective: This paper aims to investigate the
characteristics of real quality issues in MLSSs from the viewpoint of
practitioners. This empirical study aims to identify a catalog of quality
issues in MLSSs. Method: We conduct a set of interviews with
practitioners/experts, to gather insights about their experience and practices
when dealing with quality issues. We validate the identified quality issues via
a survey with ML practitioners. Results: Based on the content of 37 interviews,
we identified 18 recurring quality issues and 24 strategies to mitigate them.
For each identified issue, we describe the causes and consequences according to
the practitioners' experience. Conclusion: We believe the catalog of issues
developed in this study will allow the community to develop efficient quality
assurance tools for ML models and MLSSs. A replication package of our study is
available on our public GitHub repository.
How to Certify Machine Learning Based Safety-critical Systems? A Systematic Literature Review
Context: Machine Learning (ML) has been at the heart of many innovations over
the past years. However, including it in so-called 'safety-critical' systems
such as automotive or aeronautics has proven to be very challenging, since the
shift in paradigm that ML brings completely changes traditional certification
approaches.
Objective: This paper aims to elucidate challenges related to the
certification of ML-based safety-critical systems, as well as the solutions
that are proposed in the literature to tackle them, answering the question 'How
to Certify Machine Learning Based Safety-critical Systems?'.
Method: We conduct a Systematic Literature Review (SLR) of research papers
published between 2015 and 2020, covering topics related to the certification of
ML systems. In total, we identified 217 papers covering topics considered to be
the main pillars of ML certification: Robustness, Uncertainty, Explainability,
Verification, Safe Reinforcement Learning, and Direct Certification. We
analyzed the main trends and problems of each sub-field and provided summaries
of the papers extracted.
Results: The SLR results highlighted the enthusiasm of the community for this
subject, as well as the lack of diversity in terms of datasets and type of
models. It also emphasized the need to further develop connections between
academia and industries to deepen the domain study. Finally, it also
illustrated the necessity to build connections between the above-mentioned
main pillars, which are for now mainly studied separately.
Conclusion: We highlighted current efforts deployed to enable the
certification of ML-based software systems, and discussed some future
research directions.
Introducing v0.5 of the AI Safety Benchmark from MLCommons
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English) and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform and downloadable tool, called ModelBench, that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; and (7) a test specification for the benchmark.
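As a rough illustration of what grading a system against such a benchmark can involve, the following sketch aggregates per-hazard results into a simple safe-response rate. The category names, data, and scoring rule are placeholders; ModelBench's actual grading system is the one defined in the paper and tool.

    # Hypothetical sketch of benchmark-style grading, not ModelBench's actual scheme:
    # each test item belongs to a hazard category, a safety judgment is attached to
    # the SUT's response, and the per-category grade is the fraction judged safe.
    from collections import defaultdict

    # (hazard_category, response_was_safe) pairs for one system under test -- made-up data.
    results = [
        ("violent_crimes", True), ("violent_crimes", False),
        ("hate", True), ("hate", True),
        ("suicide_and_self_harm", True),
    ]

    per_hazard = defaultdict(lambda: [0, 0])   # category -> [safe_count, total_count]
    for hazard, safe in results:
        per_hazard[hazard][0] += int(safe)
        per_hazard[hazard][1] += 1

    for hazard, (safe_count, total) in sorted(per_hazard.items()):
        print(f"{hazard}: {safe_count / total:.0%} of responses judged safe ({total} items)")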