96 research outputs found

    Using Machine Learning to Generate Test Oracles: A Systematic Literature Review

    Machine learning may enable the automated generation of test oracles. We have characterized emerging research in this area through a systematic literature review examining oracle types, researcher goals, the ML techniques applied, how the generation process was assessed, and the open research challenges in this emerging field. Based on a sample of 22 relevant studies, we observed that ML algorithms generated test verdict, metamorphic relation, and - most commonly - expected output oracles. Almost all studies employ a supervised or semi-supervised approach, trained on labeled system executions or code metadata - including neural networks, support vector machines, adaptive boosting, and decision trees. Oracles are evaluated using the mutation score, correct classifications, accuracy, and ROC. Work to date shows great promise, but there are significant open challenges regarding the requirements imposed on training data, the complexity of modeled functions, the ML algorithms employed - and how they are applied - the benchmarks used by researchers, and replicability of the studies. We hope that our findings will serve as a roadmap and inspiration for researchers in this field. Comment: Pre-print. Article accepted to 1st International Workshop on Test Oracles at ESEC/FSE 202

    The Integration of Machine Learning into Automated Test Generation: A Systematic Mapping Study

    Context: Machine learning (ML) may enable effective automated test generation. Objective: We characterize emerging research, examining testing practices, researcher goals, ML techniques applied, evaluation, and challenges. Methods: We perform a systematic mapping on a sample of 102 publications. Results: ML generates input for system, GUI, unit, performance, and combinatorial testing or improves the performance of existing generation methods. ML is also used to generate test verdicts, property-based, and expected output oracles. Supervised learning - often based on neural networks - and reinforcement learning - often based on Q-learning - are common, and some publications also employ unsupervised or semi-supervised learning. (Semi-/Un-)Supervised approaches are evaluated using both traditional testing metrics and ML-related metrics (e.g., accuracy), while reinforcement learning is often evaluated using testing metrics tied to the reward function. Conclusion: Work to date shows great promise, but there are open challenges regarding training data, retraining, scalability, evaluation complexity, ML algorithms employed - and how they are applied - benchmarks, and replicability. Our findings can serve as a roadmap and inspiration for researchers in this field. Comment: Under submission to Software Testing, Verification, and Reliability journal. (arXiv admin note: text overlap with arXiv:2107.00906 - This is an earlier study that this study extends)

    Fairness Testing: A Comprehensive Survey and Analysis of Trends

    Unfair behaviors of Machine Learning (ML) software have garnered increasing attention and concern among software engineers. To tackle this issue, extensive research has been dedicated to fairness testing of ML software, and this paper offers a comprehensive survey of existing studies in this field. We collect 100 papers and organize them based on the testing workflow (i.e., how to test) and testing components (i.e., what to test). Furthermore, we analyze the research focus, trends, and promising directions in fairness testing. We also identify widely adopted datasets and open-source tools for fairness testing.

    Generating metamorphic relations for cyber-physical systems with genetic programming: an industrial case study

    One of the major challenges in the verification of complex industrial Cyber-Physical Systems is the difficulty of determining whether a particular system output or behaviour is correct or not, the so-called test oracle problem. Metamorphic testing alleviates the oracle problem by reasoning on the relations that are expected to hold among multiple executions of the system under test, which are known as Metamorphic Relations (MRs). However, the development of effective MRs is often challenging and requires the involvement of domain experts. In this paper, we present a case study aiming at automating this process. To this end, we implemented GAssertMRs, a tool to automatically generate MRs with genetic programming. We assess the cost-effectiveness of this tool in the context of an industrial case study from the elevation domain. Our experimental results show that in most cases GAssertMRs outperforms the other baselines, including manually generated MRs developed with the help of domain experts. We then describe the lessons learned from our experiments and outline future work towards the adoption of this technique by industrial practitioners.
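
For readers unfamiliar with metamorphic relations, the sketch below shows what checking one MR looks like in practice. It is not GAssertMRs and involves no genetic programming; the relation sin(x) = sin(pi - x), the helper names, and the deliberately faulty `buggy_sin` are all illustrative assumptions.

```python
import math
import random


def check_metamorphic_relation(sut, transform, relation, inputs):
    """Return the inputs whose source and follow-up outputs violate the relation."""
    violations = []
    for x in inputs:
        source_output = sut(x)                # execution on the source input
        follow_up_output = sut(transform(x))  # execution on the follow-up input
        if not relation(source_output, follow_up_output):
            violations.append(x)
    return violations


def reflect(x):
    # Input transformation of the assumed MR: x -> pi - x
    return math.pi - x


def approx_equal(a, b):
    # Output relation of the assumed MR: outputs agree up to a small tolerance
    return math.isclose(a, b, abs_tol=1e-9)


def buggy_sin(x):
    # Hypothetical faulty implementation: a Taylor series truncated too early
    return x - x ** 3 / 6


rng = random.Random(0)
inputs = [rng.uniform(0.0, math.pi) for _ in range(100)]

# No expected output is needed for any individual input; only the relation is checked.
correct_violations = check_metamorphic_relation(math.sin, reflect, approx_equal, inputs)
buggy_violations = check_metamorphic_relation(buggy_sin, reflect, approx_equal, inputs)

print(len(correct_violations), len(buggy_violations))
```

The check never consults a reference value of the sine function, which is why such relations can stand in for a missing test oracle.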

    Perfect is the enemy of test oracle

    Automation of test oracles is one of the most challenging facets of software testing, but it has received comparatively less attention than automated test input generation. Test oracles rely on a ground truth that can distinguish between correct and buggy behavior to determine whether a test fails (detects a bug) or passes. What makes the oracle problem challenging and undecidable is the assumption that the ground truth should know the exact expected, correct, or buggy behavior. However, we argue that one can still build an accurate oracle without knowing the exact correct or buggy behavior, knowing only how the two might differ. This paper presents SEER, a learning-based approach that, in the absence of test assertions or other types of oracle, can determine whether a unit test passes or fails on a given method under test (MUT). To build the ground truth, SEER jointly embeds unit tests and the implementations of MUTs into a unified vector space, in such a way that the neural representations of tests are similar to those of the MUTs they pass on, but dissimilar to those of the MUTs they fail on. The classifier built on top of this vector representation serves as the oracle, generating "fail" labels when test inputs detect a bug in the MUT and "pass" labels otherwise. Our extensive experiments on applying SEER to more than 5K unit tests from a diverse set of open-source Java projects show that the produced oracle is (1) effective in predicting the fail or pass labels, achieving an overall accuracy, precision, recall, and F1 measure of 93%, 86%, 94%, and 90%, respectively, (2) generalizable, predicting the labels for the unit tests of projects that were not in the training or validation set with negligible performance drop, and (3) efficient, detecting the existence of bugs in only 6.5 milliseconds on average. Comment: Published in ESEC/FSE 202

    Automated Testing and Improvement of Named Entity Recognition Systems

    Named entity recognition (NER) systems have seen rapid progress in recent years due to the development of deep neural networks. These systems are widely used in various natural language processing applications, such as information extraction, question answering, and sentiment analysis. However, the complexity and intractability of deep neural networks can make NER systems unreliable in certain circumstances, resulting in incorrect predictions. For example, NER systems may misidentify female names as chemicals or fail to recognize the names of minority groups, leading to user dissatisfaction. To tackle this problem, we introduce TIN, a novel, widely applicable approach for automatically testing and repairing various NER systems. The key idea for automated testing is that the NER predictions of the same named entities under similar contexts should be identical. The core idea for automated repairing is that similar named entities should have the same NER prediction under the same context. We use TIN to test two SOTA NER models and two commercial NER APIs, i.e., Azure NER and AWS NER. We manually verify 784 of the suspicious issues reported by TIN and find that 702 are erroneous issues, leading to high precision (85.0%-93.4%) across four categories of NER errors: omission, over-labeling, incorrect category, and range error. For automated repairing, TIN achieves a high error reduction rate (26.8%-50.6%) over the four systems under test, successfully repairing 1,056 out of the 1,877 reported NER errors. Comment: Accepted by ESEC/FSE'2
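
The testing idea stated in this abstract (the same entity should receive the same label under similar contexts) can be sketched as a simple consistency check. This is not TIN's implementation: `toy_ner`, its deliberate flaw, the hand-written contexts, and `find_inconsistencies` are hypothetical stand-ins, and how TIN actually produces similar contexts is not shown here.

```python
def toy_ner(sentence):
    """Hypothetical stand-in for a NER system: returns {entity: label}."""
    labels = {}
    if "Marie Curie" in sentence:
        # Deliberate flaw for illustration: the label depends on an unrelated context word.
        labels["Marie Curie"] = "CHEMICAL" if "laboratory" in sentence else "PERSON"
    return labels


def find_inconsistencies(ner, entity, contexts):
    """Flag contexts where the entity's label differs from its label in the first context."""
    predictions = [(c, ner(c.format(entity=entity)).get(entity)) for c in contexts]
    reference_context, reference_label = predictions[0]
    return [
        (reference_context, context, reference_label, label)
        for context, label in predictions[1:]
        if label != reference_label
    ]


# Hand-written similar contexts (an assumption made for this sketch).
contexts = [
    "{entity} gave a lecture yesterday.",
    "{entity} gave a talk yesterday.",
    "{entity} gave a lecture in the laboratory yesterday.",
]

suspicious = find_inconsistencies(toy_ner, "Marie Curie", contexts)
for reference_context, context, expected, actual in suspicious:
    print(f"{context!r}: expected {expected}, got {actual}")
```

A flagged pair is only a suspicious issue, not a confirmed error, which is consistent with the abstract's report that the reported issues were verified manually.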

    Versatile Deep Learning Forecasting Application with Metamorphic Quality Assurance

    Accurate estimates of fresh produce (FP) yields and prices are crucial for fair bidding prices by retailers and informed asking prices by farmers, leading to the best prices for customers. To obtain accurate estimates, the state-of-the-art deep learning (DL) models for forecasting FP yields and prices, including both station-based and satellite-based models, are improved in this thesis by providing a new deep learning model structure. The scope of this work covers forecasting a horizon of 5 weeks ahead for the fresh produce yields and prices. The proposed structure is built using an ensemble of Attention Deep Feedforward Neural Network with Gated Recurrent Units (ADGRU) and Deep Feedforward Neural Network with embedded GRU units (DFNNGRU); (DFNNGRU-ADGRU ENS). The station-based version of the ensemble is trained and tested using as input the soil moisture and temperature parameters retrieved from land stations. This station-based ensemble model is found to outperform the literature model by 24% in the AGM score for yield forecasting and by 37.5% for price forecasting. For the satellite-based model, the best satellite image preprocessing technique must be found to represent the images with less data for efficiency. Therefore, a preprocessing approach based on averaging is proposed, implemented, and compared with the literature approach, which is based on histograms; the proposed approach improves performance by 20%. The proposed Deep Feed Forward Neural Network with Embedded Gated Recurrent Units (DFNNGRU) ensembled with Attention Deep GRUs (ADGRU) is then tested against well-performing models of Stacked-AutoEncoder (SAE) ensembled with Convolutional Neural Networks with Long Short-Term Memory (CNNLSTM), where the proposed model is found to outperform the literature model by 12.5%. In addition, interpolation techniques are used to estimate the vegetation index (VI) values that are missing due to the low frequency at which Landsat captures satellite images. A comparative analysis is conducted to choose the most effective technique, which is found to be Cubic Spline interpolation. The effect of adding the VIs as input parameters on the forecasting performance of the deep learning model is assessed and the most effective VIs are selected. One VI, the Normalized Difference Vegetation Index (NDVI), proves to be the most effective index in forecasting yield, with an enhancement of 12.5% in AGM score. A novel transfer learning (TL) framework is proposed for better generalizability. After finding the best DL forecasting model, a TL framework is proposed to enhance that model's generalization to other FPs by using FP similarity, clustering, and TL techniques customized to fit the problem at hand. Furthermore, the similarity algorithms found in the literature are improved by considering the time series features rather than the absolute values of their points. In addition, the FPs are clustered using a hierarchical clustering technique utilizing the complete linkage of a dendrogram to automate the process of finding the similarity thresholds and avoid setting them arbitrarily. The transfer learning is then applied by freezing some layers of the proposed ensemble model and fine-tuning the rest, leading to significant improvement in AGM compared to the best literature model. Finally, a forecasting application is implemented to facilitate the use of the proposed models by end users through a friendly interface. To test the quality of the application's deployed code and models, metamorphic testing is applied to assess the effectiveness of the machine learning models, while machine learning is used to automatically detect the main metamorphic relations in the software code. The interactive role played by metamorphic testing and machine learning is investigated through the quality assurance of the forecasting application. The datasets used to train and test the deep learning forecasting models, as well as the forecasting models themselves, are verified using metamorphic tests, and the metamorphic relations in the generalization code are automatically detected using Support Vector Machine (SVM) models. Testing revealed unmet requirements, which were then fixed to produce a valid application with sound data, effective models, and valid generalization code.

    Towards Debugging and Testing Deep Learning Systems

    Over the past few years, Deep Learning (DL) has made tremendous progress, achieving or surpassing human-level performance for different tasks such as image classification and speech recognition. Thanks to these advances, we are witnessing a wide adoption of DL in safety-critical applications such as autonomous driving cars, crime prevention and detection, and medical treatment. However, despite their spectacular progress, DL systems, just like traditional software systems, often exhibit erroneous corner-case behaviors due to the existence of latent defects or inefficiencies, which can lead to catastrophic accidents. Thus, software quality assurance (SQA), including reliability and robustness, for DL systems becomes a big concern. Traditional testing for DL models consists of measuring their performance on manually collected data, so it heavily depends on the quality of the test data, which often fails to include rare inputs, as evidenced by recent autonomous-driving car accidents (e.g., Tesla/Uber). Advanced testing techniques are in high demand to improve the trustworthiness of DL systems. Nevertheless, DL testing poses significant challenges stemming from the non-deterministic nature of DL systems (since they follow a data-driven paradigm; the target task is learned statistically) and their lack of oracle (since they are designed principally to provide the answer). Recently, software researchers have started adapting concepts from the software testing domain, such as test coverage and pseudo-oracles, to tackle these difficulties. Despite some promising results obtained from adapting existing software testing methods, current software testing techniques for DL systems are still quite immature. In this thesis, we examine existing testing techniques for DL systems and propose some new techniques. We achieve this by following a systematic approach consisting of: (1) investigating DL software issues and testing challenges; (2) outlining the strengths and weaknesses of the software-based testing techniques adapted for DL systems; and (3) proposing novel testing solutions to fill some of the identified literature gaps, and potentially help improve the SQA of DL systems.

    Performance-Driven Metamorphic Testing of Cyber-Physical Systems

    Cyber-physical systems (CPSs) are a new generation of systems, which integrate software with physical processes. The increasing complexity of these systems, combined with the uncertainty in their interactions with the physical world, makes the definition of effective test oracles especially challenging, an instance of the well-known test oracle problem. Metamorphic testing has shown great potential to alleviate the test oracle problem by exploiting the relations among the inputs and outputs of different executions of the system, so-called metamorphic relations (MRs). In this article, we propose an MR pattern called PV for the identification of performance-driven MRs, and we show its applicability in two CPSs from different domains, which are automated navigation systems and elevator control systems. For the evaluation, we assessed the effectiveness of this approach for detecting failures in an open-source simulation-based autonomous navigation system, as well as in an industrial case study from the elevation domain. We derive concrete MRs based on the PV pattern for both case studies, and we evaluate their effectiveness with seeded faults. Results show that the approach is effective at detecting over 88% of the seeded faults, while keeping the ratio of false positives at 4% or lower. Funding: European Union's Horizon 2020 Research and Innovation Programme (Grant Number: 871319); Junta de Andalucía US-1264651 (APOLO); Junta de Andalucía P18-FR-2895 (EKIPMENT-PLUS); Ministerio de Ciencia e Innovación RTI2018-101204-B-C21 (HORATIO); Mondragon Unibertsitatea IT1519-2