    Machine learning techniques for automated software fault detection via dynamic execution data : empirical evaluation study

    The biggest obstacle to automated software testing is the construction of test oracles. Today, it is possible to generate an enormous number of test cases for an arbitrary system that reach a remarkably high level of coverage, but the effectiveness of test cases is limited by the availability of test oracles that can distinguish failing executions. Previous work by the authors has explored the use of unsupervised and semi-supervised learning techniques to develop test oracles so that the correctness of software outputs and behaviours on new test cases can be predicted [1], [2], [10], and experimental results demonstrate the promise of this approach. In this paper, we present an evaluation study for test oracles based on machine-learning approaches via dynamic execution data (firstly, input/output pairs and secondly, amalgamations of input/output pairs and execution traces) by comparing their effectiveness with existing techniques from the specification mining domain (the data invariant detector Daikon [5]). The two approaches are evaluated on a range of mid-sized systems and compared in terms of their fault detection ability and false positive rate. The empirical study also discusses the major limitations and the most important properties related to the application of machine learning techniques as test oracles in practice. The study also gives a road map for further research to tackle some of the discussed limitations, such as accuracy and scalability. The results show that in most cases semi-supervised learning techniques performed far better as an automated test classifier than Daikon (especially when input/output pairs were augmented with their execution traces). However, there is one system for which our strategy struggles and Daikon performed far better. Furthermore, unsupervised learning techniques performed on a par with Daikon in several cases, particularly when input/output pairs were used together with execution traces.
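
The abstract above compares oracles by fault detection ability and false positive rate. A minimal sketch of how those two measures might be computed for an invariant-style oracle follows. The range check is a deliberately simple stand-in, not Daikon's algorithm, and every name and data shape here is a hypothetical chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: one execution maps variable names to values.
Execution = Dict[str, float]
Ranges = Dict[str, Tuple[float, float]]


def learn_ranges(passing: List[Execution]) -> Ranges:
    """Infer a [min, max] range per variable from known-passing executions."""
    ranges: Ranges = {}
    for run in passing:
        for name, value in run.items():
            lo, hi = ranges.get(name, (value, value))
            ranges[name] = (min(lo, value), max(hi, value))
    return ranges


def flags_failure(run: Execution, ranges: Ranges) -> bool:
    """Flag a run if any observed value falls outside its learned range."""
    return any(
        name in ranges and not (ranges[name][0] <= value <= ranges[name][1])
        for name, value in run.items()
    )


def score(runs: List[Tuple[Execution, bool]], ranges: Ranges) -> Tuple[float, float]:
    """Return (fault detection rate, false positive rate).

    Each item in runs is (execution, truly_failing).
    """
    tp = sum(1 for run, failing in runs if failing and flags_failure(run, ranges))
    fp = sum(1 for run, failing in runs if not failing and flags_failure(run, ranges))
    failing_total = sum(1 for _, failing in runs if failing)
    passing_total = len(runs) - failing_total
    detection = tp / failing_total if failing_total else 0.0
    false_positive = fp / passing_total if passing_total else 0.0
    return detection, false_positive
```

An oracle that flags everything reaches a detection rate of 1.0 but also a false positive rate of 1.0, which is why the two measures are reported together.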

    Learning to Encode and Classify Test Executions

    The challenge of automatically determining the correctness of test executions is referred to as the test oracle problem and is one of the key remaining issues for automated testing. The goal in this paper is to solve the test oracle problem in a way that is general, scalable and accurate. To achieve this, we use supervised learning over test execution traces. We label a small fraction of the execution traces with their verdict of pass or fail. We use the labelled traces to train a neural network (NN) model to learn to distinguish runtime patterns for passing versus failing executions for a given program. Our approach for building this NN model involves the following steps: (1) instrument the program to record execution traces as sequences of method invocations and global state; (2) label a small fraction of the execution traces with their verdicts; (3) design a NN component that embeds information in execution traces into fixed-length vectors; (4) design a NN model that uses the trace information for classification; (5) evaluate the inferred classification model on unseen execution traces from the program. We evaluate our approach using case studies from different application domains: (1) a module from the Ethereum blockchain, (2) a module from the PyTorch deep learning framework, (3) Microsoft SEAL encryption library components, (4) the Sed stream editor, (5) a value pointer library, and (6) nine network protocols from the Linux packet identifier, L7-Filter. We found that the classification models for all subject programs achieved precision, recall and specificity above 95% while training on only 9% of the total traces on average. Our experiments show that the proposed neural network model is highly effective as a test oracle and is able to learn runtime patterns to distinguish passing and failing test executions for systems and tests from different application domains.
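
The abstract above reports precision, recall and specificity for the learned oracle. As a reminder of what those three figures measure, here is a minimal sketch that derives them from true and predicted verdicts. Treating "fail" as the positive class is an assumption made here, and the function name is hypothetical rather than taken from the paper.

```python
from typing import Sequence, Tuple


def oracle_metrics(
    actual_fail: Sequence[bool], predicted_fail: Sequence[bool]
) -> Tuple[float, float, float]:
    """Return (precision, recall, specificity) with 'fail' as the positive class."""
    if len(actual_fail) != len(predicted_fail):
        raise ValueError("verdict sequences must have the same length")

    pairs = list(zip(actual_fail, predicted_fail))
    tp = sum(1 for a, p in pairs if a and p)          # failing, flagged as failing
    fp = sum(1 for a, p in pairs if not a and p)      # passing, flagged as failing
    fn = sum(1 for a, p in pairs if a and not p)      # failing, missed
    tn = sum(1 for a, p in pairs if not a and not p)  # passing, accepted

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return precision, recall, specificity
```

Specificity matters for an oracle because most executions usually pass: a low value means engineers are sent to inspect many correct runs.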

    Using machine learning to classify test outcomes

    When testing software, it has been shown that there are substantial benefits to be gained from approaches which exercise unusual or unexplored interactions with a system - techniques such as random testing, fuzzing, and exploratory testing. However, such approaches have a drawback in that the outputs of the tests need to be manually checked for correctness, representing a significant burden for the software engineer. This paper presents a strategy to support the process of identifying which tests have passed or failed by combining clustering and semi-supervised learning. We have shown that by using machine learning it is possible to cluster test cases in such a way that those corresponding to failures concentrate into smaller clusters. Examining the test outcomes in cluster-size order has the effect of prioritising the results: those that are checked early on have a much higher probability of being a failing test. As the software engineer examines the results (and confirms or refutes the initial classification), this information is employed to bootstrap a secondary learner to further improve the accuracy of the classification of the (as yet) unchecked tests. Results from experimenting with a range of systems demonstrate the substantial benefits that can be gained from this strategy, and how remarkably accurate test output classifications can be derived from examining a relatively small proportion of results.
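
The prioritisation step described above, examining outcomes in cluster-size order, can be sketched as follows. The sketch assumes cluster labels have already been produced by some clustering algorithm (the abstract does not say which), and the function name is hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def inspection_order(cluster_labels: Sequence[Hashable]) -> List[int]:
    """Return test indices ordered so members of the smallest clusters come first.

    cluster_labels[i] is the (assumed precomputed) cluster of test i.
    Ties between equally sized clusters are broken by label text, then by
    test index, so the ordering is deterministic.
    """
    sizes = Counter(cluster_labels)
    return sorted(
        range(len(cluster_labels)),
        key=lambda i: (sizes[cluster_labels[i]], str(cluster_labels[i]), i),
    )


if __name__ == "__main__":
    labels = ["a", "a", "a", "b", "c", "c"]
    print(inspection_order(labels))
```

With the labels in the usage example, the single test in cluster "b" is inspected first, then the two tests in "c", and the three tests in "a" last.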

    ALOJA: A framework for benchmarking and predictive analytics in Hadoop deployments

    This article presents the ALOJA project and its analytics tools, which leverage machine learning to interpret Big Data benchmark performance data and tuning. ALOJA is part of a long-term collaboration between BSC and Microsoft to automate the characterization of cost-effectiveness on Big Data deployments, currently focusing on Hadoop. Hadoop presents a complex run-time environment, where costs and performance depend on a large number of configuration choices. The ALOJA project has created an open, vendor-neutral repository, featuring over 40,000 Hadoop job executions and their performance details. The repository is accompanied by a test-bed and tools to deploy and evaluate the cost-effectiveness of different hardware configurations, parameters and Cloud services. Despite early success within ALOJA, a comprehensive study requires automation of modeling procedures to allow an analysis of large and resource-constrained search spaces. The predictive analytics extension, ALOJA-ML, provides an automated system allowing knowledge discovery by modeling environments from observed executions. The resulting models can forecast execution behaviors, predicting execution times for new configurations and hardware choices. That also enables model-based anomaly detection or efficient benchmark guidance by prioritizing executions. In addition, the community can benefit from ALOJA data-sets and framework to improve the design and deployment of Big Data applications. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 639595). This work is partially supported by the Ministry of Economy of Spain under contracts TIN2012-34557 and 2014SGR1051. Peer reviewed. Postprint (published version).

    Exploring anomalies in time

    The central role of test automation in software quality assurance

    Deep language models for software testing and optimisation

    Developing software is difficult. A challenging part of production development is ensuring programs are correct and fast, two properties achieved through software testing and optimisation. While both tasks still rely on manual effort and expertise, the recent surge in software applications has made them tedious and time-consuming. In this fast-paced environment, manual testing and optimisation hinder productivity significantly and lead to error-prone or sub-optimal programs that waste energy and frustrate users. In this thesis, we propose three novel approaches to automate software testing and optimisation with modern language models based on deep learning. In contrast to our methods, the few existing techniques in these two domains have limited scalability and struggle with real-world applications. Our first contribution lies in the field of software testing and aims to automate the test oracle problem, which is the procedure of determining the correctness of test executions. The test oracle is still largely manual, relying on human experts. Automating the oracle is a non-trivial task that requires software specifications or derived information that is often too difficult to extract. We present the first application of deep language models over program execution traces to predict runtime correctness. Our technique classifies test executions of large-scale codebases used in production as “pass” or “fail”. Our proposed approach reduces by 86% the number of test inputs an expert has to label, by training on only 14% and classifying the rest automatically. Our next two contributions improve the effectiveness of compiler optimisation. Compilers optimise programs by applying heuristic-based transformations constructed by compiler engineers. Selecting the right transformations requires extensive knowledge of the compiler, the subject program and the target architecture. Predictive models have been successfully used to automate heuristics construction, but their performance is hindered by a shortage of training benchmarks in quantity and feature diversity. Our next contributions address the scarcity of compiler benchmarks by generating human-like synthetic programs to improve the performance of predictive models. Our second contribution is BENCHPRESS, the first steerable deep learning synthesizer for executable compiler benchmarks. BENCHPRESS produces human-like programs that compile at a rate of 87%. It targets parts of the feature space previously unreachable by other synthesizers, addressing the scarcity of high-quality training data for compilers. BENCHPRESS improves the performance of a device mapping predictive model by 50% when it introduces synthetic benchmarks into its training data. BENCHPRESS is restricted by a feature-agnostic synthesizer that requires thousands of random inferences to select a few that target the desired features. Our third contribution addresses this inefficiency. We develop BENCHDIRECT, a directed language model for compiler benchmark generation. BENCHDIRECT synthesizes programs by jointly observing the source code context and the compiler features that are targeted. This enables efficient steerable generation on large-scale tasks. Compared to BENCHPRESS, BENCHDIRECT successfully matches 1.8× more Rodinia target benchmarks, while it is up to 36% more accurate and up to 72% faster in targeting three different feature spaces for compilers. All three contributions demonstrate the exciting potential of deep learning and language models to simplify the testing of programs and the construction of better optimisation heuristics for compilers. The outcomes of this thesis provide developers with tools to keep up with the rapidly evolving landscape of software engineering.

    Classifying the Correctness of Generated White-Box Tests: An Exploratory Study

    White-box test generator tools rely only on the code under test to select test inputs, and capture the implementation's output as assertions. If there is a fault in the implementation, it could get encoded in the generated tests. Tool evaluations usually measure fault-detection capability using the number of such fault-encoding tests. However, these faults are only detected if the developer can recognize that the encoded behavior is faulty. We designed an exploratory study to investigate how developers perform in classifying generated white-box tests as faulty or correct. We carried out the study in a laboratory setting with 54 graduate students. The tests were generated for two open-source projects with the help of the IntelliTest tool. The performance of the participants was analyzed using binary classification metrics and by coding their observed activities. The results showed that participants incorrectly classified a large number of both fault-encoding and correct tests (with median misclassification rates of 33% and 25%, respectively). Thus the real fault-detection capability of test generators could be much lower than typically reported, and we suggest taking this human factor into account when evaluating generated white-box tests. Comment: 13 pages, 7 figures
