Inferring Automatic Test Oracles
We propose the use of search-based learning from
existing open source test suites to automatically generate partially
correct test oracles. We argue that mutation testing and n-version
computing (augmented by deep learning and other
soft computing techniques) will be able to predict whether a
program's output is correct sufficiently accurately to be useful.
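The abstract does not prescribe an implementation, but the n-version idea can be illustrated with a minimal sketch: run several independently developed implementations on the same input and treat the majority output as a partial oracle. The function below is a hypothetical illustration, not the authors' technique.

```python
from collections import Counter
from typing import Any, Callable, Optional, Sequence

def n_version_oracle(implementations: Sequence[Callable[..., Any]],
                     *args: Any) -> Optional[Any]:
    """Partial test oracle by majority vote over independent implementations.

    Returns the majority output, or None when no majority exists
    (the oracle abstains rather than guessing).
    """
    outputs = [impl(*args) for impl in implementations]
    winner, count = Counter(map(repr, outputs)).most_common(1)[0]
    if count > len(outputs) // 2:
        # Return the first concrete output whose repr matches the majority value.
        return next(o for o in outputs if repr(o) == winner)
    return None  # no consensus: the oracle cannot decide

# Usage (hypothetical sort implementations voting on the expected output):
# expected = n_version_oracle([sorted, merge_sort, quick_sort], [3, 1, 2])
```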
Learning to Encode and Classify Test Executions
The challenge of automatically determining the correctness of test executions
is referred to as the test oracle problem and is one of the key remaining
issues for automated testing. The goal in this paper is to solve the test
oracle problem in a way that is general, scalable and accurate.
To achieve this, we use supervised learning over test execution traces. We
label a small fraction of the execution traces with their verdict of pass or
fail. We use the labelled traces to train a neural network (NN) model to learn
to distinguish runtime patterns for passing versus failing executions for a
given program. Our approach for building this NN model involves the following
steps: 1. instrument the program to record execution traces as sequences of
method invocations and global state; 2. label a small fraction of the execution
traces with their verdicts; 3. design a NN component that embeds the information
in execution traces into fixed-length vectors; 4. design a NN model that uses the
trace information for classification; and 5. evaluate the inferred classification
model on unseen execution traces from the program.
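The abstract does not spell out the architecture, but steps 3 and 4 can be sketched as follows: embed each trace (a sequence of integer-encoded method invocations) into a fixed-length vector with a recurrent encoder, then classify the execution as passing or failing. The layer choices and sizes below are illustrative assumptions, not the authors' actual model.

```python
import torch
import torch.nn as nn

class TraceClassifier(nn.Module):
    """Sketch of steps 3-4: encode a method-invocation trace into a fixed-length
    vector, then classify the execution as pass (1) or fail (0).
    Vocabulary size, embedding and hidden dimensions are assumed values."""

    def __init__(self, vocab_size: int = 5000, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classify = nn.Linear(hidden_dim, 2)  # verdicts: fail / pass

    def forward(self, traces: torch.Tensor) -> torch.Tensor:
        # traces: (batch, max_trace_len) integer-encoded method invocations
        embedded = self.embed(traces)
        _, (hidden, _) = self.encoder(embedded)   # hidden: (1, batch, hidden_dim)
        return self.classify(hidden.squeeze(0))   # logits: (batch, 2)

# Training on the small labelled fraction of traces (step 2) would use a
# standard cross-entropy loss:
# loss = nn.CrossEntropyLoss()(model(batch_traces), batch_verdicts)
```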
We evaluate our approach using case studies from different application
domains: 1. Module from Ethereum Blockchain, 2. Module from PyTorch deep
learning framework, 3. Microsoft SEAL encryption library components, 4. Sed
stream editor, 5. Value pointer library and 6. Nine network protocols from
Linux packet identifier, L7-Filter. We found that the classification models for all
subject programs achieved high precision, recall and specificity (over 95%)
while training on only 9% of the total traces on average. Our experiments
show that the proposed neural network model is highly effective as a test
oracle and is able to learn runtime patterns that distinguish passing and failing
test executions for systems and tests from different application domains.
Deep language models for software testing and optimisation
Developing software is difficult. A challenging part of production development is ensuring that programs are correct and fast, two properties addressed through software testing and
optimisation. While both tasks still rely on manual effort and expertise, the recent
surge in software applications has made them tedious and time-consuming.
In this fast-paced environment, manual testing and optimisation hinder productivity significantly and lead to error-prone or sub-optimal programs that waste energy
and frustrate users. In this thesis, we propose three novel approaches to automate software testing and optimisation with modern language models based on deep
learning. In contrast to our methods, the few existing techniques in these two domains
have limited scalability and struggle when faced with real-world applications.
Our first contribution lies in the field of software testing and aims to automate
the test oracle problem, which is the procedure of determining the correctness of test
executions. The test oracle is still largely manual, relying on human experts. Automating the oracle is a non-trivial task that requires software specifications or derived
information that is often too difficult to extract. We present the first application of
deep language models over program execution traces to predict runtime correctness.
Our technique classifies test executions of large-scale codebases used in production as
"pass" or "fail". Our proposed approach reduces the number of test inputs an
expert has to label by 86%, training on only 14% of them and classifying the rest automatically.
Our next two contributions improve the effectiveness of compiler optimisation.
Compilers optimise programs by applying heuristic-based transformations constructed
by compiler engineers. Selecting the right transformations requires extensive knowledge of the compiler, the subject program and the target architecture. Predictive models
have been successfully used to automate heuristics construction but their performance
is hindered by a shortage of training benchmarks, both in quantity and in feature diversity. Our
next contributions address this scarcity by generating human-like synthetic programs that improve the performance of predictive models.
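As an illustration of the kind of predictive model meant here, a device-mapping heuristic can be framed as a classifier over static program features. The feature choices and the gradient-boosted model below are assumptions made for the sketch, not the thesis' actual setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Each benchmark is summarised by static features (hypothetical choices here):
# e.g. instruction count, memory-access ratio, data transfer size, work-group size.
X_train = np.array([
    [120, 0.35, 4096, 64],
    [950, 0.10, 1 << 20, 256],
    # ... features extracted from (real or synthetic) training benchmarks
])
y_train = np.array([0, 1])  # 0 = run on CPU, 1 = run on GPU (fastest device)

model = GradientBoostingClassifier().fit(X_train, y_train)

# At compile time, the heuristic predicts the best device for an unseen kernel.
predicted_device = model.predict([[400, 0.22, 65536, 128]])[0]
```

The point of the benchmark generators described next is to enlarge and diversify the pool that such training data is drawn from, so that models of this kind see feature regions that real benchmark suites miss.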
Our second contribution is BENCHPRESS, the first steerable deep learning synthesizer for executable compiler benchmarks. BENCHPRESS produces human-like programs that compile at a rate of 87%. It targets parts of the feature space previously
unreachable by other synthesizers, addressing the scarcity of high-quality training data
for compilers. BENCHPRESS improves the performance of a device mapping predictive model by 50% when it introduces synthetic benchmarks into its training data. BENCHPRESS is restricted by a feature-agnostic synthesizer that requires thousands of random inferences to select a few that target the desired features. Our third
contribution addresses this inefficiency. We develop BENCHDIRECT, a directed language model for compiler benchmark generation. BENCHDIRECT synthesizes programs by jointly observing the source code context and the compiler features that
are targeted. This enables efficient steerable generation on large-scale tasks. Compared to BENCHPRESS, BENCHDIRECT successfully matches 1.8× more Rodinia target benchmarks, while it is up to 36% more accurate and up to 72% faster in targeting
three different feature spaces for compilers.
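The contrast between the feature-agnostic and the directed setups can be sketched roughly as follows; the generator and feature-extraction functions are placeholders for illustration, not the actual BENCHPRESS or BENCHDIRECT APIs.

```python
import numpy as np

def undirected_search(generate, extract_features, target, n_samples=1000):
    """Feature-agnostic synthesis (BENCHPRESS-style): sample many candidate
    programs blindly, then keep the one whose features lie closest to the target."""
    best, best_dist = None, float("inf")
    for _ in range(n_samples):
        program = generate()  # language-model sampling with no feature signal
        dist = np.linalg.norm(extract_features(program) - target)
        if dist < best_dist:
            best, best_dist = program, dist
    return best

def directed_search(generate_conditioned, target):
    """Directed synthesis (BENCHDIRECT-style): the model conditions on the target
    features directly, so one (or a few) inference calls suffice."""
    return generate_conditioned(target_features=target)
```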
All three contributions demonstrate the exciting potential of deep learning and language models to simplify the testing of programs and the construction of better optimisation heuristics for compilers. The outcomes of this thesis provide developers with
tools to keep up with the rapidly evolving landscape of software engineering.
LASSO – an observatorium for the dynamic selection, analysis and comparison of software
Mining software repositories at the scale of 'big code' (i.e., big data) is a challenging activity. As well as finding a suitable software corpus and making it programmatically accessible through an index or database, researchers and practitioners have to establish an efficient analysis infrastructure and precisely define the metrics and data extraction approaches to be applied. Moreover, for analysis results to be generalisable, these tasks have to be applied at a large enough scale to have statistical significance, and if they are to be repeatable, the artefacts need to be carefully maintained and curated over time. Today, however, a lot of this work is still performed by human beings on a case-by-case basis, with the level of effort involved often having a significant negative impact on the generalisability and repeatability of studies, and thus on their overall scientific value.
The general purpose, 'code mining' repositories and infrastructures that have emerged in recent years represent a significant step forward because they automate many software mining tasks at an ultra-large scale and allow researchers and practitioners to focus on defining the questions they would like to explore at an abstract level. However, they are currently limited to static analysis and data extraction techniques, and thus cannot support (i.e., help automate) any studies which involve the execution of software systems. This includes experimental validations of techniques and tools that hypothesise about the behaviour (i.e., semantics) of software, or data analysis and extraction techniques that aim to measure dynamic properties of software.
In this thesis a platform called LASSO (Large-Scale Software Observatorium) is introduced that overcomes this limitation by automating the collection of dynamic (i.e., execution-based) information about software alongside static information. It features a single, ultra-large-scale corpus of executable software systems created by amalgamating existing Open Source software repositories, and a dedicated DSL for defining abstract selection and analysis pipelines. Its key innovations are integrated capabilities for searching for and selecting software systems based on their exhibited behaviour, and an 'arena' that allows their responses to software tests to be compared in a purely data-driven way. We call the platform a 'software observatorium' since it is a place where the behaviour of large numbers of software systems can be observed, analysed and compared.
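The 'arena' idea can be illustrated independently of LASSO's actual DSL (whose syntax is not given here): execute every candidate implementation against the same test inputs and compare the resulting stimulus-response matrix. The functions below are hypothetical placeholders, not LASSO's API.

```python
from typing import Any, Callable, Dict, Sequence

def arena(candidates: Dict[str, Callable[..., Any]],
          test_inputs: Sequence[tuple]) -> Dict[str, list]:
    """Data-driven 'arena' sketch: run each candidate implementation on the same
    test inputs and record its observed outputs."""
    responses = {}
    for name, impl in candidates.items():
        row = []
        for args in test_inputs:
            try:
                row.append(impl(*args))
            except Exception as exc:  # behavioural differences include failures
                row.append(f"<error: {type(exc).__name__}>")
        responses[name] = row
    return responses

# Candidates whose rows agree exhibit the same behaviour on these tests and can be
# selected, ranked or compared purely from the observed data.
```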