JWalk: a tool for lazy, systematic testing of Java classes by design introspection and user interaction
Popular software testing tools, such as JUnit, allow frequent retesting of modified code; yet the manually created test scripts are often seriously incomplete. A unit-testing tool called JWalk has therefore been developed to address the need for systematic unit testing within the context of agile methods. The tool operates directly on the compiled code for Java classes and uses a new lazy method for inducing the changing design of a class on the fly. This is achieved partly through introspection, using Java's reflection capability, and partly through interaction with the user, which constructs and saves test oracles as testing proceeds. Predictive rules reduce the number of oracle values that must be confirmed by the tester. Without human intervention, JWalk performs bounded exhaustive exploration of the class's method protocols and may be directed to explore the space of algebraic constructions, or the intended design state-space of the tested class. With some human interaction, JWalk performs up to the equivalent of fully automated state-based testing, from a specification that was acquired incrementally.
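JWalk itself operates on compiled Java classes via reflection, but the core idea of bounded exhaustive exploration translates to any language with introspection. Below is a minimal, illustrative Python sketch (the `Stack` example class and the `explore`/`public_methods` helpers are ours, not part of JWalk): it enumerates every sequence of public zero-argument methods up to a fixed depth and records the observed results as candidate oracle values a tester could confirm.

```python
# Illustrative sketch of bounded exhaustive exploration via introspection.
import inspect
from itertools import product

class Stack:
    """Toy class under test."""
    def __init__(self):
        self._items = []
    def push1(self):
        self._items.append(1)
    def pop(self):
        return self._items.pop() if self._items else None
    def size(self):
        return len(self._items)

def public_methods(cls):
    # Introspect the class, keeping only public zero-argument methods.
    return [name for name, fn in inspect.getmembers(cls, inspect.isfunction)
            if not name.startswith('_')
            and len(inspect.signature(fn).parameters) == 1]  # just 'self'

def explore(cls, depth):
    # Bounded exhaustive exploration of method sequences up to `depth`.
    names = public_methods(cls)
    observations = {}
    for k in range(1, depth + 1):
        for seq in product(names, repeat=k):
            obj = cls()                      # fresh instance per sequence
            results = [getattr(obj, m)() for m in seq]
            observations[seq] = results     # candidate oracle values
    return observations

if __name__ == "__main__":
    for seq, results in explore(Stack, 2).items():
        print(seq, "->", results)
```

In JWalk the confirmed oracle values are saved and reused across test runs, and predictive rules reduce how many of these candidates a tester must inspect by hand.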
Identifying Patch Correctness in Test-Based Program Repair
Test-based automatic program repair has attracted a lot of attention in recent years. However, test suites in practice are often too weak to guarantee correctness, and existing approaches often generate a large number of incorrect patches.

To reduce the number of incorrect patches generated, we propose a novel approach that heuristically determines the correctness of the generated patches. The core idea is to exploit the behavior similarity of test case executions: passing tests on the original and patched programs are likely to behave similarly, while failing tests on the original and patched programs are likely to behave differently. Also, if two tests exhibit similar runtime behavior, they are likely to have the same test results. Based on these observations, we generate new test inputs to enhance the test suites and use their behavior similarity to determine patch correctness.

Our approach is evaluated on a dataset consisting of 139 patches generated by existing program repair systems, including jGenProg, Nopol, jKali, ACS, and HDRepair. Our approach successfully prevented 56.3% of the incorrect patches from being generated, without blocking any correct patches.

Comment: ICSE 201
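To make the behavior-similarity heuristic concrete, here is a rough Python sketch assuming line coverage as the proxy for runtime behavior; the paper's actual behavioral features, classification rules, and thresholds differ, and `threshold=0.9` is purely an assumption.

```python
# Rough sketch: compare line-coverage traces of tests on the original and
# patched programs; a passing test that suddenly behaves very differently
# on the patched program makes the patch suspect.
import sys

def trace_lines(fn, *args):
    """Run fn(*args) and return the set of (file, line) pairs executed."""
    covered = set()
    def tracer(frame, event, arg):
        if event == "line":
            covered.add((frame.f_code.co_filename, frame.f_lineno))
        return tracer
    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(None)
    return covered

def similarity(a, b):
    """Jaccard similarity of two coverage sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def likely_correct(test_inputs, original, patched, threshold=0.9):
    # Heuristic only: the threshold and the use of plain line coverage
    # are assumptions for illustration, not the paper's exact method.
    for x in test_inputs:
        if similarity(trace_lines(original, x),
                      trace_lines(patched, x)) < threshold:
            return False
    return True
```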
Mapping the Structure and Evolution of Software Testing Research Over the Past Three Decades
Background: The field of software testing is growing and rapidly evolving.
Aims: Based on keywords assigned to publications, we seek to identify predominant research topics and understand how they are connected and have evolved.
Method: We apply co-word analysis to map the topology of testing research as a network, where author-assigned keywords are connected by edges indicating co-occurrence in publications. Keywords are clustered based on edge density and frequency of connection. We examine the most popular keywords, summarize clusters into high-level research topics, examine how topics connect, and examine how the field is changing.
Results: Testing research can be divided into 16 high-level topics and 18 subtopics. Creation guidance, automated test generation, evolution and maintenance, and test oracles have particularly strong connections to other topics, highlighting their multidisciplinary nature. Emerging keywords relate to web and mobile apps, machine learning, energy consumption, and automated program repair and test generation, while emerging connections have formed between many topics and web apps, test oracles, and machine learning. Random and requirements-based testing show potential decline.
Conclusions: Our observations, advice, and map data offer a deeper understanding of the field and inspiration regarding challenges and connections to explore.

Comment: To appear, Journal of Systems and Software
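A minimal co-word analysis can be reproduced with standard graph tooling. The sketch below uses networkx with a few toy keyword sets (the publication data shown is invented for illustration; the paper's clustering uses specific edge-density and frequency thresholds that are not reproduced here).

```python
# Co-word analysis sketch: author keywords become nodes, co-occurrence in
# a publication adds edge weight, and community detection approximates
# the clustering into high-level topics.
from itertools import combinations
import networkx as nx

publications = [  # toy keyword sets, invented for illustration
    {"test generation", "search-based testing", "test oracle"},
    {"test oracle", "machine learning"},
    {"test generation", "machine learning", "mobile apps"},
]

G = nx.Graph()
for keywords in publications:
    for a, b in combinations(sorted(keywords), 2):
        w = G.get_edge_data(a, b, {}).get("weight", 0)
        G.add_edge(a, b, weight=w + 1)

# Cluster keywords into topics by modularity-based community detection.
topics = nx.algorithms.community.greedy_modularity_communities(G, weight="weight")
for i, topic in enumerate(topics):
    print(f"topic {i}: {sorted(topic)}")
```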
The Integration of Machine Learning into Automated Test Generation: A Systematic Mapping Study
Context: Machine learning (ML) may enable effective automated test generation.
Objective: We characterize emerging research, examining testing practices, researcher goals, ML techniques applied, evaluation, and challenges.
Methods: We perform a systematic mapping on a sample of 102 publications.
Results: ML generates input for system, GUI, unit, performance, and combinatorial testing, or improves the performance of existing generation methods. ML is also used to generate test verdicts, property-based oracles, and expected-output oracles. Supervised learning, often based on neural networks, and reinforcement learning, often based on Q-learning, are common, and some publications also employ unsupervised or semi-supervised learning. (Semi-/Un-)Supervised approaches are evaluated using both traditional testing metrics and ML-related metrics (e.g., accuracy), while reinforcement learning is often evaluated using testing metrics tied to the reward function.
Conclusion: Work to date shows great promise, but there are open challenges regarding training data, retraining, scalability, evaluation complexity, the ML algorithms employed and how they are applied, benchmarks, and replicability. Our findings can serve as a roadmap and inspiration for researchers in this field.

Comment: Under submission to the Software Testing, Verification and Reliability journal. (arXiv admin note: text overlap with arXiv:2107.00906, an earlier study that this study extends.)
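As one concrete instance of the "reinforcement learning with a coverage-based reward" pattern the mapping identifies, here is a toy tabular Q-learning sketch; the program under test, the length-based state abstraction, and the reward design are all our own illustrative assumptions.

```python
# Toy Q-learning test generator: actions append characters to an input
# string, and the reward is the number of freshly covered branches.
import random
from collections import defaultdict

def program_under_test(s):
    """Toy PUT: returns the set of branches the input covers."""
    branches = set()
    if len(s) > 3: branches.add("long")
    if s.startswith("a"): branches.add("a-prefix")
    if "z" in s: branches.add("has-z")
    return branches

actions = list("abz")
Q = defaultdict(float)
alpha, gamma, epsilon = 0.5, 0.9, 0.2
covered = set()

for episode in range(200):
    s = ""
    for step in range(5):
        state = len(s)                     # coarse state abstraction
        a = (random.choice(actions) if random.random() < epsilon
             else max(actions, key=lambda x: Q[(state, x)]))
        s += a
        new = program_under_test(s) - covered
        covered |= new
        reward = len(new)                  # reward = fresh coverage
        nxt = len(s)
        best_next = max(Q[(nxt, x)] for x in actions)
        Q[(state, a)] += alpha * (reward + gamma * best_next - Q[(state, a)])

print("branches covered:", covered)
```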
Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing
One of the critical phases in software development is software testing. Testing helps with identifying potential bugs and reducing maintenance costs. The goal of automated test generation tools is to ease the development of tests by suggesting efficient bug-revealing tests. Recently, researchers have leveraged Large Language Models (LLMs) of code to generate unit tests. While the code coverage of generated tests is usually assessed, the literature has acknowledged that coverage is weakly correlated with the efficiency of tests in bug detection. To address this limitation, in this paper we introduce MuTAP, which improves the effectiveness of test cases generated by LLMs at revealing bugs by leveraging mutation testing. Our goal is achieved by augmenting prompts with surviving mutants, as those mutants highlight the limitations of test cases in detecting bugs. MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Program Under Test (PUT). We employ different LLMs within MuTAP and evaluate their performance on different benchmarks. Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets. Among these, 17% remained undetected by both the current state-of-the-art fully automated test generation tool (i.e., Pynguin) and zero-shot/few-shot learning approaches on LLMs. Furthermore, MuTAP achieves a Mutation Score (MS) of 93.57% on synthetic buggy code, outperforming all other approaches in our evaluation. Our findings suggest that although LLMs can serve as a useful tool for generating test cases, they require specific post-processing steps to enhance the effectiveness of the generated test cases, which may suffer from syntactic or functional errors and may be ineffective at detecting certain types of bugs and testing corner cases of PUTs.

Comment: 16 pages, 3 figures
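The feedback loop MuTAP describes can be summarized schematically as follows. Both `call_llm` and `run_mutation_testing` are hypothetical placeholders standing in for an LLM of code and a mutation testing tool; MuTAP's actual prompts, models, and mutation operators are detailed in the paper.

```python
# Schematic of the prompt-augmentation loop: regenerate tests until no
# mutants survive, feeding surviving mutants back into the prompt.
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: returns generated test code."""
    raise NotImplementedError

def run_mutation_testing(put_source: str, test_code: str) -> list:
    """Hypothetical placeholder: returns surviving mutants as strings."""
    raise NotImplementedError

def mutap_style_loop(put_source: str, max_rounds: int = 3) -> str:
    prompt = f"Write unit tests for this function:\n{put_source}"
    tests = call_llm(prompt)
    for _ in range(max_rounds):
        survivors = run_mutation_testing(put_source, tests)
        if not survivors:
            break  # all mutants killed: tests are strong enough
        # Augment the prompt with surviving mutants so the model targets
        # behaviours the current tests fail to distinguish.
        prompt = (f"{prompt}\n\nThese mutants survive the tests:\n"
                  + "\n".join(survivors)
                  + "\nImprove the tests to kill them.")
        tests = call_llm(prompt)
    return tests
```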
Test oracle assessment and improvement
We introduce a technique for assessing and improving test oracles by reducing the incidence of both false positives and false negatives. We prove that our approach can always result in an increase in the mutual information between the actual and perfect oracles. Our technique combines test case generation to reveal false positives and mutation testing to reveal false negatives. We applied the decision-support tool that implements our oracle improvement technique to five real-world subjects. The experimental results show that the fault detection rate of the oracles after improvement increases, on average, by 48.6% (86% over the implicit oracle). Three actual, exposed faults in the studied systems were subsequently confirmed and fixed by the developers.
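A compact sketch of the two assessment directions: generated inputs whose correct outputs the oracle rejects indicate false positives, while mutants whose wrong outputs the oracle accepts indicate false negatives. The `assess_oracle` helper and the absolute-value example are illustrative, not the paper's tool.

```python
# Oracle assessment sketch: oracle(x, y) says whether output y is deemed
# correct for input x; `reference` is the ground-truth implementation.
def assess_oracle(oracle, reference, inputs, mutants):
    false_positives = [x for x in inputs
                       if not oracle(x, reference(x))]   # correct output rejected
    false_negatives = [(m, x) for m in mutants for x in inputs
                       if m(x) != reference(x)           # wrong output...
                       and oracle(x, m(x))]              # ...accepted anyway
    return false_positives, false_negatives

# Example: an over-strict oracle for absolute value.
reference = abs
oracle = lambda x, y: y == x            # wrongly rejects results for negatives
mutants = [lambda x: x, lambda x: -x]
fps, fns = assess_oracle(oracle, reference, range(-3, 4), mutants)
print("inputs wrongly rejected (false positives):", fps)
print("mutant/input pairs wrongly accepted (false negatives):", len(fns))
```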
Automatic Software Repair: a Bibliography
This article presents a survey on automatic software repair. Automatic software repair consists of automatically finding a solution to software bugs without human intervention. This article considers all kinds of repairs. First, it discusses behavioral repair, where test suites, contracts, models, and crashing inputs are taken as the oracle. Second, it discusses state repair, also known as runtime repair or runtime recovery, with techniques such as checkpoint and restart, reconfiguration, and invariant restoration. The uniqueness of this article is that it spans the research communities that contribute to this body of knowledge: software engineering, dependability, operating systems, programming languages, and security. It provides a novel and structured overview of the diversity of bug oracles and repair operators used in the literature.
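Behavioral repair with a test suite as the oracle typically follows a generate-and-validate loop, sketched below in bare-bones form (real systems add fault localization and far richer repair operators; the toy program and patches here are ours).

```python
# Bare-bones generate-and-validate repair: try candidate patches until
# the full test suite, acting as the oracle, passes.
def repair(program, test_suite, candidate_patches):
    for patch in candidate_patches:
        patched = patch(program)
        if all(test(patched) for test in test_suite):
            return patched          # test-suite-adequate patch found
    return None                     # no plausible patch in the search space

# Toy usage: "program" is a function, patches replace it outright.
buggy = lambda x: x + 2                     # intended behaviour: x + 1
tests = [lambda f: f(0) == 1, lambda f: f(4) == 5]
patches = [lambda p: (lambda x: x - 1), lambda p: (lambda x: x + 1)]
fixed = repair(buggy, tests, patches)
print(fixed(10) if fixed else "no patch")   # -> 11
```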
Can Large Language Models Write Good Property-Based Tests?
Property-based testing (PBT), while an established technique in the software testing research community, is still relatively underused in real-world software. Pain points in writing property-based tests include implementing diverse random input generators and thinking of meaningful properties to test. Developers, however, are more amenable to writing documentation; plenty of library API documentation is available and can be used as natural language specifications for property-based tests. As large language models (LLMs) have recently shown promise in a variety of coding tasks, we explore the potential of using LLMs to synthesize property-based tests. We call our approach PBT-GPT, and propose three different strategies of prompting the LLM for PBT. We characterize various failure modes of PBT-GPT and detail an evaluation methodology for automatically synthesized property-based tests. PBT-GPT achieves promising results in our preliminary studies on sample Python library APIs in , , and .
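Since the subject library names are elided above, here is an unrelated but concrete example of what a property-based test looks like in Python, using the real hypothesis library; the sorting properties are our own example, not PBT-GPT output.

```python
# Property-based test: hypothesis generates many random integer lists and
# checks that the stated properties of sorted() hold for all of them.
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sorted_properties(xs):
    ys = sorted(xs)
    assert len(ys) == len(xs)                        # length-preserving
    assert ys == sorted(ys)                          # idempotent
    assert all(a <= b for a, b in zip(ys, ys[1:]))   # ordered

if __name__ == "__main__":
    test_sorted_properties()   # hypothesis runs many random inputs
```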