
    Benchmarking: A methodology for ensuring the relative quality of recommendation systems in software engineering

    This chapter describes the concepts involved in benchmarking recommendation systems. Benchmarking is used to ensure the quality of a research or production system in comparison to other systems, whether algorithmically, infrastructurally, or according to any other sought-after quality. Specifically, the chapter presents the evaluation of recommendation systems according to recommendation accuracy, technical constraints, and business values, within a multi-dimensional benchmarking and evaluation model that combines any number of qualities into a single comparable metric. The chapter first introduces concepts related to the evaluation and benchmarking of recommendation systems, continues with an overview of the current state of the art, then presents the multi-dimensional approach in detail, and concludes with a brief discussion of the introduced concepts and a summary.
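
    As a minimal illustration of the multi-dimensional idea in this abstract, the Java sketch below combines several normalized quality scores (recommendation accuracy, technical factors, business values) into a single comparable metric via a weighted sum. The dimension names and weights are illustrative assumptions, not the chapter's actual model.

    import java.util.Map;

    // Sketch only: aggregates normalized [0, 1] scores per quality dimension
    // into one benchmark metric using a weighted sum. Weights are assumed to
    // sum to 1.0; both the dimensions and the weights are hypothetical.
    public class BenchmarkScore {

        private final Map<String, Double> weights;

        public BenchmarkScore(Map<String, Double> weights) {
            this.weights = weights;
        }

        // Combines normalized per-dimension scores into a single value.
        public double aggregate(Map<String, Double> normalizedScores) {
            double total = 0.0;
            for (Map.Entry<String, Double> w : weights.entrySet()) {
                total += w.getValue() * normalizedScores.getOrDefault(w.getKey(), 0.0);
            }
            return total;
        }

        public static void main(String[] args) {
            BenchmarkScore score = new BenchmarkScore(Map.of(
                    "accuracy", 0.5, "technical", 0.3, "business", 0.2));
            System.out.printf("Final benchmark metric: %.3f%n",
                    score.aggregate(Map.of(
                            "accuracy", 0.82, "technical", 0.70, "business", 0.60)));
        }
    }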

    An Empirical Study of Using Large Language Models for Unit Test Generation

    A code generation model generates code from a prompt consisting of a code comment, existing code, or a combination of both. Although code generation models (e.g. GitHub Copilot) are increasingly being adopted in practice, it is unclear whether they can be used successfully for unit test generation without fine-tuning. To fill this gap, we investigated how well three generative models (Codex, GPT-3.5-Turbo, and StarCoder) can generate test cases. We used two benchmarks (HumanEval and EvoSuite SF110) to investigate the effect of context on the unit test generation process. We evaluated the models based on compilation rates, test correctness, test coverage, and test smells. We found that the Codex model achieved above 80% coverage for the HumanEval dataset, but no model had more than 2% coverage for the EvoSuite SF110 benchmark. The generated tests also suffered from test smells, such as Duplicated Asserts and Empty Tests. Comment: Preprint submitted to the Journal of Systems and Software; 36 pages, 4 figures, 7 tables.
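
    The two test smells named above are easiest to see in a concrete test class; the hypothetical JUnit 5 example below shows what Duplicated Asserts and an Empty Test look like. The Calculator class under test is invented for illustration and is not from the study.

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    // Hypothetical example of the two test smells mentioned in the abstract.
    class GeneratedCalculatorTest {

        // Duplicated Asserts: repeating the same assertion verbatim adds no
        // extra checking power.
        @Test
        void addTwoAndTwo() {
            Calculator c = new Calculator();
            assertEquals(4, c.add(2, 2));
            assertEquals(4, c.add(2, 2)); // duplicated assert
        }

        // Empty Test: a body with no statements, so it always passes.
        @Test
        void subtract() {
        }
    }

    // Invented class under test, included only to keep the example compilable.
    class Calculator {
        int add(int a, int b) {
            return a + b;
        }
    }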

    A detailed investigation of the effectiveness of whole test suite generation

    A common application of search-based software testing is to generate test cases for all goals defined by a coverage criterion (e.g., lines, branches, mutants). Rather than generating one test case at a time for each of these goals individually, whole test suite generation optimizes entire test suites towards satisfying all goals at the same time. There is evidence that the overall coverage achieved with this approach is superior to that of targeting individual coverage goals. Nevertheless, some uncertainty remains on (a) whether the results generalize beyond branch coverage, (b) whether the whole test suite approach might be inferior to a more focused search for some particular coverage goals, and (c) whether generating whole test suites could be optimized by only targeting coverage goals not already covered. In this paper, we perform an in-depth analysis to study these questions. An empirical study on 100 Java classes using three different coverage criteria reveals that there are indeed some testing goals that are only covered by the traditional approach, although their number is very small in comparison with those that are exclusively covered by the whole test suite approach. We find that keeping an archive of already covered goals along with the tests covering them, and focusing the search on uncovered goals, overcomes this small drawback on larger classes, leading to an improved overall effectiveness of whole test suite generation.
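
    A rough sketch of the archive mechanism described at the end of this abstract, assuming stand-in goal and test types rather than EvoSuite's actual classes: covered goals are stored together with the test that covers them, so the search can be focused on the goals that remain uncovered.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Sketch of an archive of covered goals; G and T are stand-in type
    // parameters for coverage goals and test cases.
    class CoverageArchive<G, T> {

        private final Map<G, T> coveredGoals = new HashMap<>();

        // Records a newly covered goal together with the test covering it.
        void cover(G goal, T test) {
            coveredGoals.putIfAbsent(goal, test);
        }

        // Goals the search should still target.
        Set<G> uncovered(Set<G> allGoals) {
            Set<G> remaining = new HashSet<>(allGoals);
            remaining.removeAll(coveredGoals.keySet());
            return remaining;
        }

        // Archived tests to be merged into the final suite at the end.
        Map<G, T> archivedTests() {
            return Map.copyOf(coveredGoals);
        }
    }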

    A Memetic Algorithm for whole test suite generation

    The generation of unit-level test cases for structural code coverage is a task well suited to Genetic Algorithms. Method call sequences must be created that construct objects, put them into the right state, and then execute uncovered code. However, the generation of primitive values, such as integers and doubles, characters that appear in strings, and arrays of primitive values, is not so straightforward. Often, small local changes are required to drive a value toward the one needed to execute some target structure, whereas global searches like Genetic Algorithms tend to make larger changes that are not concentrated on any particular aspect of a test case. In this paper, we extend the Genetic Algorithm behind the EvoSuite test generation tool into a Memetic Algorithm by equipping it with several local search operators. These operators are designed to efficiently optimize primitive values and other aspects of a test suite that allow the search for test cases to function more effectively. We evaluate our operators using a rigorous experimental methodology on over 12,000 Java classes, comprising open source classes of various kinds, including numerical applications and text processors. Our study shows that increases in branch coverage of up to 53% are possible for an individual class in practice.
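
    To make the local search idea concrete, here is a small sketch of a hill-climbing operator for a primitive integer value: it probes in both directions and doubles the step while a branch-distance-style fitness keeps improving. The fitness function is a toy stand-in; EvoSuite's actual operators differ in detail.

    import java.util.function.IntToDoubleFunction;

    // Sketch of a local-search operator for a primitive integer value.
    public class IntegerLocalSearch {

        // Returns a value whose fitness is no worse than the starting value.
        static int improve(int value, IntToDoubleFunction fitness) {
            double best = fitness.applyAsDouble(value);
            boolean improved = true;
            while (improved) {
                improved = false;
                for (int direction : new int[] {1, -1}) {
                    int step = direction;
                    while (fitness.applyAsDouble(value + step) < best) {
                        value += step;
                        best = fitness.applyAsDouble(value);
                        step *= 2; // accelerate while the fitness improves
                        improved = true;
                    }
                }
            }
            return value;
        }

        public static void main(String[] args) {
            // Toy branch distance for the condition (x == 4242): |x - 4242|.
            System.out.println(improve(0, x -> Math.abs(x - 4242))); // prints 4242
        }
    }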

    JavaScript SBST Heuristics to Enable Effective Fuzzing of NodeJS Web APIs

    JavaScript is one of the most popular programming languages. However, its dynamic nature poses several challenges to automated testing techniques. In this paper, we propose an approach, with open-source tool support, to enable white-box testing of JavaScript applications using Search-Based Software Testing (SBST) techniques. We provide an automated approach to collect search-based heuristics such as the common Branch Distance and to enable Testability Transformations. To evaluate our results empirically, we integrated our technique into the EvoMaster test generation tool and carried out analyses on the automated system testing of RESTful and GraphQL APIs. Experiments on eight Web APIs running on NodeJS show that our technique leads to significantly better results than existing black-box and grey-box testing tools, in terms of code coverage and fault detection.
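
    The branch distance heuristic mentioned above can be sketched independently of the instrumented language; the snippet below shows the classic formulation for a few relational predicates, written in Java for consistency with the other sketches (the paper itself instruments JavaScript, and EvoMaster's exact constants and normalization may differ).

    // Sketch of the classic branch-distance heuristic for relational
    // predicates; K is the constant added when the predicate is false.
    public final class BranchDistance {

        private static final double K = 1.0;

        // Distance to making (a == b) true: 0 when equal, grows with the gap.
        static double toEqual(double a, double b) {
            return Math.abs(a - b);
        }

        // Distance to making (a < b) true.
        static double toLessThan(double a, double b) {
            return a < b ? 0.0 : (a - b) + K;
        }

        public static void main(String[] args) {
            // Unlike a flat true/false signal, the distance gives the search
            // a gradient toward covering the target branch.
            System.out.println(toEqual(3, 10));     // 7.0
            System.out.println(toLessThan(12, 10)); // 3.0
        }
    }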

    Semi-automatic Search-Based Test Generation

    Search-based testing techniques can efficiently generate test data to achieve high code coverage. However, when the fitness function does not provide sufficient guidance, the search will only produce optimal results by chance. Yet, where the search algorithm struggles, a human tester with domain knowledge can often produce solutions easily. We therefore include the tester in the test generation process: when the search stagnates, the tester is given an opportunity to improve the current solution, and these improvements are fed back to the search. Relevant problems occur particularly often when generating tests for object-oriented languages, where test cases are sequences of method calls. Constructing complex objects through sequences of method calls is difficult, and the traditional branch distance often offers little guidance, yet for a human tester the same task is often trivial. In this paper, we present a semi-automatic test generation approach based on our search-based EvoSuite tool, and evaluate its usefulness and potential on a set of example classes. Keywords: test case generation; search-based testing; manual testing.
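
    A minimal sketch of the stagnation-triggered hand-off described above, assuming a String-based solution and a console prompt as stand-ins for real test-case editing: when the fitness has not improved for a fixed number of generations, the tester may replace the current best solution before the search resumes.

    import java.util.Scanner;
    import java.util.function.ToDoubleFunction;
    import java.util.function.UnaryOperator;

    // Sketch of semi-automatic search: the tester is consulted on stagnation.
    public class SemiAutomaticSearch {

        static String run(String initial,
                          ToDoubleFunction<String> fitness,  // minimized
                          UnaryOperator<String> searchStep,  // one mutation
                          int maxGenerations,
                          int stagnationLimit) {
            Scanner scanner = new Scanner(System.in);
            String best = initial;
            double bestFitness = fitness.applyAsDouble(best);
            int stagnantGenerations = 0;

            for (int g = 0; g < maxGenerations; g++) {
                String candidate = searchStep.apply(best);
                double f = fitness.applyAsDouble(candidate);
                if (f < bestFitness) {
                    best = candidate;
                    bestFitness = f;
                    stagnantGenerations = 0;
                } else if (++stagnantGenerations >= stagnationLimit) {
                    // Search stagnated: let the tester improve the solution.
                    System.out.println("Stagnation. Current best: " + best);
                    System.out.print("Enter an improved solution (blank keeps it): ");
                    String edited = scanner.nextLine();
                    if (!edited.isBlank()) {
                        best = edited; // tester's improvement fed back to the search
                        bestFitness = fitness.applyAsDouble(best);
                    }
                    stagnantGenerations = 0;
                }
            }
            return best;
        }
    }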

    Does Automated Unit Test Generation Really Help Software Testers? A Controlled Empirical Study

    Work on automated test generation has produced several tools capable of generating test data that achieves high structural coverage of a program. In the absence of a specification, developers are expected to manually construct or verify the test oracle for each test input. Nevertheless, it is assumed that these generated tests ease the developer's testing task, since testing is reduced to checking the results of tests. While this assumption has persisted for decades, there has been no conclusive evidence to date confirming it. However, the limited adoption in industry indicates that this assumption may not be correct, and calls into question the practical value of test generation tools. To investigate this issue, we performed two controlled experiments comparing a total of 97 subjects split between writing tests manually and writing tests with the aid of an automated unit test generation tool, EvoSuite. We found that, on the one hand, tool support leads to clear improvements in commonly applied quality metrics such as code coverage (up to 300% increase). On the other hand, however, there was no measurable improvement in the number of bugs actually found by developers. Our results not only cast some doubt on how the research community evaluates test generation tools, but also point to improvements and future work necessary before automated test generation tools will be widely adopted by practitioners.