13 research outputs found

    JUGE: An Infrastructure for Benchmarking Java Unit Test Generators

    Full text link
    Researchers and practitioners have designed and implemented various automated test case generators to support effective software testing. Such generators exist for various languages (e.g., Java, C#, or Python) and for various platforms (e.g., desktop, web, or mobile applications). Such generators exhibit varying effectiveness and efficiency, depending on the testing goals they aim to satisfy (e.g., unit-testing of libraries vs. system-testing of entire applications) and the underlying techniques they implement. In this context, practitioners need to be able to compare different generators to identify the most suited one for their requirements, while researchers seek to identify future research directions. This can be achieved through the systematic execution of large-scale evaluations of different generators. However, the execution of such empirical evaluations is not trivial and requires a substantial effort to collect benchmarks, setup the evaluation infrastructure, and collect and analyse the results. In this paper, we present our JUnit Generation benchmarking infrastructure (JUGE) supporting generators (e.g., search-based, random-based, symbolic execution, etc.) seeking to automate the production of unit tests for various purposes (e.g., validation, regression testing, fault localization, etc.). The primary goal is to reduce the overall effort, ease the comparison of several generators, and enhance the knowledge transfer between academia and industry by standardizing the evaluation and comparison process. Since 2013, eight editions of a unit testing tool competition, co-located with the Search-Based Software Testing Workshop, have taken place and used and updated JUGE. As a result, an increasing amount of tools (over ten) from both academia and industry have been evaluated on JUGE, matured over the years, and allowed the identification of future research directions

    Java Unit Testing Tool Competition — Fifth Round

    Get PDF
    After four successful JUnit tool competitions, we report on the achievements of a new Java Unit Testing Tool Competition. This 5th contest introduces statistical analyses in the benchmark infrastructure and has been validated with significance against the results of the previous 4th edition. Overall, the competition evaluates four automated JUnit testing tools taking as baseline human written test cases from real projects. The paper details the modifications performed to the methodology and provides full results of the competition

    Test smells 20 years later : detectability, validity, and reliability

    Get PDF
    Erworben im Rahmen der Schweizer Nationallizenzen (http://www.nationallizenzen.ch)Test smells aim to capture design issues in test code that reduces its maintainability. These have been extensively studied and generally found quite prevalent in both human-written and automatically generated test-cases. However, most evidence of prevalence is based on specific static detection rules. Although those are based on the original, conceptual definitions of the various test smells, recent empirical studies indicate that developers perceive warnings raised by detection tools as overly strict and non-representative of the maintainability and quality of test suites. This leads us to re-assess test smell detection tools’ detection accuracy and investigate the prevalence and detectability of test smells more broadly. Specifically, we construct a hand-annotated dataset spanning hundreds of test suites both written by developers and generated by two test generation tools (EvoSuite and JTExpert) and performed a multistage, cross-validated manual analysis to identify the presence of six types of test smells in these. We then use this manual labeling to benchmark the performance and external validity of two test smell detection tools – one widely used in prior work and one recently introduced with the express goal to match developer perceptions of test smells. Our results primarily show that the current vocabulary of test smells is highly mismatched to real concerns: multiple smells were ubiquitous on developer-written tests but virtually never correlated with semantic or maintainability flaws; machine-generated tests actually often scored better, but in reality, suffered from a host of problems not wellcaptured by current test smells. Current test smell detection strategies poorly characterized the issues in these automatically generated test suites; in particular, the older tool’s detection strategies misclassified over 70% of test smells, both missing real instances (false negatives) and marking many smell-free tests as smelly (false positives). We identify common patterns in these tests that can be used to improve the tools, refine and update the definition of certain test smells, and highlight as of yet uncharacterized issues. Our findings suggest the need for (i) more appropriate metrics to match development practice, (ii) more accurate detection strategies to be evaluated primarily in industrial contexts

    Private API Access and Functional Mocking in Automated Unit Test Generation

    Get PDF
    Not all object oriented code is easily testable: Dependency objects might be difficult or even impossible to instantiate, and object-oriented encapsulation makes testing potentially simple code difficult if it cannot easily be accessed. When this happens, then developers can resort to mock objects that simulate the complex dependencies, or circumvent object-oriented encapsulation and access private APIs directly through the use of, for example, Java reflection. Can automated unit test generation benefit from these techniques as well? In this paper we investigate this question by extending the EvoSuite unit test generation tool with the ability to directly access private APIs and to create mock objects using the popular Mockito framework. However, care needs to be taken that this does not impact the usefulness of the generated tests: For example, a test accessing a private field could later fail if that field is renamed, even if that renaming is part of a semantics-preserving refactoring. Such a failure would not be revealing a true regression bug, but is a false positive, which wastes the developer's time for investigating and fixing the test. Our experiments on the SF110 and Defects4J benchmarks confirm the anticipated improvements in terms of code coverage and bug finding, but also confirm the existence of false positives. However, by ensuring the test generator only uses mocking and reflection if there is no other way to reach some part of the code, their number remains small

    Automatic generation of smell-free unit tests

    Get PDF
    Tese de mestrado, Engenharia Informática, 2022, Universidade de Lisboa, Faculdade de CiênciasAutomated test generation tools (such as EvoSuite) typically aim to maximize code coverage. However, they frequently disregard non-coverage aspects that can be relevant for testers, such as the quality of the generated tests. Therefore, automatically generated tests are often affected by a set of test-specific bad programming practices that may hinder the quality of both test and production code, i.e., test smells. Given that other researchers have successfully integrated non-coverage quality metrics into EvoSuite, we decided to extend the EvoSuite tool such that the generated test code is smell-free. To this aim, we compiled 54 test smells from several sources and selected 16 smells that are relevant to the context of this work. We then augmented the tool with the respective test smell metrics and investigated the diffusion of the selected smells and the distribution of the metrics. Finally, we implemented an approach to optimize the test smell metrics as secondary criteria. After establishing the optimal configuration to optimize as secondary criteria (which we used throughout the remainder of the study), we conducted an empirical study to assess whether the tests became significantly less smelly. Furthermore, we studied how the proposed metrics affect the fault detection effectiveness, coverage, and size of the generated tests. Our study revealed that the proposed approach reduces the overall smelliness of the generated tests; in particular, the diffusion of the “Indirect Testing” and “Unrelated Assertions” smells improved considerably. Moreover, our approach improved the smelliness of the tests generated by EvoSuite without compromising the code coverage or fault detection effectiveness. The size and length of the generated tests were also not affected by the new secondary criteria

    Revisiting test smells in automatically generated tests : limitations, pitfalls, and opportunities

    Get PDF
    ​© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Test smells attempt to capture design issues in test code that reduce their maintainability. Previous work found such smells to be highly common in automatically generated test-cases, but based this result on specific static detection rules; although these are based on the original definition of "test smells", a recent empirical study showed that developers perceive these as overly strict and non-representative of the maintainability and quality of test suites. This leads us to investigate how effective such test smell detection tools are on automatically generated test suites. In this paper, we build a dataset of 2,340 test cases automatically generated by EVOSUITE for 100 Java classes. We performed a multi-stage, cross-validated manual analysis to identify six types of test smells and label their instances. We benchmark the performance of two test smell detection tools: one widely used in prior work, and one recently introduced with the express goal to match developer perceptions of test smells. Our results show that these test smell detection strategies poorly characterized the issues in automatically generated test suites; the older tool’s detection strategies, especially, misclassified over 70% of test smells, both missing real instances (false negatives) and marking many smell-free tests as smelly (false positives). We identify common patterns in these tests that can be used to improve the tools, refine and update the definition of certain test smells, and highlight as of yet uncharacterized issues. Our findings suggest the need for (i) more appropriate metrics to match development practice; and (ii) more accurate detection strategies, to be evaluated primarily in industrial contexts

    Improving Readability in Automatic Unit Test Generation

    Get PDF
    In object-oriented programming, quality assurance is commonly provided through writing unit tests, to exercise the operations of each class. If unit tests are created and maintained manually, this can be a time-consuming and laborious task. For this reason, automatic methods are often used to generate tests that seek to cover all paths of the tested code. Search may be guided by criteria that are opaque to the programmer, resulting in test sequences that are long and confusing. This has a negative impact on test maintenance. Once tests have been created, the job is not done: programmers need to reason about the tests throughout the lifecycle, as the tested software units evolve. Maintenance includes diagnosing failing tests (whether due to a software fault or an invalid test) and preserving test oracles (ensuring that checked assertions are still relevant). Programmers also need to understand the tests created for code that they did not write themselves, in order to understand the intent of that code. If generated tests cannot be easily understood, then they will be extremely difficult to maintain. The overall objective of this thesis is to reaffirm the importance of unit test maintenance; and to offer novel techniques to improve the readability of automatically generated tests. The first contribution is an empirical survey of 225 developers from different parts of the world, who were asked to give their opinions about unit testing practices and problems. The survey responses confirm that unit testing is considered important; and that there is an appetite for higher-quality automated test generation, with a view to test maintenance. The second contribution is a domain-specific model of unit test readability, based on human judgements. The model is used to augment automated unit test generation to produce test suites with both high coverage and improved readability. In evaluations, 30 programmers preferred our improved tests and were able to answer maintenance questions 14level of accuracy. The third contribution is a novel algorithm for generating descriptive test names that summarise API- level coverage goals. Test optimisation ensures that each test is short, bears a clear relation to the covered code, and can be readily identified by programmers. In evaluations, 47 programmers agreed with the choice of synthesised names and that these were as descriptive as manually chosen names. Participants were also more accurate and faster at matching generated tests against the tested code, compared to matching with manually-chosen test names

    Doctor of Philosophy

    Get PDF
    dissertationIn computer science, functional software testing is a method of ensuring that software gives expected output on specific inputs. Software testing is conducted to ensure desired levels of quality in light of uncertainty resulting from the complexity of software. Most of today's software is written by people and software development is a creative activity. However, due to the complexity of computer systems and software development processes, this activity leads to a mismatch between the expected software functionality and the implemented one. If not addressed in a timely and proper manner, this mismatch can cause serious consequences to users of the software, such as security and privacy breaches, financial loss, and adversarial human health issues. Because of manual effort, software testing is costly. Software testing that is performed without human intervention is automatic software testing and it is one way of addressing the issue. In this work, we build upon and extend several techniques for automatic software testing. The techniques do not require any guidance from the user. Goals that are achieved with the techniques are checking for yet unknown errors, automatically testing object-oriented software, and detecting malicious software. To meet these goals, we explored several techniques and related challenges: automatic test case generation, runtime verification, dynamic symbolic execution, and the type and size of test inputs for efficient detection of malicious software via machine learning. Our work targets software written in the Java programming language, though the techniques are general and applicable to other languages. We performed an extensive evaluation on freely available Java software projects, a flight collision avoidance system, and thousands of applications for the Android operating system. Evaluation results show to what extent dynamic symbolic execution is applicable in testing object-oriented software, they show correctness of the flight system on millions of automatically customized and generated test cases, and they show that simple and relatively small inputs in random testing can lead to effective malicious software detection

    Exploring means to facilitate software debugging

    Get PDF
    In this thesis, several aspects of software debugging from automated crash reproduction to bug report analysis and use of contracts have been studied.Algorithms and the Foundations of Software technolog

    Search-based Unit Test Generation for Evolving Software

    Get PDF
    Search-based software testing has been successfully applied to generate unit test cases for object-oriented software. Typically, in search-based test generation approaches, evolutionary search algorithms are guided by code coverage criteria such as branch coverage to generate tests for individual coverage objectives. Although it has been shown that this approach can be effective, there remain fundamental open questions. In particular, which criteria should test generation use in order to produce the best test suites? Which evolutionary algorithms are more effective at generating test cases with high coverage? How to scale up search-based unit test generation to software projects consisting of large numbers of components, evolving and changing frequently over time? As a result, the applicability of search-based test generation techniques in practice is still fundamentally limited. In order to answer these fundamental questions, we investigate the following improvements to search-based testing. First, we propose the simultaneous optimisation of several coverage criteria at the same time using an evolutionary algorithm, rather than optimising for individual criteria. We then perform an empirical evaluation of different evolutionary algorithms to understand the influence of each one on the test optimisation problem. We then extend a coverage-based test generation with a non-functional criterion to increase the likelihood of detecting faults as well as helping developers to identify the locations of the faults. Finally, we propose several strategies and tools to efficiently apply search-based test generation techniques in large and evolving software projects. Our results show that, overall, the optimisation of several coverage criteria is efficient, there is indeed an evolutionary algorithm that clearly works better for test generation problem than others, the extended coverage-based test generation is effective at revealing and localising faults, and our proposed strategies, specifically designed to test entire software projects in a continuous way, improve efficiency and lead to higher code coverage. Consequently, the techniques and toolset presented in this thesis - which provides support to all contributions here described - brings search-based software testing one step closer to practical usage, by equipping software engineers with the state of the art in automated test generation
    corecore