
    JUGE: An Infrastructure for Benchmarking Java Unit Test Generators

    Researchers and practitioners have designed and implemented various automated test case generators to support effective software testing. Such generators exist for various languages (e.g., Java, C#, or Python) and for various platforms (e.g., desktop, web, or mobile applications). Such generators exhibit varying effectiveness and efficiency, depending on the testing goals they aim to satisfy (e.g., unit testing of libraries vs. system testing of entire applications) and the underlying techniques they implement. In this context, practitioners need to be able to compare different generators to identify the one best suited to their requirements, while researchers seek to identify future research directions. This can be achieved through the systematic execution of large-scale evaluations of different generators. However, the execution of such empirical evaluations is not trivial and requires substantial effort to collect benchmarks, set up the evaluation infrastructure, and collect and analyse the results. In this paper, we present our JUnit Generation benchmarking infrastructure (JUGE), which supports generators (e.g., search-based, random-based, symbolic execution, etc.) seeking to automate the production of unit tests for various purposes (e.g., validation, regression testing, fault localization, etc.). The primary goal is to reduce the overall effort, ease the comparison of several generators, and enhance knowledge transfer between academia and industry by standardizing the evaluation and comparison process. Since 2013, eight editions of a unit testing tool competition, co-located with the Search-Based Software Testing Workshop, have taken place, using and updating JUGE. As a result, an increasing number of tools (over ten) from both academia and industry have been evaluated on JUGE, matured over the years, and allowed the identification of future research directions.

    Recovering fitness gradients for interprocedural Boolean flags in search-based testing

    National Research Foundation (NRF) Singapore under Corp Lab @ University scheme; National Research Foundation (NRF) Singapore under its NSoE Programme
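
    The entry above lists only the funder; the title refers to the long-standing flag problem in search-based testing: when a branch condition tests a Boolean returned by another method, branch distance collapses to true/false and offers the search no gradient. A minimal, hypothetical Java sketch of that situation (class and method names are illustrative, not taken from the paper):

        // Hypothetical illustration of an interprocedural Boolean flag.
        // The helper hides the informative numeric distance |x - 42| behind a
        // single flag, so branch distance at the call site is only 0 or 1 and
        // gives a search-based generator no gradient toward the target branch.
        public class FlagExample {

            static boolean isMagic(int x) {   // interprocedural flag source
                return x == 42;
            }

            static int methodUnderTest(int x) {
                if (isMagic(x)) {             // flat fitness landscape here
                    return 1;                 // hard-to-reach target branch
                }
                return 0;
            }
        }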

    Ant colony optimization for object-oriented unit test generation

    Generating useful unit tests for object-oriented programs is difficult for traditional optimization methods. One not only needs to identify values to be used as inputs, but also to synthesize a program which creates the required state in the program under test. Many existing Automated Test Generation (ATG) approaches combine search with performance-enhancing heuristics. We present Tiered Ant Colony Optimization (Taco) for generating unit tests for object-oriented programs. The algorithm is formed of three tiers of ACO, each of which tackles a distinct task: goal prioritization, test program synthesis, and data generation for the synthesised program. Test program synthesis allows the creation of complex objects and exploration of program state, which is the breakthrough that has allowed the successful application of ACO to object-oriented test generation. Taco brings the mature search ecosystem of ACO to bear on ATG for complex object-oriented programs, providing a viable alternative to current approaches. To demonstrate the effectiveness of Taco, we have developed a proof-of-concept tool which successfully generated tests for an average of 54% of the methods in 170 Java classes, a result competitive with the industry-standard Randoop.
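
    The abstract above outlines a three-tier ACO but gives no implementation detail. Below is a rough, assumption-laden sketch of the generic ACO mechanics it builds on (probabilistic choice weighted by pheromone, then evaporation and deposit proportional to how well a candidate scored); it is not Taco's actual design, and all names are illustrative.

        import java.util.*;

        // Generic pheromone-guided choice and update, as used in ant colony
        // optimisation; the mapping of "actions" to test-program construction
        // steps is an assumption, not Taco's published design.
        public class AcoSketch {
            static final double EVAPORATION = 0.1;
            static final Map<String, Double> pheromone = new HashMap<>();
            static final Random rnd = new Random();

            // Pick an action (e.g., which method call to append next) with
            // probability proportional to its pheromone level.
            static String pickAction(List<String> actions) {
                double total = 0;
                for (String a : actions) total += pheromone.getOrDefault(a, 1.0);
                double r = rnd.nextDouble() * total;
                for (String a : actions) {
                    r -= pheromone.getOrDefault(a, 1.0);
                    if (r <= 0) return a;
                }
                return actions.get(actions.size() - 1);
            }

            // Evaporate old pheromone and deposit new pheromone proportional
            // to the reward (e.g., coverage goals reached by the built test).
            static void reinforce(String action, double reward) {
                double old = pheromone.getOrDefault(action, 1.0);
                pheromone.put(action, (1 - EVAPORATION) * old + reward);
            }
        }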

    Coverage Goal Selector for Combining Multiple Criteria in Search-Based Unit Test Generation

    Unit testing is critical to the software development process, ensuring the correctness of basic programming units in a program (e.g., a method). Search-based software testing (SBST) is an automated approach to generating test cases. SBST generates test cases with genetic algorithms by specifying a coverage criterion (e.g., branch coverage). However, a good test suite must have different properties, which cannot be captured using an individual coverage criterion. Therefore, the state-of-the-art approach combines multiple criteria to generate test cases. Since combining multiple coverage criteria brings multiple objectives for optimization, it hurts the test suites' coverage for certain criteria compared with using a single criterion. To cope with this problem, we propose a novel approach named smart selection. Based on the coverage correlations among criteria and the subsumption relationships among coverage goals, smart selection selects a subset of coverage goals to reduce the number of optimization objectives while avoiding missing any properties of all criteria. We conduct experiments to evaluate smart selection on 400 Java classes with three state-of-the-art genetic algorithms under a 2-minute budget. On average, smart selection outperforms combining all goals on 65.1% of the classes having significant differences between the two approaches. Secondly, we conduct experiments to verify our assumptions about coverage criteria relationships. Furthermore, we experiment with different budgets of 5, 8, and 10 minutes, confirming the advantage of smart selection over combining all goals.
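
    The abstract describes pruning optimization objectives using subsumption relationships among coverage goals. As a loose sketch of that idea only (the actual correlations and subsumption relations used by smart selection are not given in the abstract), one could keep just the goals that no other goal strictly subsumes:

        import java.util.*;

        // Illustrative goal filtering: drop any goal that is strictly subsumed
        // by another goal, so fewer objectives are optimized without losing the
        // properties the dropped goals represent. The subsumption relation is a
        // placeholder, not the paper's concrete criteria relationships.
        public class GoalSelectionSketch {

            interface Goal {
                // true if covering this goal necessarily also covers 'other'
                boolean subsumes(Goal other);
            }

            static List<Goal> selectGoals(List<Goal> allGoals) {
                List<Goal> selected = new ArrayList<>();
                for (Goal candidate : allGoals) {
                    boolean strictlySubsumed = false;
                    for (Goal g : allGoals) {
                        if (g != candidate && g.subsumes(candidate) && !candidate.subsumes(g)) {
                            strictlySubsumed = true;
                            break;
                        }
                    }
                    if (!strictlySubsumed) selected.add(candidate);
                }
                return selected;
            }
        }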

    .NET/C# instrumentation for search-based software testing

    C# is one of the most widely used programming languages. However, to the best of our knowledge, there has been no work in the literature aimed at enabling search-based software testing techniques for applications running on the .NET platform, like the ones written in C#. In this paper, we propose a search-based approach and an open-source tool to enable white-box testing for C# applications. The approach is integrated with .NET bytecode instrumentation in order to collect code coverage at runtime during the search. In addition, by taking advantage of branch distance, we define heuristics to better guide the search, e.g., estimating how close the execution is to covering a branch in the source code. To empirically evaluate our technique, we integrated our tool into the EvoMaster test generation tool and conducted experiments on three .NET RESTful APIs as case studies. Results show that our technique significantly outperforms gray-box testing tools in terms of code coverage.
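
    The branch distance heuristic mentioned above is a standard search-based testing measure: for a relational predicate, it estimates how far the operand values are from making the desired branch outcome true. A minimal sketch of the conventional formulation (shown in Java for consistency with the rest of this listing, although the paper instruments .NET bytecode; the constant K and the normalisation are the usual textbook choices, not necessarily the paper's):

        // Conventional branch-distance functions for relational predicates.
        public class BranchDistanceSketch {
            static final double K = 1.0;

            // Distance to making "a == b" true: zero when already satisfied.
            static double eq(double a, double b) {
                return a == b ? 0.0 : Math.abs(a - b) + K;
            }

            // Distance to making "a < b" true.
            static double lt(double a, double b) {
                return a < b ? 0.0 : (a - b) + K;
            }

            // Normalise into [0, 1) so distances of different branches compare.
            static double normalise(double d) {
                return d / (d + 1.0);
            }
        }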

    INTEREVO-TR: Interactive Evolutionary Test Generation with Readability Assessment

    Automated test case generation has proven to be useful to reduce the usually high expenses of software testing. However, several studies have also noted the skepticism of testers regarding the comprehension of generated test suites when compared to manually designed ones. This fact suggests that involving testers in the test generation process could be helpful to increase their acceptance of automatically produced test suites. In this paper, we propose incorporating interactive readability assessments made by a tester into EvoSuite, a widely known evolutionary test generation tool. Our approach, InterEvo-TR, interacts with the tester at different moments during the search and shows different test cases covering the same coverage target for their subjective evaluation. The design of such an interactive approach involves a schedule of interaction, a method to diversify the selected targets, a plan to save and handle the readability values, and mechanisms to customize the level of engagement in the revision, among other aspects. To analyze the potential and practicability of our proposal, we conduct a controlled experiment in which 39 participants, including academics, professional developers, and student collaborators, interact with InterEvo-TR. Our results show that the strategy to select and present intermediate results is effective for the purpose of readability assessment. Furthermore, the participants' actions and responses to a questionnaire allowed us to analyze the aspects influencing test code readability, as well as the benefits and limitations of an interactive approach in the context of test case generation, paving the way for future developments based on interactivity.
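
    The abstract sketches the ingredients of the interactive loop (an interaction schedule, diversified targets, stored readability values). A minimal, hypothetical sketch of such a loop, with all names and the scheduling rule invented for illustration rather than taken from InterEvo-TR:

        import java.util.*;

        // On a fixed schedule, show the tester alternative tests that cover the
        // same target and record the readability score they assign; the stored
        // scores can later serve as a secondary objective next to coverage.
        public class InteractiveReadabilitySketch {

            interface Tester {
                int rate(String testCode);   // e.g., 1 (unreadable) .. 5 (clear)
            }

            static final Map<String, Integer> readabilityScores = new HashMap<>();

            static void maybeAskTester(int generation, int interval,
                                       List<String> candidatesForSameTarget,
                                       Tester tester) {
                if (generation % interval != 0) return;   // interaction schedule
                for (String test : candidatesForSameTarget) {
                    readabilityScores.put(test, tester.rate(test));
                }
            }
        }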

    Java unit testing tool competition

    We report on the results of the eighth edition of the Java unit testing tool competition. This year, two tools, EvoSuite and Randoop, were executed on a benchmark with (i) new classes under test, selected from open-source software projects, and (ii) the set of classes from one project considered in the previous edition. We relied on an updated infrastructure, based on Docker containers, for the execution of the different tools and the subsequent coverage and mutation analysis. We considered two different time budgets for test case generation: one and three minutes. This paper describes our methodology and statistical analysis of the results, presents the results achieved by the contestant tools, and highlights the challenges we faced during the competition.

    Automatic generation of smell-free unit tests

    Master's thesis, Informatics Engineering, 2022, Universidade de Lisboa, Faculdade de Ciências. Automated test generation tools (such as EvoSuite) typically aim to maximize code coverage. However, they frequently disregard non-coverage aspects that can be relevant for testers, such as the quality of the generated tests. Therefore, automatically generated tests are often affected by a set of test-specific bad programming practices that may hinder the quality of both test and production code, i.e., test smells. Given that other researchers have successfully integrated non-coverage quality metrics into EvoSuite, we decided to extend the EvoSuite tool such that the generated test code is smell-free. To this aim, we compiled 54 test smells from several sources and selected 16 smells that are relevant to the context of this work. We then augmented the tool with the respective test smell metrics and investigated the diffusion of the selected smells and the distribution of the metrics. Finally, we implemented an approach to optimize the test smell metrics as secondary criteria. After establishing the optimal configuration for the secondary criteria (which we used throughout the remainder of the study), we conducted an empirical study to assess whether the tests became significantly less smelly. Furthermore, we studied how the proposed metrics affect the fault detection effectiveness, coverage, and size of the generated tests. Our study revealed that the proposed approach reduces the overall smelliness of the generated tests; in particular, the diffusion of the “Indirect Testing” and “Unrelated Assertions” smells improved considerably. Moreover, our approach improved the smelliness of the tests generated by EvoSuite without compromising code coverage or fault detection effectiveness. The size and length of the generated tests were also not affected by the new secondary criteria.
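
    The thesis above optimizes test smell metrics as secondary criteria, i.e., as tie-breakers applied after the primary coverage objective. A minimal sketch of that general idea (the smell score below is a placeholder; the actual 16 metrics and their integration into EvoSuite are described in the thesis, not here):

        import java.util.Comparator;

        // Secondary-criterion comparator: prefer higher coverage first, and
        // among equally covering tests prefer the lower test-smell score.
        public class SmellAwareComparator implements Comparator<TestCandidate> {
            @Override
            public int compare(TestCandidate a, TestCandidate b) {
                int byCoverage = Double.compare(b.coverage(), a.coverage());
                if (byCoverage != 0) return byCoverage;                  // primary objective
                return Double.compare(a.smellScore(), b.smellScore());   // secondary objective
            }
        }

        interface TestCandidate {
            double coverage();
            double smellScore();   // e.g., weighted count of detected smells
        }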

    SBST tool competition 2021

    We report on the organization, challenges, and results of the ninth edition of the Java Unit Testing Competition as well as the first edition of the Cyber-Physical Systems Testing Tool Competition. Java Unit Testing Competition: this year, five tools, Randoop, UtBot, Kex, Evosuite, and EvosuiteDSE, were executed on a benchmark with (i) new classes under test, selected from three open-source software projects, and (ii) the set of classes from three projects considered in the eighth edition. We relied on an improved Docker infrastructure to execute the tools and the subsequent coverage and mutation analysis. Given the high number of participants, we considered only two time budgets for test case generation: thirty seconds and two minutes. Cyber-Physical Systems Testing Tool Competition: five tools, Deeper, Frenetic, GABExplore, GABExploit, and Swat, competed on testing self-driving car software by generating simulation-based tests using our new testing infrastructure. We considered two experimental settings to study test generators' transitory and asymptotic behaviors and evaluated the tools' test generation effectiveness and the diversity of the exposed failures. This paper describes our methodology, the statistical analysis of the results together with the contestant tools, and the challenges faced while running the competition experiments.

    Test smells 20 years later: detectability, validity, and reliability

    Test smells aim to capture design issues in test code that reduce its maintainability. These have been extensively studied and are generally found to be quite prevalent in both human-written and automatically generated test cases. However, most evidence of prevalence is based on specific static detection rules. Although those are based on the original, conceptual definitions of the various test smells, recent empirical studies indicate that developers perceive warnings raised by detection tools as overly strict and non-representative of the maintainability and quality of test suites. This leads us to re-assess the detection accuracy of test smell detection tools and to investigate the prevalence and detectability of test smells more broadly. Specifically, we construct a hand-annotated dataset spanning hundreds of test suites, both written by developers and generated by two test generation tools (EvoSuite and JTExpert), and perform a multi-stage, cross-validated manual analysis to identify the presence of six types of test smells in these. We then use this manual labeling to benchmark the performance and external validity of two test smell detection tools: one widely used in prior work and one recently introduced with the express goal of matching developer perceptions of test smells. Our results primarily show that the current vocabulary of test smells is highly mismatched to real concerns: multiple smells were ubiquitous in developer-written tests but virtually never correlated with semantic or maintainability flaws; machine-generated tests often scored better, but in reality suffered from a host of problems not well captured by current test smells. Current test smell detection strategies poorly characterized the issues in these automatically generated test suites; in particular, the older tool's detection strategies misclassified over 70% of test smells, both missing real instances (false negatives) and marking many smell-free tests as smelly (false positives). We identify common patterns in these tests that can be used to improve the tools, refine and update the definition of certain test smells, and highlight as yet uncharacterized issues. Our findings suggest the need for (i) more appropriate metrics to match development practice and (ii) more accurate detection strategies, to be evaluated primarily in industrial contexts.
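
    The mismatch reported above stems largely from coarse, purely syntactic detection rules. As a hedged illustration of the kind of rule in question (not a rule taken from either benchmarked tool), a naive "Assertion Roulette" check might simply count assertions that lack an explanatory message:

        // Naive, purely syntactic smell rule of the kind the study critiques:
        // flag a test as "Assertion Roulette" when several assertions have no
        // message string. Real detectors work on the AST; this coarse threshold
        // shows why such rules can mark harmless tests as smelly.
        public class NaiveSmellRule {
            static boolean looksLikeAssertionRoulette(String testSource) {
                int unexplainedAsserts = 0;
                for (String line : testSource.split("\n")) {
                    String trimmed = line.trim();
                    if (trimmed.startsWith("assert") && !trimmed.contains("\"")) {
                        unexplainedAsserts++;   // assertion with no message text
                    }
                }
                return unexplainedAsserts > 1;  // arbitrary threshold
            }
        }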