31 research outputs found

    JUGE: An Infrastructure for Benchmarking Java Unit Test Generators

    Full text link
    Researchers and practitioners have designed and implemented various automated test case generators to support effective software testing. Such generators exist for various languages (e.g., Java, C#, or Python) and for various platforms (e.g., desktop, web, or mobile applications). Such generators exhibit varying effectiveness and efficiency, depending on the testing goals they aim to satisfy (e.g., unit-testing of libraries vs. system-testing of entire applications) and the underlying techniques they implement. In this context, practitioners need to be able to compare different generators to identify the most suited one for their requirements, while researchers seek to identify future research directions. This can be achieved through the systematic execution of large-scale evaluations of different generators. However, the execution of such empirical evaluations is not trivial and requires a substantial effort to collect benchmarks, setup the evaluation infrastructure, and collect and analyse the results. In this paper, we present our JUnit Generation benchmarking infrastructure (JUGE) supporting generators (e.g., search-based, random-based, symbolic execution, etc.) seeking to automate the production of unit tests for various purposes (e.g., validation, regression testing, fault localization, etc.). The primary goal is to reduce the overall effort, ease the comparison of several generators, and enhance the knowledge transfer between academia and industry by standardizing the evaluation and comparison process. Since 2013, eight editions of a unit testing tool competition, co-located with the Search-Based Software Testing Workshop, have taken place and used and updated JUGE. As a result, an increasing amount of tools (over ten) from both academia and industry have been evaluated on JUGE, matured over the years, and allowed the identification of future research directions

    Basic block coverage for search-based unit testing and crash reproduction

    Get PDF
    Search-based techniques have been widely used for white-box test generation. Many of these approaches rely on the approach level and branch distance heuristics to guide the search process and generate test cases with high line and branch coverage. Despite the positive results achieved by these two heuristics, they only use the information related to the coverage of explicit branches (e.g., indicated by conditional and loop statements), but ignore potential implicit branchings within basic blocks of code. If such implicit branching happens at runtime (e.g., if an exception is thrown in a branchless-method), the existing fitness functions cannot guide the search process. To address this issue, we introduce a new secondary objective, called Basic Block Coverage (BBC), which takes into account the coverage level of relevant basic blocks in the control flow graph. We evaluated the impact of BBC on search-based unit test generation (using the DynaMOSA algorithm) and search-based crash reproduction (using the STDistance and WeightedSum fitness functions). Our results show that for unit test generation, BBC improves the branch coverage of the generated tests. Although small (∼ 1.5%), this improvement in the branch coverage is systematic and leads to an increase of the output domain coverage and implicit runtime exception coverage, and of the diversity of runtime states. In terms of crash reproduction, in the combination of STDistance and WeightedSum, BBC helps in reproducing 3 new crashes for each fitness function. BBC significantly decreases the time required to reproduce 43.5% and 45.1% of the crashes using STDistance and WeightedSum, respectively. For these crashes, BBC reduces the consumed time by 71.7% (for STDistance) and 68.7% (for WeightedSum) on average.Software Engineerin

    KD-ART: Should we intensify or diversify tests to kill mutants?

    Get PDF
    CONTEXT: Adaptive Random Testing (ART) spreads test cases evenly over the input domain. Yet once a fault is found, decisions must be made to diversify or intensify subsequent inputs. Diversification employs a wide range of tests to increase the chances of finding new faults. Intensification selects test inputs similar to those previously shown to be successful. OBJECTIVE: Explore the trade-off between diversification and intensification to kill mutants. METHOD: We augment Adaptive Random Testing (ART) to estimate the Kernel Density (KD–ART) of input values found to kill mutants. KD–ART was first proposed at the 10th International Workshop on Mutation Analysis. We now extend this work to handle real world non numeric applications. Specifically we incorporate a technique to support programs with input parameters that have composite data types (such as arrays and structs). RESULTS: Intensification is the most effective strategy for the numerical programs (it achieves 8.5% higher mutation score than ART). By contrast, diversification seems more effective for programs with composite inputs. KD–ART kills mutants 15.4 times faster than ART. CONCLUSION: Intensify tests for numerical types, but diversify them for composite types

    Doctor of Philosophy

    Get PDF
    dissertationIn computer science, functional software testing is a method of ensuring that software gives expected output on specific inputs. Software testing is conducted to ensure desired levels of quality in light of uncertainty resulting from the complexity of software. Most of today's software is written by people and software development is a creative activity. However, due to the complexity of computer systems and software development processes, this activity leads to a mismatch between the expected software functionality and the implemented one. If not addressed in a timely and proper manner, this mismatch can cause serious consequences to users of the software, such as security and privacy breaches, financial loss, and adversarial human health issues. Because of manual effort, software testing is costly. Software testing that is performed without human intervention is automatic software testing and it is one way of addressing the issue. In this work, we build upon and extend several techniques for automatic software testing. The techniques do not require any guidance from the user. Goals that are achieved with the techniques are checking for yet unknown errors, automatically testing object-oriented software, and detecting malicious software. To meet these goals, we explored several techniques and related challenges: automatic test case generation, runtime verification, dynamic symbolic execution, and the type and size of test inputs for efficient detection of malicious software via machine learning. Our work targets software written in the Java programming language, though the techniques are general and applicable to other languages. We performed an extensive evaluation on freely available Java software projects, a flight collision avoidance system, and thousands of applications for the Android operating system. Evaluation results show to what extent dynamic symbolic execution is applicable in testing object-oriented software, they show correctness of the flight system on millions of automatically customized and generated test cases, and they show that simple and relatively small inputs in random testing can lead to effective malicious software detection

    Exploring means to facilitate software debugging

    Get PDF
    In this thesis, several aspects of software debugging from automated crash reproduction to bug report analysis and use of contracts have been studied.Algorithms and the Foundations of Software technolog

    Automated Software Testing of Relational Database Schemas

    Get PDF
    Relational databases are critical for many software systems, holding the most valuable data for organisations. Data engineers build relational databases using schemas to specify the structure of the data within a database and defining integrity constraints. These constraints protect the data's consistency and coherency, leading industry experts to recommend testing them. Since manual schema testing is labour-intensive and error-prone, automated techniques enable the generation of test data. Although these generators are well-established and effective, they use default values and often produce many, long, and similar tests --- this results in decreasing fault detection and increasing regression testing time and testers inspection efforts. It raises the following questions: How effective is the optimised random generator at generating tests and its fault detection compared to prior methods? What factors make tests understandable for testers? How to reduce tests while maintaining effectiveness? How effectively do testers inspect differently reduced tests? To answer these questions, the first contribution of this thesis is to evaluate a new optimised random generator against well-established methods empirically. Secondly, identifying understandability factors of schema tests using a human study. Thirdly, evaluating a novel approach that reduces and merge tests against traditional reduction methods. Finally, studying testers' inspection efforts with differently reduced tests using a human study. The results show that the optimised random method efficiently generates effective tests compared to well-established methods. Testers reported that many NULLs and negative numbers are confusing, and they prefer simple repetition of unimportant values and readable strings. The reduction technique with merging is the most effective at minimising tests and producing efficient tests while maintaining effectiveness compared to traditional methods. The merged tests showed an increase in inspection efficiency with a slight accuracy decrease compared to only reduced tests. Therefore, these techniques and investigations can help practitioners adopt these generators in practice

    Approximation-Refinement Testing of Compute-Intensive Cyber-Physical Models: An Approach Based on System Identification

    Get PDF
    Black-box testing has been extensively applied to test models of Cyber-Physical systems (CPS) since these models are not often amenable to static and symbolic testing and verification. Black-box testing, however, requires to execute the model under test for a large number of candidate test inputs. This poses a challenge for a large and practically-important category of CPS models, known as compute-intensive CPS (CI-CPS) models, where a single simulation may take hours to complete. We propose a novel approach, namely ARIsTEO, to enable effective and efficient testing of CI-CPS models. Our approach embeds black-box testing into an iterative approximation-refinement loop. At the start, some sampled inputs and outputs of the CI-CPS model under test are used to generate a surrogate model that is faster to execute and can be subjected to black-box testing. Any failure-revealing test identified for the surrogate model is checked on the original model. If spurious, the test results are used to refine the surrogate model to be tested again. Otherwise, the test reveals a valid failure. We evaluated ARIsTEO by comparing it with S-Taliro, an open-source and industry-strength tool for testing CPS models. Our results, obtained based on five publicly-available CPS models, show that, on average, ARIsTEO is able to find 24% more requirements violations than S-Taliro and is 31% faster than S-Taliro in finding those violations. We further assessed the effectiveness and efficiency of ARIsTEO on a large industrial case study from the satellite domain. In contrast to S-Taliro, ARIsTEO successfully tested two different versions of this model and could identify three requirements violations, requiring four hours, on average, for each violation

    An Investigation of Search Behaviour in Search-Based Unit Test Generation

    Get PDF
    As software testing is a laborious and error-prone task, automation is desirable. Search-based unit test generation applies evolutionary search algorithms to generate software tests and, in the context of unit testing object-oriented software, Genetic Algorithms (GAs) are frequently employed to generate unit tests that maximise code coverage. Although GAs are effective at generating tests that achieve high code coverage, they are still far from being able to satisfy all test goals (e.g., covering all branches). While some general limitations are known, there is still a lack of understanding of the search behaviour during the optimization, making it difficult to identify the factors that make a search problem difficult. Therefore, this thesis aims to investigate the search behaviour when GAs are applied to generate object-oriented unit tests and, more specifically, identify the reasons why the search fails to achieve the desired test goals. This is achieved by investigating (1) the fitness landscape structure and the impact of its features on the generation of unit tests and (2) the influence of population diversity on generating potential unit tests. Based on the outcome of this investigation, the impact of test case reduction on the landscape features and population diversity is also investigated. Our results reveal that classical indicators for rugged fitness landscapes suggest well searchable problems in the case of unit test generation, but the fitness landscape for most problem instances is dominated by detrimental plateaus. However, increasing diversity does not have a beneficial effect on coverage in general, but it may improve coverage when diversity is promoted adaptively. In fact, increasing diversity has a negative impact on the individual length, which can also be mitigated with the adaptive diversity. Applying the test case reduction seems to be promising in improving the landscape structure and reducing the negative side effects of diversity on length, but have no considerable impact on the search performance

    Improving Readability in Automatic Unit Test Generation

    Get PDF
    In object-oriented programming, quality assurance is commonly provided through writing unit tests, to exercise the operations of each class. If unit tests are created and maintained manually, this can be a time-consuming and laborious task. For this reason, automatic methods are often used to generate tests that seek to cover all paths of the tested code. Search may be guided by criteria that are opaque to the programmer, resulting in test sequences that are long and confusing. This has a negative impact on test maintenance. Once tests have been created, the job is not done: programmers need to reason about the tests throughout the lifecycle, as the tested software units evolve. Maintenance includes diagnosing failing tests (whether due to a software fault or an invalid test) and preserving test oracles (ensuring that checked assertions are still relevant). Programmers also need to understand the tests created for code that they did not write themselves, in order to understand the intent of that code. If generated tests cannot be easily understood, then they will be extremely difficult to maintain. The overall objective of this thesis is to reaffirm the importance of unit test maintenance; and to offer novel techniques to improve the readability of automatically generated tests. The first contribution is an empirical survey of 225 developers from different parts of the world, who were asked to give their opinions about unit testing practices and problems. The survey responses confirm that unit testing is considered important; and that there is an appetite for higher-quality automated test generation, with a view to test maintenance. The second contribution is a domain-specific model of unit test readability, based on human judgements. The model is used to augment automated unit test generation to produce test suites with both high coverage and improved readability. In evaluations, 30 programmers preferred our improved tests and were able to answer maintenance questions 14level of accuracy. The third contribution is a novel algorithm for generating descriptive test names that summarise API- level coverage goals. Test optimisation ensures that each test is short, bears a clear relation to the covered code, and can be readily identified by programmers. In evaluations, 47 programmers agreed with the choice of synthesised names and that these were as descriptive as manually chosen names. Participants were also more accurate and faster at matching generated tests against the tested code, compared to matching with manually-chosen test names
    corecore