63 research outputs found

    Automated Unit Testing of Evolving Software

    As software programs evolve, developers need to ensure that new changes do not affect the originally intended functionality of the program. To increase their confidence, developers commonly write unit tests along with the program, and execute them after a change is made. However, manually writing these unit tests is difficult and time-consuming, and as their number increases, so does the cost of executing and maintaining them. Automated test generation techniques have been proposed in the literature to assist developers in the endeavour of writing these tests. However, it remains an open question how well these tools can help with fault finding in practice, and maintaining these automatically generated tests may require extra effort compared to human-written ones. This thesis evaluates the effectiveness of a number of existing automatic unit test generation techniques at detecting real faults, and explores how these techniques can be improved. In particular, we present a novel multi-objective search-based approach for generating tests that reveal changes across two versions of a program. We then investigate whether these tests can be used such that no maintenance effort is necessary. Our results show that overall, state-of-the-art test generation tools can indeed be effective at detecting real faults: collectively, the tools revealed more than half of the bugs we studied. We also show that our proposed alternative technique, which is better suited to the problem of revealing changes, can detect more faults, and does so more frequently. However, we also find that for a majority of object-oriented programs, even a random search can achieve good results. Finally, we show that such change-revealing tests can be generated on demand in practice, without requiring them to be maintained over time.

    Improving regression testing efficiency and reliability via test-suite transformations

    As software becomes more important and ubiquitous, high-quality software also becomes crucial. Developers constantly make changes to improve software, and they rely on regression testing, the process of running tests after every change, to ensure that changes do not break existing functionality. Regression testing is widely used both in industry and in open source, but it suffers from two main challenges. (1) Regression testing is costly. Developers run a large number of tests in the test suite after every change, and changes happen very frequently. The cost is both in the time developers spend waiting for the tests to finish running so that they know whether the changes break existing functionality, and in the monetary cost of running the tests on machines. (2) Regression test suites contain flaky tests, which nondeterministically pass or fail when run on the same version of code, regardless of any changes. Flaky test failures can mislead developers into believing that their changes break existing functionality, even though those tests can fail without any changes. Developers will therefore waste time trying to debug nonexistent faults in their changes. This dissertation proposes three lines of work that address these challenges of regression testing through test-suite transformations that modify test suites to make them more efficient or more reliable. Specifically, two lines of work explore how to reduce the cost of regression testing and one line of work explores how to fix existing flaky tests. First, this dissertation investigates the effectiveness of test-suite reduction (TSR), a traditional test-suite transformation that removes tests deemed redundant with respect to other tests in the test suite based on heuristics. TSR outputs a smaller, reduced test suite to be run in the future. However, TSR risks removing tests that can potentially detect faults in future changes. 
While TSR was proposed over two decades ago, it was always evaluated using program versions with seeded faults. Such evaluations do not precisely predict the effectiveness of the reduced test suite on the future changes. This dissertation evaluates TSR in a real-world setting using real software evolution with real test failures. The results show that TSR techniques proposed in the past are not as effective as suggested by traditional TSR metrics, and those same metrics do not predict how effective a reduced test suite is in the future. Researchers need to either propose new TSR techniques that produce more effective reduced test suites or better metrics for predicting the effectiveness of reduced test suites. Second, this dissertation proposes a new transformation to improve regression testing cost when using a modern build system by optimizing the placement of tests, implemented in a technique called TestOptimizer. Modern build systems treat a software project as a group of inter-dependent modules, including test modules that contain only tests. As such, when developers make a change, the build system can use a developer-specified dependency graph among modules to determine which test modules are affected by any changed modules and to run only tests in the affected test modules. However, wasteful test executions are a problem when using build systems this way. Suboptimal placements of tests, where developers may place some tests in a module that has more dependencies than the test actually needs, lead to running more tests than necessary after a change. TestOptimizer analyzes a project and proposes moving tests to reduce the number of test executions that are triggered over time due to developer changes. Evaluation of TestOptimizer on five large proprietary projects at Microsoft shows that the suggested test movements can reduce 21.7 million test executions (17.1%) across all evaluation projects. 
Developers accepted and intend to implement 84.4% of the reported suggestions. Third, to make regression testing more reliable, this dissertation proposes iFixFlakies, a framework for fixing a prominent kind of flaky tests: order-dependent tests. Order-dependent tests pass or fail depending on the order in which the tests are run. Intuitively, order-dependent tests fail either because they need another test to set up the state for them to pass, or because some other test pollutes the state before they are run, and the polluted state makes them fail. The key insight behind iFixFlakies is that test suites often already have tests, which we call helpers, that contain the logic for setting/resetting the state needed for order-dependent tests to pass. iFixFlakies searches a test suite for these helpers and then recommends patches for order-dependent tests using code from the helpers. Evaluation of iFixFlakies on 137 truly order-dependent tests from a public dataset shows that 81 of them have helpers, and iFixFlakies can fix all 81. Furthermore, among our GitHub pull requests for 78 of these order-dependent tests (3 of 81 had already been fixed), developers accepted 38; the remaining ones are still pending, and none are rejected so far.

    Improvements to Test Case Prioritisation considering Efficiency and Effectiveness on Real Faults

    Despite the best efforts of programmers and component manufacturers, software does not always work perfectly. In order to guard against this, developers write test suites that execute parts of the code and compare the expected result with the actual result. Over time, test suites become expensive to run for every change, which has led to optimisation techniques such as test case prioritisation. Test case prioritisation reorders test cases within the test suite with the goal of revealing faults as soon as possible. Test case prioritisation has received considerable research attention, and that research has indicated that prioritised test suites can reveal faults faster, but due to a lack of real fault repositories available for research, prior evaluations have often been conducted on artificial faults. This thesis aims to investigate whether the use of artificial faults represents a threat to the validity of previous studies, and proposes new strategies for test case prioritisation that increase the effectiveness of test case prioritisation on real faults. This thesis conducts an empirical evaluation of existing test case prioritisation strategies on real and artificial faults, which establishes that artificial faults provide unreliable results for real faults. The study found that there are four occasions on which a strategy for test case prioritisation would be considered no better than the baseline when using one fault type, but would be considered a significant improvement over the baseline when using the other. Moreover, this evaluation reveals that existing test case prioritisation strategies perform poorly on real faults, with no strategies significantly outperforming the baseline. Given the need to improve test case prioritisation strategies for real faults, this thesis proceeds to consider other techniques that have been shown to be effective on real faults. One such technique is defect prediction, a technique that estimates the likelihood that a class contains a fault. 
This thesis proposes a test case prioritisation strategy, called G-Clef, that leverages defect prediction estimates to reorder test suites. While the evaluation of G-Clef indicates that it outperforms existing test case prioritisation strategies, the average predicted location of a faulty class is 13% of all classes in a system, which shows potential for improvement. Finally, this thesis conducts an investigative study as to whether sentiments expressed in commit messages could be used to improve the defect prediction element of G-Clef. Throughout the course of this PhD, I have created a tool called Kanonizo, an open-source tool for performing test case prioritisation on Java programs. All of the experiments and strategies used in this thesis were implemented in Kanonizo.
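
The general idea of reordering a test suite by defect prediction estimates can be sketched as follows. This is not G-Clef's algorithm, which the abstract does not specify. The function `prioritise`, its inputs, and the choice to score each test by the most fault-prone class it exercises are all assumptions made for illustration.

```python
def prioritise(tests_to_classes, defect_probability):
    """Order tests so that those exercising the most fault-prone class run first.

    tests_to_classes: dict mapping a test name to the set of class names it exercises
    defect_probability: dict mapping a class name to an estimated fault probability
    """
    def score(test):
        classes = tests_to_classes[test]
        # A test with no known classes, or only unknown ones, scores 0.0.
        return max((defect_probability.get(c, 0.0) for c in classes), default=0.0)

    # Sort by descending score; break ties by test name for a deterministic order.
    return sorted(tests_to_classes, key=lambda t: (-score(t), t))
```

Other aggregation choices, such as summing the estimates over all covered classes, would fit the same interface.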


    In modern software development practices, testing activities must be carried out frequently and preferably after each code change to bring confidence in anticipated system behaviour and, more importantly, to avoid introducing faults. When it comes to software testing, it is not only about what we are expecting; it is equally about what we are not expecting. Developers desire to test and assess the testing adequacy of the delta of behaviours between stable and modified software versions. Many test adequacy criteria have been proposed through the years, yet very few have been placed for continuous development. Among all proposed, one has been empirically verified to be the most effective in finding faults and evaluating test adequacy. Mutation Testing has been widely studied, but its current traditional form is impractical to keep up with the rapid pace of modern software development standards and code evolution due to a large number of test requirements, i.e., mutants. This dissertation proposes change-aware mutation testing, a novel approach that points to relevant change-aware test requirements, allows reasoning to what extent code modification is tested and captures behavioural relations of changed and unchanged code from which faults often arise. In particular, this dissertation builds contributions around challenges related to the code-mutants' behavioural properties, testing regular code modifications and mutants' fault detection effectiveness. First, this dissertation examines the ability of the mutants to capture the behaviour of regression faults and evaluates the relationship between the syntactic and semantic distance metrics often used to capture mutant-real fault similarity. Second, this dissertation proposes a commit-aware mutation testing approach that focuses rather on change-aware mutants that bring significant values in capturing regression faults. 
The approach shows 30% higher fault detection in comparison with baselines and sheds light on the suitability of commit-aware mutation testing in the context of evolving systems. Third, this dissertation proposes the usage of high-order mutations to identify change-impacted mutants, resulting in the most extensive dataset, to date, of commit-relevant mutants, which are further thoroughly studied to provide the understanding and elicit properties of this particular novel category. The studies led to the discovery of long-standing mutants, demonstrated as suitable to maintain a high-quality test suite for a series of code releases. Fourth, this dissertation proposes the usage of learning-based mutant selection strategies when questioning how effective the mutants of fundamentally different mutation generation approaches are in finding faults. The outcomes raise awareness of the risk that the suitability of different kinds of mutants can be misinterpreted if not using intelligent approaches to remove the noise of impractical mutants. Overall, this dissertation proposes a novel change-aware testing approach and provides insights for software testing gatekeepers towards more effective mutation testing in the context of continuously evolving systems.

    MiSFIT: Mining Software Fault Information and Types

    As software becomes more important to society, the number, age, and complexity of systems grow. Software organizations require continuous process improvement to maintain the reliability, security, and quality of these software systems. Software organizations can utilize data from manual fault classification to meet their process improvement needs, but organizations lack the expertise or resources to implement it correctly. This dissertation addresses the need for the automation of software fault classification. Validation results show that automated fault classification, as implemented in the MiSFIT tool, can group faults of similar nature. The resulting classifications show good agreement for common software faults with no manual effort. To evaluate the method and tool, I develop and apply an extended change taxonomy to classify the source code changes that repaired software faults from an open source project. MiSFIT clusters the faults based on the changes. I manually inspect a random sample of faults from each cluster to validate the results. The automatically classified faults are used to analyze the evolution of a software application over seven major releases. The contributions of this dissertation are an extended change taxonomy for software fault analysis, a method to cluster faults by the syntax of the repair, empirical evidence that fault distribution varies according to the purpose of the module, and the identification of project-specific trends from the analysis of the changes.

    Code Coverage Measurement and Fault Localization Approaches

    Code coverage measurement plays an important role in white-box testing, both in industrial practice and academic research. Several areas are highly dependent on code coverage as well, including test case generation, test prioritization, fault localization, and others. Out of these areas, this dissertation focuses on two main topics, and the thesis points are divided into two parts accordingly. The first part consists of one thesis point that discusses the differences between methods for measuring code coverage in Java and the effects of these differences. The second part focuses on a fault localization technique called spectrum-based fault localization that utilizes code coverage to estimate the risk of each program element being faulty. More specifically, the corresponding two thesis points discuss improving the efficiency of spectrum-based approaches by incorporating external information, e.g., users’ knowledge, and context data extracted from call chains.
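
How spectrum-based fault localization turns coverage into a risk estimate can be sketched in Python. The abstract does not name a ranking formula, so the sketch assumes the widely used Ochiai formula, $e_f / \sqrt{F \cdot (e_f + e_p)}$, where $e_f$ and $e_p$ count the failing and passing tests that cover an element and $F$ is the total number of failing tests. The function name `ochiai` and its input layout are hypothetical.

```python
import math

def ochiai(coverage, outcomes):
    """Rank program elements by Ochiai suspiciousness, most suspicious first.

    coverage: dict mapping a test name to the set of elements it covers
    outcomes: dict mapping a test name to True (passed) or False (failed)
    """
    total_failed = sum(1 for passed in outcomes.values() if not passed)
    elements = set().union(*coverage.values()) if coverage else set()
    scores = {}
    for element in elements:
        # ef / ep: failing / passing tests that cover this element.
        ef = sum(1 for t, cov in coverage.items() if element in cov and not outcomes[t])
        ep = sum(1 for t, cov in coverage.items() if element in cov and outcomes[t])
        denom = math.sqrt(total_failed * (ef + ep))
        scores[element] = ef / denom if denom else 0.0
    # Sort by descending score; break ties by element name.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

An element covered by every failing test and by no passing test receives the maximum score of 1.0, and an element covered only by passing tests receives 0.0.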

    How effective are mutation testing tools? An empirical analysis of Java mutation testing tools with manual analysis and real faults

    Mutation analysis is a well-studied, fault-based testing technique. It requires testers to design tests based on a set of artificial defects. The defects help in performing testing activities by measuring the ratio of defects that the candidate tests reveal. Unfortunately, applying mutation to real-world programs requires automated tools due to the vast number of defects involved. In such a case, the effectiveness of the method strongly depends on the peculiarities of the employed tools. Thus, when using automated tools, their implementation inadequacies can lead to inaccurate results. To deal with this issue, we cross-evaluate four mutation testing tools for Java, namely PIT, muJava, Major and the research version of PIT, PITRV, with respect to their fault-detection capabilities. We investigate the strengths of the tools based on: a) a set of real faults and b) manual analysis of the mutants they introduce. We find that there are large differences between the tools’ effectiveness and demonstrate that no tool is able to subsume the others. We also provide results indicating the application cost of the method. Overall, we find that PITRV achieves the best results. In particular, PITRV outperforms the other tools by finding 6% more faults than the other tools combined.

    2019 EC3 July 10-12, 2019 Chania, Crete, Greece



    As the reliance on cloud systems intensifies in our progressively digital world, understanding and reinforcing their reliability becomes more crucial than ever. Despite impressive advancements in augmenting the resilience of cloud systems, the growing incidence of complex failures now poses a substantial challenge to the availability of these systems. With cloud systems continuing to scale and increase in complexity, failures not only become more elusive to detect but can also lead to more catastrophic consequences. Such failures question the foundational premises of conventional fault-tolerance designs, necessitating the creation of novel system designs to counteract them. This dissertation aims to enhance distributed systems’ capabilities to detect, localize, and react to complex failures at runtime. To this end, this dissertation makes contributions to address three emerging categories of failures in cloud systems. The first part delves into the investigation of partial failures, introducing OmegaGen, a tool adept at generating tailored checkers for detecting and localizing such failures. The second part grapples with silent semantic failures prevalent in cloud systems, showcasing our study findings, and introducing Oathkeeper, a tool that leverages past failures to infer rules and expose these silent issues. The third part explores solutions to slow failures via RESIN, a framework specifically designed to detect, diagnose, and mitigate memory leaks in cloud-scale infrastructures, developed in collaboration with Microsoft Azure. The dissertation concludes by offering insights into future directions for the construction of reliable cloud systems.