4 research outputs found

    Improving regression testing efficiency and reliability via test-suite transformations

    Get PDF
    As software becomes more important and ubiquitous, high quality software also becomes crucial. Developers constantly make changes to improve software, and they rely on regression testing—the process of running tests after every change—to ensure that changes do not break existing functionality. Regression testing is widely used both in industry and in open source, but it suffers from two main challenges. (1) Regression testing is costly. Developers run a large number of tests in the test suite after every change, and changes happen very frequently. The cost is both in the time developers spend waiting for the tests to finish running so that developers know whether the changes break existing functionality, and in the monetary cost of running the tests on machines. (2) Regression test suites contain flaky tests, which nondeterministically pass or fail when run on the same version of code, regardless of any changes. Flaky test failures can mislead developers into believing that their changes break existing functionality, even though those tests can fail without any changes. Developers will therefore waste time trying to debug non existent faults in their changes. This dissertation proposes three lines of work that address these challenges of regression testing through test-suite transformations that modify test suites to make them more efficient or more reliable. Specifically, two lines of work explore how to reduce the cost of regression testing and one line of work explores how to fix existing flaky tests. First, this dissertation investigates the effectiveness of test-suite reduction (TSR), a traditional test-suite transformation that removes tests deemed redundant with respect to other tests in the test suite based on heuristics. TSR outputs a smaller, reduced test suite to be run in the future. However, TSR risks removing tests that can potentially detect faults in future changes. While TSR was proposed over two decades ago, it was always evaluated using program versions with seeded faults. Such evaluations do not precisely predict the effectiveness of the reduced test suite on the future changes. This dissertation evaluates TSR in a real-world setting using real software evolution with real test failures. The results show that TSR techniques proposed in the past are not as effective as suggested by traditional TSR metrics, and those same metrics do not predict how effective a reduced test suite is in the future. Researchers need to either propose new TSR techniques that produce more effective reduced test suites or better metrics for predicting the effectiveness of reduced test suites. Second, this dissertation proposes a new transformation to improve regression testing cost when using a modern build system by optimizing the placement of tests, implemented in a technique called TestOptimizer. Modern build systems treat a software project as a group of inter-dependent modules, including test modules that contain only tests. As such, when developers make a change, the build system can use a developer-specified dependency graph among modules to determine which test modules are affected by any changed modules and to run only tests in the affected test modules. However, wasteful test executions are a problem when using build systems this way. Suboptimal placements of tests, where developers may place some tests in a module that has more dependencies than the test actually needs, lead to running more tests than necessary after a change. TestOptimizer analyzes a project and proposes moving tests to reduce the number of test executions that are triggered over time due to developer changes. Evaluation of TestOptimizer on five large proprietary projects at Microsoft shows that the suggested test movements can reduce 21.7 million test executions (17.1%) across all evaluation projects. Developers accepted and intend to implement 84.4% of the reported suggestions. Third, to make regression testing more reliable, this dissertation proposes iFixFlakies, a framework for fixing a prominent kind of flaky tests: order dependent tests. Order-dependent tests pass or fail depending on the order in which the tests are run. Intuitively, order-dependent tests fail either because they need another test to set up the state for them to pass, or because some other test pollutes the state before they are run, and the polluted state makes them fail. The key insight behind iFixFlakies is that test suites often already have tests, which we call helpers, that contain the logic for setting/resetting the state needed for order-dependent tests to pass. iFixFlakies searches a test suite for these helpers and then recommends patches for order-dependent tests using code from the helpers. Evaluation of iFixFlakies on 137 truly order-dependent tests from a public dataset shows that 81 of them have helpers, and iFixFlakies can fix all 81. Furthermore, among our GitHub pull requests for 78 of these order dependent tests (3 of 81 had been already fixed), developers accepted 38; the remaining ones are still pending, and none are rejected so far

    Assessing the Efficacy of Test Selection, Prioritization, and Batching Strategies in the Presence of Flaky Tests and Parallel Execution at Scale

    Get PDF
    Effective software testing is essential for successful software releases, and numerous test optimization techniques have been proposed to enhance this process. However, existing research primarily concentrates on small datasets, resulting in impractical solutions for large-scale projects. Flaky tests, which significantly affect test optimization results, are often overlooked, and unrealistic approaches are employed to identify them. Furthermore, there is limited research on the impact of parallelization on test optimization techniques, particularly batching, and a lack of comprehensive comparisons among different techniques, including batching, which is an effective but often neglected approach. To address research gaps, we analyzed the Chrome release process and collected a dataset of 276 million test results. In addition to evaluating established test optimization algorithms, we introduced two new algorithms. We also examined the impact of parallelism by varying the number of machines used. Our assessment covered various metrics, including feedback time, failing test detection speed, test execution time, and machine utilization. Our investigation reveals that a significant portion of failures in testing is attributed to flaky tests, resulting in an inflated performance of test prioritization algorithms. Additionally, we observed that test parallelization has a non-linear impact on feedback time, as delays accumulate throughout the entire test queue. When it comes to optimizing feedback time, batching algorithms with adaptive batch sizes prove to be more effective compared to those with constant batch sizes, achieving execution reductions of up to 91%. Furthermore, our findings indicate that the batching technique is on par with the test selection algorithm in terms of effectiveness, while maintaining the advantage of not missing any failures. Practitioners are encouraged to adopt adaptive batching techniques to minimize the number of machines required for testing and reduce feedback time, while effectively managing flaky tests. Analyzing historical data is crucial for determining the threshold at which adding more machines has minimal impact on feedback time, enabling optimization of testing efficiency and resource utilization
    corecore