
    Regression Test Selection Tool for Python in Continuous Integration Process

    In this paper, we present a coverage-based regression test selection (RTS) approach and a tool for Python that implements it. The tool can be used either on a developer's machine or on build servers, and it is designed for easy integration into continuous integration and deployment pipelines. To evaluate the performance of the proposed approach, mutation testing is applied to three open-source projects, and the results of executing the full test suites are compared with those of executing the set of tests selected by the tool. The missed fault rate of the test selection varies between 0-2% at file-level granularity and 16-24% at line-level granularity. The high missed fault rate at line-level granularity is related to the basic mutation approach that was selected, and the result could be improved with advanced mutation techniques. Depending on the target optimization metric (time or precision) in a DevOps/MLOps process, the error rate could be acceptable, or it could be further improved by using file-level test selection. Peer reviewed.
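The file-level selection described in this abstract can be illustrated with a minimal sketch. It is not the tool's actual implementation: the function name `select_tests`, the shape of the coverage map, and the file names are all assumptions made for illustration. The idea is only that a test is selected when the files it covered in an earlier run intersect the files changed since then.

```python
from typing import Dict, Set


def select_tests(coverage_map: Dict[str, Set[str]], changed_files: Set[str]) -> Set[str]:
    """Return the tests whose recorded coverage touches at least one changed file."""
    return {
        test
        for test, covered in coverage_map.items()
        if covered & changed_files  # non-empty intersection means the test is affected
    }


# Hypothetical coverage data: test name -> source files the test executed.
coverage_map = {
    "test_parse": {"parser.py", "tokens.py"},
    "test_render": {"render.py"},
    "test_cli": {"cli.py", "parser.py"},
}

# A change to parser.py selects only the tests that executed it.
selected = select_tests(coverage_map, {"parser.py"})
print(sorted(selected))
```

A line-level variant would key the map on (file, line) pairs instead of file names. That selects fewer tests, which is consistent with the higher missed fault rate the abstract reports for line-level granularity.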

    An empirical comparison of four Java-based regression test selection techniques

    2020 Fall. Includes bibliographical references. Regression testing is crucial to ensure that previously tested functionality is not broken by additions, modifications, and deletions to the program code. Since regression testing is an expensive process, researchers have developed regression test selection (RTS) techniques, which select and execute only those test cases that are impacted by the code changes. In general, an RTS technique has two main activities: (1) determining dependencies between the source code and test cases, and (2) identifying the code changes. Different approaches exist in the research literature to compute dependencies statically or dynamically at different levels of granularity. Code changes can also be identified at different levels of granularity using different techniques. As a result, RTS techniques differ in the amount of reduction in test suite size, the time to select and run the test cases, test selection accuracy, and the fault detection ability of the selected subset of test cases. Researchers have empirically evaluated the RTS techniques, but the evaluations were generally conducted using different experimental settings. This thesis compares four recent Java-based RTS techniques, Ekstazi, HyRTS, OpenClover, and STARTS, with respect to the above-mentioned characteristics using multiple revisions from five open-source projects. It investigates the relationship between four program features and the performance of RTS techniques: total (program and test suite) size in KLOC, total number of classes, percentage of test classes over the total number of classes, and the percentage of classes that changed between revisions. The results show that STARTS, a static RTS technique, over-estimates dependencies between test cases and program code, and thus selects more test cases than the dynamic RTS techniques Ekstazi and HyRTS, even though all three identify code changes in the same way.
OpenClover identifies code changes differently from Ekstazi, HyRTS, and STARTS, and selects more test cases. STARTS achieved the lowest safety violation with respect to Ekstazi, and HyRTS achieved the lowest precision violation with respect to both STARTS and Ekstazi. Overall, the average fault detection ability of the RTS techniques was 8.75% lower than that of the original test suite. STARTS, Ekstazi, and HyRTS achieved higher test suite size reduction on the projects with over 100 KLOC than on those with less than 100 KLOC. OpenClover achieved a higher test suite size reduction in the subjects that had fewer classes in total. The time reduction of OpenClover is affected by the combination of the number of source classes and the number of test cases in the subjects: the higher the number of test cases and source classes, the lower the time reduction.
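The abstract reports safety and precision violations of one technique with respect to another but does not define them. The sketch below uses one formulation that appears in the RTS literature, where both are set differences normalized by the union of the two selections. This is an assumption, and the thesis may define the metrics differently. The function name and the test names are hypothetical.

```python
from typing import Set, Tuple


def violations(selected: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Compare one technique's selection against a reference technique's selection.

    Safety violation: share of tests the reference selected but this technique missed.
    Precision violation: share of tests this technique selected but the reference did not.
    Both are normalized by the size of the union (an assumed formulation).
    """
    union = selected | reference
    if not union:
        return 0.0, 0.0  # neither technique selected anything
    safety = len(reference - selected) / len(union)
    precision = len(selected - reference) / len(union)
    return safety, precision


# Hypothetical selections for a single revision.
static_selection = {"A", "B", "C", "D"}   # e.g. a static technique
dynamic_selection = {"A", "B", "E"}       # e.g. a dynamic technique used as reference

safety, precision = violations(static_selection, dynamic_selection)
print(f"safety violation={safety:.2f}, precision violation={precision:.2f}")
```

Under this formulation, a technique that over-estimates dependencies would tend to show a higher precision violation, because it selects tests the reference technique does not.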

    Predicting unstable software benchmarks using static source code features

    Software benchmarks are only as good as the performance measurements they yield. Unstable benchmarks show high variability among repeated measurements, which causes uncertainty about the actual performance and complicates reliable change assessment. However, whether a benchmark is stable or unstable only becomes evident after it has been executed and its results are available. In this paper, we introduce a machine-learning-based approach to predict a benchmark’s stability without having to execute it. Our approach relies on 58 statically computed source code features, extracted for benchmark code and code called by a benchmark, related to (1) meta information, e.g., lines of code (LOC), (2) programming language elements, e.g., conditionals or loops, and (3) potentially performance-impacting standard library calls, e.g., file and network input/output (I/O). To assess our approach’s effectiveness, we perform a large-scale experiment on 4,461 Go benchmarks coming from 230 open-source software (OSS) projects. First, we assess the prediction performance of our machine learning models using 11 binary classification algorithms. We find that Random Forest performs best, with good prediction performance from 0.79 to 0.90 in terms of AUC and from 0.43 to 0.68 in terms of MCC. Second, we perform feature importance analyses for individual features and feature categories. We find that 7 features related to meta information, slice usage, nested loops, and synchronization application programming interfaces (APIs) are individually important for good predictions, and that the combination of all features of the called source code is paramount for our model, while the combination of features of the benchmark itself is less important. Our results show that although benchmark stability is affected by more than just the source code, we can effectively utilize machine learning models to predict whether a benchmark will be stable or not ahead of execution.
This enables spending precious testing time on reliable benchmarks, helping developers identify unstable benchmarks during development, allowing unstable benchmarks to be repeated more often, estimating stability in scenarios where repeated benchmark execution is infeasible or impossible, and warning developers if new benchmarks or existing benchmarks executed in new environments will be unstable.
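The abstract reports prediction quality as MCC (Matthews correlation coefficient), which may be less familiar than AUC. As a small illustration, the sketch below computes MCC from a binary confusion matrix using the standard formula. The counts are made up for the example and are not taken from the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) through 0 (no better than chance)
    to +1 (perfect prediction).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # conventional value when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


# Hypothetical outcome for 100 benchmarks, with "unstable" as the positive class.
score = mcc(tp=40, tn=45, fp=5, fn=10)
print(f"MCC = {score:.2f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when stable and unstable benchmarks are unevenly represented.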

    Improving regression testing efficiency and reliability via test-suite transformations

    As software becomes more important and ubiquitous, high-quality software also becomes crucial. Developers constantly make changes to improve software, and they rely on regression testing (the process of running tests after every change) to ensure that changes do not break existing functionality. Regression testing is widely used both in industry and in open source, but it suffers from two main challenges. (1) Regression testing is costly. Developers run a large number of tests in the test suite after every change, and changes happen very frequently. The cost is both in the time developers spend waiting for the tests to finish running so that they know whether the changes break existing functionality, and in the monetary cost of running the tests on machines. (2) Regression test suites contain flaky tests, which nondeterministically pass or fail when run on the same version of code, regardless of any changes. Flaky test failures can mislead developers into believing that their changes break existing functionality, even though those tests can fail without any changes. Developers will therefore waste time trying to debug nonexistent faults in their changes. This dissertation proposes three lines of work that address these challenges of regression testing through test-suite transformations that modify test suites to make them more efficient or more reliable. Specifically, two lines of work explore how to reduce the cost of regression testing and one line of work explores how to fix existing flaky tests. First, this dissertation investigates the effectiveness of test-suite reduction (TSR), a traditional test-suite transformation that removes tests deemed redundant with respect to other tests in the test suite based on heuristics. TSR outputs a smaller, reduced test suite to be run in the future. However, TSR risks removing tests that can potentially detect faults in future changes.
While TSR was proposed over two decades ago, it was always evaluated using program versions with seeded faults. Such evaluations do not precisely predict the effectiveness of the reduced test suite on the future changes. This dissertation evaluates TSR in a real-world setting using real software evolution with real test failures. The results show that TSR techniques proposed in the past are not as effective as suggested by traditional TSR metrics, and those same metrics do not predict how effective a reduced test suite is in the future. Researchers need to either propose new TSR techniques that produce more effective reduced test suites or better metrics for predicting the effectiveness of reduced test suites. Second, this dissertation proposes a new transformation to improve regression testing cost when using a modern build system by optimizing the placement of tests, implemented in a technique called TestOptimizer. Modern build systems treat a software project as a group of inter-dependent modules, including test modules that contain only tests. As such, when developers make a change, the build system can use a developer-specified dependency graph among modules to determine which test modules are affected by any changed modules and to run only tests in the affected test modules. However, wasteful test executions are a problem when using build systems this way. Suboptimal placements of tests, where developers may place some tests in a module that has more dependencies than the test actually needs, lead to running more tests than necessary after a change. TestOptimizer analyzes a project and proposes moving tests to reduce the number of test executions that are triggered over time due to developer changes. Evaluation of TestOptimizer on five large proprietary projects at Microsoft shows that the suggested test movements can reduce 21.7 million test executions (17.1%) across all evaluation projects. 
Developers accepted and intend to implement 84.4% of the reported suggestions. Third, to make regression testing more reliable, this dissertation proposes iFixFlakies, a framework for fixing a prominent kind of flaky test: order-dependent tests. Order-dependent tests pass or fail depending on the order in which the tests are run. Intuitively, order-dependent tests fail either because they need another test to set up the state for them to pass, or because some other test pollutes the state before they are run, and the polluted state makes them fail. The key insight behind iFixFlakies is that test suites often already have tests, which we call helpers, that contain the logic for setting/resetting the state needed for order-dependent tests to pass. iFixFlakies searches a test suite for these helpers and then recommends patches for order-dependent tests using code from the helpers. Evaluation of iFixFlakies on 137 truly order-dependent tests from a public dataset shows that 81 of them have helpers, and iFixFlakies can fix all 81. Furthermore, among our GitHub pull requests for 78 of these order-dependent tests (3 of the 81 had already been fixed), developers accepted 38; the remaining ones are still pending, and none has been rejected so far.
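The order-dependence described in this abstract can be shown with a small self-contained sketch. It does not reproduce iFixFlakies itself: the shared `config` dictionary, the functions `polluter`, `victim`, and `helper`, and the `run` harness are hypothetical names chosen for illustration. One test pollutes shared state, another fails when it runs afterwards, and a third already contains the reset logic that a patch could reuse.

```python
# Shared mutable state that tests read and write (the source of order dependence).
config = {"mode": "default"}


def polluter():
    """Changes shared state and never restores it."""
    config["mode"] = "debug"
    assert config["mode"] == "debug"


def victim():
    """Passes only if the shared state is still in its initial form."""
    assert config["mode"] == "default"


def helper():
    """Already contains the reset logic that the victim needs."""
    config["mode"] = "default"
    assert config["mode"] == "default"


def run(order):
    """Run the tests in the given order from a fresh state; map name -> passed."""
    config["mode"] = "default"
    results = {}
    for test in order:
        try:
            test()
            results[test.__name__] = True
        except AssertionError:
            results[test.__name__] = False
    return results


print(run([victim, polluter]))          # victim runs first and passes
print(run([polluter, victim]))          # victim runs on polluted state and fails
print(run([polluter, helper, victim]))  # helper resets the state, victim passes
```

In this sketch, a patch in the spirit of the abstract would copy the helper's reset statement into the start of `victim`, so that it passes regardless of test order.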