
    FlakiMe: Laboratory-Controlled Test Flakiness Impact Assessment

    Much research on software testing makes an implicit assumption that test failures are deterministic, such that they always witness the presence of the same defects. However, this assumption does not always hold, because some test failures are due to so-called flaky tests, i.e., tests with non-deterministic outcomes. To help testing researchers better investigate flakiness, we introduce a test flakiness assessment and experimentation platform called FlakiMe. FlakiMe supports the seeding of a controllable degree of flakiness into the behaviour of a given test suite, thereby equipping researchers with ways to investigate the impact of test flakiness on their techniques under laboratory-controlled conditions. To demonstrate the application of FlakiMe, we use it to assess the impact of flakiness on mutation testing and program repair (the PRAPR and ARJA methods). The results indicate that 10% flakiness is sufficient to affect the mutation score, although the effect size is modest (2%-5%), whereas it reduces the number of patches produced for repair by between 20% and 100%, depending on the repair problem; a devastating impact on this application of testing. Our experiments with FlakiMe demonstrate that flakiness affects different testing applications in very different ways, thereby motivating the need for a laboratory-controllable flakiness impact assessment platform and approach such as FlakiMe.
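
    The core mechanism described above, seeding a controllable probability of failure into otherwise passing tests, can be pictured with a small, hypothetical Java helper; the class and method names are illustrative and are not FlakiMe's actual API:

        import java.util.Random;

        /**
         * A minimal, hypothetical sketch of flakiness seeding in the spirit of FlakiMe:
         * a passing test is made to fail with a configurable probability, independently
         * of the code under test.
         */
        public class FlakinessSeeder {

            private final double failureProbability;
            private final Random random;

            public FlakinessSeeder(double failureProbability, long seed) {
                this.failureProbability = failureProbability;
                this.random = new Random(seed); // fixed seed keeps the experiment reproducible
            }

            /** Call at the end of a test body to inject a non-deterministic failure. */
            public void maybeFail(String testName) {
                if (random.nextDouble() < failureProbability) {
                    throw new AssertionError("Seeded flaky failure in " + testName);
                }
            }
        }

    A test would call maybeFail("testCheckout") after its real assertions; with failureProbability set to 0.1, roughly 10% of executions fail regardless of whether a defect is present, which is the kind of laboratory-controlled setting the mutation testing and repair experiments above rely on.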

    Improving regression testing efficiency and reliability via test-suite transformations

    As software becomes more important and ubiquitous, high-quality software also becomes crucial. Developers constantly make changes to improve software, and they rely on regression testing—the process of running tests after every change—to ensure that changes do not break existing functionality. Regression testing is widely used both in industry and in open source, but it suffers from two main challenges. (1) Regression testing is costly: developers run a large number of tests in the test suite after every change, and changes happen very frequently. The cost lies both in the time developers spend waiting for the tests to finish so that they know whether the changes break existing functionality, and in the monetary cost of running the tests on machines. (2) Regression test suites contain flaky tests, which non-deterministically pass or fail when run on the same version of code, regardless of any changes. Flaky test failures can mislead developers into believing that their changes break existing functionality, even though those tests can fail without any changes, so developers waste time trying to debug non-existent faults in their changes. This dissertation proposes three lines of work that address these challenges through test-suite transformations that modify test suites to make them more efficient or more reliable: two lines of work explore how to reduce the cost of regression testing, and one explores how to fix existing flaky tests.

    First, this dissertation investigates the effectiveness of test-suite reduction (TSR), a traditional test-suite transformation that removes tests deemed redundant with respect to other tests in the test suite based on heuristics. TSR outputs a smaller, reduced test suite to be run in the future; however, it risks removing tests that could detect faults in future changes. Although TSR was proposed over two decades ago, it has always been evaluated using program versions with seeded faults, and such evaluations do not precisely predict the effectiveness of the reduced test suite on future changes. This dissertation evaluates TSR in a real-world setting, using real software evolution with real test failures. The results show that previously proposed TSR techniques are not as effective as traditional TSR metrics suggest, and those same metrics do not predict how effective a reduced test suite will be in the future. Researchers need to either propose new TSR techniques that produce more effective reduced test suites or develop better metrics for predicting the effectiveness of reduced test suites.

    Second, this dissertation proposes a new transformation that reduces regression testing cost under a modern build system by optimizing the placement of tests, implemented in a technique called TestOptimizer. Modern build systems treat a software project as a group of inter-dependent modules, including test modules that contain only tests. When developers make a change, the build system can use a developer-specified dependency graph among modules to determine which test modules are affected by the changed modules and to run only the tests in those affected test modules. However, wasteful test executions are a problem when using build systems this way: suboptimal placements of tests, where developers place some tests in a module that has more dependencies than the tests actually need, lead to running more tests than necessary after a change. TestOptimizer analyzes a project and proposes moving tests to reduce the number of test executions triggered over time by developer changes. An evaluation of TestOptimizer on five large proprietary projects at Microsoft shows that the suggested test movements can reduce test executions by 21.7 million (17.1%) across all evaluation projects. Developers accepted and intend to implement 84.4% of the reported suggestions.

    Third, to make regression testing more reliable, this dissertation proposes iFixFlakies, a framework for fixing a prominent kind of flaky test: order-dependent tests, which pass or fail depending on the order in which the tests are run. Intuitively, an order-dependent test fails either because it needs another test to set up the state for it to pass, or because some other test pollutes the state before it runs and the polluted state makes it fail. The key insight behind iFixFlakies is that test suites often already contain tests, which we call helpers, with the logic for setting or resetting the state that order-dependent tests need to pass. iFixFlakies searches a test suite for these helpers and then recommends patches for order-dependent tests using code from the helpers, as illustrated below. An evaluation of iFixFlakies on 137 truly order-dependent tests from a public dataset shows that 81 of them have helpers, and iFixFlakies can fix all 81. Furthermore, of our GitHub pull requests for 78 of these order-dependent tests (3 of the 81 had already been fixed), developers have accepted 38; the remaining ones are still pending, and none have been rejected so far.
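
    The order-dependent pattern that iFixFlakies targets can be pictured with a small, hypothetical JUnit 4 suite; the class and the shared cache are made up for illustration and are not taken from the dissertation:

        import static org.junit.Assert.assertEquals;
        import org.junit.Test;

        /** Hypothetical example of an order-dependent test, its polluter, and a helper. */
        public class CacheTests {

            // Shared mutable state that leaks between tests.
            static final java.util.Map<String, String> CACHE = new java.util.HashMap<>();

            @Test
            public void testPollutesState() {      // "polluter": leaves stale data behind
                CACHE.put("user", "alice");
                assertEquals(1, CACHE.size());
            }

            @Test
            public void testResetsState() {        // "helper": contains the cleaning logic
                CACHE.clear();
                assertEquals(0, CACHE.size());
            }

            @Test
            public void testStartsEmpty() {        // order-dependent "victim"
                assertEquals(0, CACHE.size());     // fails if run right after the polluter
            }
        }

    Run after the polluter, testStartsEmpty fails; run first, it passes. A patch in the style described above copies the helper's CACHE.clear() into a setup step of the order-dependent test.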

    FlaKat: A Machine Learning-Based Categorization Framework for Flaky Tests

    Flaky tests can pass or fail non-deterministically, without alterations to a software system. Such tests are frequently encountered by developers and undermine the credibility of test suites, and they have therefore caught the attention of researchers in recent years. Numerous approaches have been published on defining, locating, and categorizing flaky tests, along with auto-repair strategies for specific types of flakiness. Practitioners have developed several techniques to detect flaky tests automatically. The most traditional approaches rely on repeated execution of test suites, accompanied by techniques such as shuffled execution order and random distortion of the environment. State-of-the-art research also incorporates machine learning solutions into flaky test detection and achieves reasonably good accuracy. Moreover, strategies for repairing flaky tests have been published and automated for specific flaky test categories. However, there is a research gap between flaky test detection and category-specific flakiness repair. To address this gap, this thesis proposes a novel categorization framework, called FlaKat, which uses machine learning classifiers for fast and accurate categorization of a given flaky test case. FlaKat first parses raw flaky tests and converts them into vector embeddings; the dimensionality of the embeddings is reduced, and the resulting representations are used to train machine learning classifiers. Sampling techniques are applied to address the imbalance between flaky test categories in the dataset. FlaKat was evaluated with different combinations of configurations using known flaky tests from 108 open-source Java projects. Notably, Implementation-Dependent and Order-Dependent flaky tests, which represent almost 75% of the total dataset, achieved F1 scores (the harmonic mean of precision and recall) of 0.94 and 0.90 respectively, while the overall macro average (no weight difference between categories) is 0.67. This work also proposes a new evaluation metric, called Flakiness Detection Capacity (FDC), for measuring the accuracy of classifiers from the perspective of information theory, and provides proof of its effectiveness. The final FDC results also align with the F1 score regarding which classifier yields the best flakiness classification.
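
    For readers unfamiliar with the metrics quoted above, the following sketch shows how a per-category F1 score and the unweighted macro average are computed; the confusion counts are invented purely to illustrate the arithmetic and are not FlaKat's data:

        public class F1Demo {

            // F1 = harmonic mean of precision and recall, computed from one category's counts.
            static double f1(int truePositives, int falsePositives, int falseNegatives) {
                double precision = truePositives / (double) (truePositives + falsePositives);
                double recall    = truePositives / (double) (truePositives + falseNegatives);
                return 2 * precision * recall / (precision + recall);
            }

            public static void main(String[] args) {
                // Hypothetical confusion counts for three categories.
                double categoryA = f1(85, 15, 10);
                double categoryB = f1(40, 20, 25);
                double categoryC = f1(5, 12, 9);

                // The macro average weights every category equally, however small it is,
                // which is why strong scores on the large categories can coexist with a
                // much lower overall macro average.
                double macro = (categoryA + categoryB + categoryC) / 3.0;
                System.out.printf("F1: %.2f %.2f %.2f, macro average: %.2f%n",
                        categoryA, categoryB, categoryC, macro);
            }
        }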

    Understanding and Mitigating Flaky Software Test Cases

    A flaky test is a test case that can pass or fail without changes to the test case code or the code under test. Flaky tests are a widespread problem with serious consequences for developers and researchers alike. For developers, flaky tests lead to time wasted debugging spurious failures, tempting them to ignore future failures. While unreliable, flaky tests can still indicate genuine issues in the code under test, so ignoring them can lead to bugs being missed. The non-deterministic behaviour of flaky tests is also a major obstacle to continuous integration, where a single flaky test can fail an entire build. For researchers, flaky tests challenge the assumption that a test failure implies a bug, an assumption that many fundamental techniques in software engineering research rely upon, including test acceleration, mutation testing, and fault localisation. Despite increasing research interest in the topic, open problems remain. In particular, despite a considerable body of empirical work, relatively little attention has been paid to the views and experiences of developers, which are essential to guide research towards the areas most likely to benefit the software engineering industry. Furthermore, previous automated techniques for detecting flaky tests are typically based either on exhaustively rerunning test cases or on machine learning classifiers. The prohibitive runtime of the rerunning approach and the demonstrably poor inter-project generalisability of classifiers leave practitioners with a stark choice when it comes to automatically detecting flaky tests. In response to these challenges, I set two high-level goals for this thesis: (1) to enhance the understanding of the manifestation, causes, and impacts of flaky tests; and (2) to develop and empirically evaluate efficient automated techniques for mitigating flaky tests. In pursuit of these goals, this thesis makes five contributions: (1) a comprehensive systematic literature review of 76 published papers; (2) a literature-guided survey of 170 professional software developers; (3) a new feature set for encoding test cases in machine learning-based flaky test detection; (4) a novel approach for reducing the time cost of rerunning-based techniques for detecting flaky tests by combining them with machine learning classifiers; and (5) an automated technique that detects and classifies existing flaky tests in a project and produces reusable, project-specific machine learning classifiers able to provide fast and accurate predictions for future test cases in that project.
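
    Contribution (4) above pairs classifiers with reruns. One way such a combination can work, sketched here as an assumption rather than the thesis's exact technique, is to spend the expensive reruns only on tests the model scores as likely flaky:

        /**
         * Hypothetical sketch of pairing a machine learning classifier with rerunning.
         * The Classifier and TestRunner interfaces are assumptions made for this sketch,
         * not an API from the thesis.
         */
        public class HybridFlakyDetector {

            interface Classifier { double flakinessScore(String testId); }   // cheap, static prediction
            interface TestRunner { boolean runOnce(String testId); }         // true = pass, false = fail

            private final Classifier classifier;
            private final TestRunner runner;
            private final double threshold;
            private final int maxReruns;

            HybridFlakyDetector(Classifier classifier, TestRunner runner, double threshold, int maxReruns) {
                this.classifier = classifier;
                this.runner = runner;
                this.threshold = threshold;
                this.maxReruns = maxReruns;
            }

            /** A test is reported flaky if it both passes and fails across the reruns. */
            boolean isFlaky(String testId) {
                if (classifier.flakinessScore(testId) < threshold) {
                    return false; // predicted stable: skip the costly reruns entirely
                }
                boolean sawPass = false, sawFail = false;
                for (int i = 0; i < maxReruns && !(sawPass && sawFail); i++) {
                    if (runner.runOnce(testId)) sawPass = true; else sawFail = true;
                }
                return sawPass && sawFail;
            }
        }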

    Test Flakiness Prediction Techniques for Evolving Software Systems


    PEELER: Learning to Effectively Predict Flakiness without Running Tests

    Regression testing is a widely adopted approach to expose change-induced bugs and to verify the correctness and robustness of code in modern software development settings. Unfortunately, the occurrence of flaky tests significantly increases the cost of regression testing and eventually reduces the productivity of developers, i.e., their ability to find and fix real problems. State-of-the-art approaches leverage dynamic test information obtained through expensive re-execution of test cases to effectively identify flaky tests. To account for scalability constraints, some recent approaches build on static test case features, but they fall short on effectiveness. In this paper, we introduce PEELER, a new fully static approach for predicting flaky tests that explores a representation of test cases based on data dependency relations. The predictor is trained as a neural-network-based model, which simultaneously achieves scalability (it does not require any test execution), effectiveness (it exploits relevant test dependency features), and practicality (it can be applied in the wild to find new flaky tests). Experimental validation on 17,532 test cases from 21 Java projects shows that PEELER outperforms the state-of-the-art FlakeFlagger by around 20 percentage points: it catches 22% more flaky tests while yielding 51% fewer false positives. Finally, in a live study with projects in the wild, we reported 21 flakiness cases to developers, 12 of which they have already confirmed as flaky.
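
    As a simplified illustration of what "fully static" means here, the sketch below reads features straight from a test's source text without executing it. PEELER's actual representation is built from data dependency relations and fed to a neural model; the keyword counts used here are only a hypothetical stand-in for that richer representation:

        import java.util.LinkedHashMap;
        import java.util.Map;

        /** Toy static feature extractor: no test execution, only source text inspection. */
        public class StaticTestFeatures {

            static Map<String, Integer> extract(String testSource) {
                // Source-level signals often associated with flakiness (illustrative list).
                String[] signals = {
                        "Thread.sleep", "new Random", "System.currentTimeMillis",
                        "new Thread", "HttpURLConnection", "File.createTempFile"
                };
                Map<String, Integer> features = new LinkedHashMap<>();
                for (String signal : signals) {
                    int count = 0, from = 0;
                    while ((from = testSource.indexOf(signal, from)) != -1) {
                        count++;
                        from += signal.length();
                    }
                    features.put(signal, count);
                }
                return features; // in a real pipeline these would feed a trained classifier
            }
        }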

    Assessing the Effectiveness of Defect Prediction-based Test Suites at Localizing Faults

    Debugging a software program is a significant and laborious task for programmers, often consuming a substantial amount of time. The need to identify the faulty lines of code further compounds this challenge, leading to decreased overall productivity. Consequently, automated tools for fault detection are needed to streamline the debugging process and enhance programmer productivity. In recent years, the field of automatic test generation has witnessed remarkable advancements, significantly improving the efficacy of automatic tests in detecting faults, and fault localization can be further improved through the use of such tools. This dissertation conducts an experimental study that assembles specialized automatic test generation tools designed to detect faults by estimating the likelihood of code being faulty. These tools are compared against each other to discern their relative performance and effectiveness. Additionally, the study comprehensively compares developer-written tests with automatically generated tests to evaluate their respective aptitude for fault detection. Through this investigation, we seek to identify the most effective automated test generation tool while providing valuable insights into the relative merits of developer-written and automatically generated tests for fault detection.
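
    As background for "estimating the likelihood of code being faulty", one standard approach from the fault localization literature is spectrum-based suspiciousness scoring, for example with the Ochiai formula sketched below; this is a generic illustration, not necessarily the method used by the tools in the study:

        /** Ochiai suspiciousness: a common spectrum-based fault localization formula. */
        public class Ochiai {

            /**
             * @param failedCovering number of failing tests that execute the statement
             * @param passedCovering number of passing tests that execute the statement
             * @param totalFailed    total number of failing tests in the suite
             */
            static double suspiciousness(int failedCovering, int passedCovering, int totalFailed) {
                double denominator = Math.sqrt((double) totalFailed * (failedCovering + passedCovering));
                return denominator == 0 ? 0.0 : failedCovering / denominator;
            }
        }

    Statements executed by many failing tests and few passing tests score closer to 1 and are inspected first.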

    How effective are mutation testing tools? An empirical analysis of Java mutation testing tools with manual analysis and real faults

    Mutation analysis is a well-studied, fault-based testing technique. It requires testers to design tests based on a set of artificial defects (mutants). These defects support testing activities by providing a measurable ratio of defects revealed by the candidate tests, known as the mutation score. Unfortunately, applying mutation to real-world programs requires automated tools due to the vast number of defects involved, and in that case the effectiveness of the method strongly depends on the peculiarities of the employed tools: their implementation inadequacies can lead to inaccurate results. To deal with this issue, we cross-evaluate four mutation testing tools for Java, namely PIT, muJava, Major, and the research version of PIT, PITRV, with respect to their fault-detection capabilities. We investigate the strengths of the tools based on (a) a set of real faults and (b) manual analysis of the mutants they introduce. We find that there are large differences between the tools’ effectiveness and demonstrate that no tool subsumes the others. We also provide results indicating the application cost of the method. Overall, we find that PITRV achieves the best results; in particular, PITRV outperforms the other tools by finding 6% more faults than the other tools combined.
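
    The mutation score mentioned above is the fraction of artificial defects (mutants) that at least one test detects ("kills"). The following generic sketch, not tied to PIT, muJava, Major, or PITRV, shows a single mutant and the score arithmetic:

        /** Illustrative sketch of a mutant and the mutation score computation. */
        public class MutationScoreDemo {

            // Original code under test.
            static boolean isAdult(int age) { return age >= 18; }

            // One possible mutant: the relational operator is changed (>= becomes >).
            static boolean isAdultMutant(int age) { return age > 18; }

            public static void main(String[] args) {
                // A test with the boundary input 18 kills this mutant:
                // the original returns true, the mutant returns false.
                boolean killed = isAdult(18) != isAdultMutant(18);

                int killedMutants = killed ? 1 : 0;
                int totalMutants = 1;
                System.out.printf("Mutation score: %.0f%%%n", 100.0 * killedMutants / totalMutants);
            }
        }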