FlakiMe: Laboratory-Controlled Test Flakiness Impact Assessment
Much research on software testing makes an implicit assumption that test failures are deterministic, such that they always witness the presence of the same defects. However, this assumption is not always true, because some test failures are due to so-called flaky tests, i.e., tests with non-deterministic outcomes. To help testing researchers better investigate flakiness, we introduce a test flakiness assessment and experimentation platform, called FlakiMe. FlakiMe supports the seeding of a (controllable) degree of flakiness into the behaviour of a given test suite. Thereby, FlakiMe equips researchers with ways to investigate the impact of test flakiness on their techniques under laboratory-controlled conditions. To demonstrate the application of FlakiMe, we use it to assess the impact of flakiness on mutation testing and program repair (the PRAPR and ARJA methods). Our results indicate that a flakiness level of 10% is sufficient to affect the mutation score, although the effect size is modest (2% - 5%). By contrast, the same level of flakiness reduces the number of patches produced by between 20% and 100% across repair problems; a devastating impact on this application of testing. Our experiments with FlakiMe demonstrate that flakiness affects different testing applications in very different ways, thereby motivating the need for a laboratory-controllable flakiness impact assessment platform and approach such as FlakiMe.
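
To illustrate the kind of seeding the abstract describes, the Python sketch below is a minimal example of our own, not the authors' implementation: it wraps a test function so that a genuinely passing run is flipped to a spurious failure with a controllable probability. The decorator name, the flake_rate parameter, and the fixed seed are illustrative assumptions.

    # Minimal sketch of controlled flakiness seeding (illustrative only;
    # FlakiMe's actual mechanism may differ). A passing test is turned
    # into a spurious failure with probability `flake_rate`.
    import functools
    import random

    def seed_flakiness(flake_rate, rng=None):
        rng = rng or random.Random()

        def decorator(test_fn):
            @functools.wraps(test_fn)
            def wrapper(*args, **kwargs):
                result = test_fn(*args, **kwargs)  # a real failure still raises
                if rng.random() < flake_rate:
                    raise AssertionError("seeded flaky failure")
                return result
            return wrapper
        return decorator

    # Example: seed the 10% flakiness level studied in the paper, with a
    # fixed seed so laboratory experiments stay reproducible.
    @seed_flakiness(flake_rate=0.10, rng=random.Random(42))
    def test_addition():
        assert 1 + 1 == 2
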
Flakify: A Black-Box, Language Model-Based Predictor for Flaky Tests
Software testing assures that code changes do not adversely affect existing
functionality. However, a test case can be flaky, i.e., passing and failing
across executions, even for the same version of the source code. Flaky test
cases introduce overhead to software development as they can lead to
unnecessary attempts to debug production or testing code. The state-of-the-art
ML-based flaky test case predictors rely on pre-defined sets of features
that are either project-specific or require access to production code, which
is not always available to software test engineers. Therefore, in this paper,
we
propose Flakify, a black-box, language model-based predictor for flaky test
cases. Flakify relies exclusively on the source code of test cases, thus
requiring neither (a) access to production code (black-box), nor (b)
rerunning of test cases, nor (c) pre-defined features. To this end, we
employed CodeBERT, a pre-trained
language model, and fine-tuned it to predict flaky test cases using the source
code of test cases. We evaluated Flakify on two publicly available datasets
(FlakeFlagger and IDoFT) for flaky test cases and compared our technique with
the FlakeFlagger approach using two different evaluation procedures:
cross-validation and per-project validation. Flakify achieved high F1-scores on
both datasets using cross-validation and per-project validation, and surpassed
FlakeFlagger by 10 and 18 percentage points in terms of precision and recall,
respectively, when evaluated on the FlakeFlagger dataset, thus reducing, by
the same percentages, the cost otherwise wasted on unnecessarily debugging
test cases and production code. Flakify also achieved significantly higher
prediction results when used to predict test cases on new projects, suggesting
better generalizability over FlakeFlagger. Our results further show that a
black-box version of FlakeFlagger is not a viable option for predicting flaky
test cases.
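
To make the described pipeline concrete, the Python sketch below (ours, not the paper's code) fine-tunes CodeBERT as a binary classifier over raw test-case source code using the Hugging Face transformers library. The toy batch, label, learning rate, and 512-token truncation are illustrative assumptions rather than the paper's setup.

    # Minimal sketch of a Flakify-style predictor: CodeBERT fine-tuned to
    # classify raw test-case source code as flaky (1) or not (0).
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/codebert-base", num_labels=2)

    # One fine-tuning step on a toy labelled pair (test source, is_flaky);
    # real training would iterate over a dataset such as IDoFT.
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    batch = tokenizer(["public void testFoo() { assertEquals(1, foo()); }"],
                      truncation=True, max_length=512, return_tensors="pt")
    loss = model(**batch, labels=torch.tensor([1])).loss
    loss.backward()
    optimizer.step()

    def predict_flaky(test_source: str) -> bool:
        # Black-box prediction: only the test's own source code is used,
        # with no access to production code and no reruns.
        inputs = tokenizer(test_source, truncation=True, max_length=512,
                           return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        return logits.argmax(dim=-1).item() == 1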