PreciseBugCollector: Extensible, Executable and Precise Bug-fix Collection
Bug datasets are vital for enabling deep learning techniques to address
software maintenance tasks related to bugs. However, existing bug datasets
suffer from precision and scale limitations: they are either small-scale but
precise, thanks to manual validation, or large-scale but imprecise, relying on
simple commit message processing. In this paper, we introduce
PreciseBugCollector, a precise, multi-language bug collection approach that
overcomes both limitations.
PreciseBugCollector is based on two novel components: a) a bug tracker that
maps codebase repositories to external bug repositories to trace bug type
information, and b) a bug injector that generates project-specific bugs by
injecting noise into correct codebases and then executing them against their
test suites to obtain test failure messages.
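The inject-then-execute loop behind the bug injector can be sketched as follows. The injection rule, the toy function, and its tests are hypothetical stand-ins for the paper's 16 rules and real project test suites:

```python
# A "correct" codebase, reduced to a single function for illustration.
CORRECT_SOURCE = """
def is_adult(age):
    return age >= 18
"""

def inject_bug(source):
    # Hypothetical injection rule: weaken ">=" to ">" (an off-by-one fault).
    return source.replace("age >= 18", "age > 18", 1)

def run_test_suite(source):
    # Execute the candidate code, then run its tests; return the first
    # failure message, or None if the whole suite passes.
    namespace = {}
    exec(source, namespace)
    is_adult = namespace["is_adult"]
    try:
        assert is_adult(18), "boundary age 18 should count as adult"
        assert not is_adult(17), "age 17 should not count as adult"
        return None
    except AssertionError as exc:
        return str(exc)

buggy = inject_bug(CORRECT_SOURCE)
failure = run_test_suite(buggy)
# A non-None failure message confirms the injected bug is test-detectable,
# so the (buggy code, failure message) pair can be kept as a dataset entry.
print(failure)
```

Mutants that pass every test would be discarded, since they carry no observable failure message to pair with the bug.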
We implement PreciseBugCollector against three sources: 1) a bug tracker that
links to the National Vulnerability Database (NVD) to collect general
vulnerabilities, 2) a bug tracker that links to OSS-Fuzz to collect general
bugs, and 3) a bug injector based on 16 injection rules to generate
project-specific bugs. To date, PreciseBugCollector comprises 1,057,818 bugs
extracted from 2,968 open-source projects. Of these, 12,602 bugs are sourced
from bug repositories (NVD and OSS-Fuzz), while the remaining 1,045,216
project-specific bugs are generated by the bug injector. Considering the
challenge objectives, we argue that a bug injection approach is highly
valuable in industrial settings, since project-specific bugs align with domain
knowledge, share the same codebase, and adhere to the coding style employed in
industrial projects.
Comment: Accepted at the industry challenge track of ASE 202
Empirical Study of Restarted and Flaky Builds on Travis CI
Continuous Integration (CI) is a development practice where developers
frequently integrate code into a common codebase. After the code is integrated,
the CI server runs a test suite and other tools to produce a set of reports
(e.g., output of linters and tests). If the result of a CI test run is
unexpected, developers have the option to manually restart the build,
re-running the same test suite on the same code; this can reveal build
flakiness if the restarted build's outcome differs from the original's. In
this study, we analyze restarted builds, flaky builds, and their impact on the
development workflow. We observe that developers restart at least 1.72% of
builds, amounting to 56,522 restarted builds in our Travis CI dataset. We
observe that more mature and more complex projects are more likely to include
restarted builds. The restarted builds are mostly builds that initially
failed due to a test, a network problem, or a Travis CI limitation such as an
execution timeout. Finally, we observe that restarted builds have a major
impact on the development workflow. Indeed, in 54.42% of the restarted builds,
the developers analyze and restart a build within an hour of the initial
failure. This suggests that developers wait for CI results, interrupting their
workflow to address the issue. Restarted builds also slow down the merging of
pull requests by a factor of three, raising the median merging time from 16h
to 48h.
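The flakiness signal described above, a restarted build whose outcome differs from the original run, can be sketched as a simple comparison over build records (the schema below is illustrative, not the study's actual Travis CI dataset format):

```python
# Each record pairs a build's original outcome with its restarted outcome.
# Field names and values are hypothetical.
builds = [
    {"id": 1, "original": "failed", "restarted": "passed"},  # flaky
    {"id": 2, "original": "failed", "restarted": "failed"},  # genuine failure
    {"id": 3, "original": "passed", "restarted": "passed"},  # stable
]

# Same code, same test suite, different result: the restart reveals flakiness.
flaky_ids = [b["id"] for b in builds if b["original"] != b["restarted"]]
print(flaky_ids)
```

In practice the comparison is only meaningful because a restart re-runs the identical commit, so any outcome change must come from non-determinism in the build, not from a code change.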
Contextual Predictive Mutation Testing
Mutation testing is a powerful technique for assessing and improving test
suite quality that artificially introduces bugs and checks whether the test
suites catch them. However, it is also computationally expensive and thus does
not scale to large systems and projects. One promising recent approach to
tackling this scalability problem uses machine learning to predict whether the
tests will detect the synthetic bugs, without actually running those tests.
However, existing predictive mutation testing approaches still misclassify 33%
of detection outcomes on a randomly sampled set of mutant-test suite pairs. We
introduce MutationBERT, an approach for predictive mutation testing that
simultaneously encodes the source method mutation and test method, capturing
key context in the input representation. Thanks to its higher precision,
MutationBERT saves 33% of the time spent by a prior approach on
checking/verifying live mutants. MutationBERT also outperforms the
state-of-the-art in both same-project and cross-project settings, with
meaningful improvements in precision, recall, and F1 score. We validate our
input representation and our aggregation approaches for lifting predictions
from the test-matrix level to the test-suite level, finding similar
improvements in performance. MutationBERT not only advances the
state-of-the-art in predictive mutation testing, but also offers practical
benefits for real-world applications, both in saving developer time and in
finding hard-to-detect mutants.
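The joint input idea, encoding the mutated method and the test method together so the model sees both contexts at once, can be sketched as follows. The whitespace tokenizer and the [CLS]/[SEP] markers are simplified stand-ins for a real BERT-style pipeline, not MutationBERT's actual vocabulary:

```python
# A mutant (the '+' operator replaced by '-') and the test that exercises it.
mutated_method = "def add(a, b): return a - b"
test_method = "def test_add(): assert add(2, 3) == 5"

def build_pair_input(method, test):
    # Joint sequence: [CLS] <method tokens> [SEP] <test tokens> [SEP],
    # so attention can flow between the mutation site and the assertion.
    return ["[CLS]"] + method.split() + ["[SEP]"] + test.split() + ["[SEP]"]

tokens = build_pair_input(mutated_method, test_method)
print(tokens)
```

A classifier over such paired sequences can then predict whether the test would kill the mutant, without the expense of actually running the test.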
Mind the Gap: The Difference Between Coverage and Mutation Score Can Guide Testing Efforts
An "adequate" test suite should effectively find all inconsistencies between
a system's requirements/specifications and its implementation. Practitioners
frequently use code coverage to approximate adequacy, while academics argue
that mutation score better approximates true (oracular) adequacy.
High code coverage is increasingly attainable even on large systems via
automatic test generation, including fuzzing. In light of all of these options
for measuring and improving testing effort, how should a QA engineer spend
their time? We propose a new framework for reasoning about the extent, limits,
and nature of a given testing effort based on an idea we call the oracle gap,
or the difference between source code coverage and mutation score for a given
software element. We conduct (1) a large-scale observational study of the
oracle gap across popular Maven projects, (2) a study that varies testing and
oracle quality across several of those projects and (3) a small-scale
observational study of highly critical, well-tested code across comparable
blockchain projects. We show that the oracle gap surfaces important information
about the extent and quality of a test effort beyond either adequacy metric
alone. In particular, it gives practitioners a way to identify source files
where a weak oracle likely tests important code.
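The oracle gap itself is a simple per-file subtraction of mutation score from coverage; a minimal sketch with made-up numbers:

```python
# Per-file metrics, both on a 0..1 scale. The file names and values are
# illustrative, not data from the study.
files = {
    "parser.py": {"coverage": 0.95, "mutation_score": 0.40},
    "utils.py":  {"coverage": 0.90, "mutation_score": 0.85},
}

def oracle_gap(metrics):
    # High coverage with a low mutation score means the tests execute the
    # code but rarely check its behavior: a weak oracle.
    return metrics["coverage"] - metrics["mutation_score"]

# Rank files by gap; the top entries are candidates for stronger assertions.
ranked = sorted(files, key=lambda name: oracle_gap(files[name]), reverse=True)
print(ranked[0])
```

Here `parser.py` has a gap of 0.55 versus 0.05 for `utils.py`, so a QA engineer would prioritize strengthening the assertions around `parser.py` rather than writing tests that merely add coverage.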