7,960 research outputs found
You Cannot Fix What You Cannot Find! An Investigation of Fault Localization Bias in Benchmarking Automated Program Repair Systems
Properly benchmarking Automated Program Repair (APR) systems should
contribute to the development and adoption of the research outputs by
practitioners. To that end, the research community must ensure that it reaches
significant milestones by reliably comparing state-of-the-art tools for a
better understanding of their strengths and weaknesses. In this work, we
identify and investigate a practical bias caused by the fault localization (FL)
step in a repair pipeline. We propose to highlight the different fault
localization configurations used in the literature, and their impact on APR
systems when applied to the Defects4J benchmark. Then, we explore the
performance variations that can be achieved by "tweaking" the FL step.
Eventually, we expect to create a new momentum for (1) full disclosure of APR
experimental procedures with respect to FL, (2) realistic expectations of
repairing bugs in Defects4J, as well as (3) reliable performance comparison
among the state-of-the-art APR systems, and against the baseline performance
results of our thoroughly assessed kPAR repair tool. Our main findings include:
(a) only a subset of Defects4J bugs can be currently localized by commonly-used
FL techniques; (b) current practice of comparing state-of-the-art APR systems
(i.e., counting the number of fixed bugs) is potentially misleading due to the
bias of FL configurations; and (c) APR authors do not properly qualify their
performance achievement with respect to the different tuning parameters
implemented in APR systems.
Comment: Accepted by ICST 201
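The comparison bias described above can be illustrated with a small sketch: before counting fixed bugs, restrict every tool's results to the bugs that the assumed FL configuration can actually localize. All bug IDs and tool results below are invented for illustration, not real experimental data.

```python
# Hypothetical sketch of an FL-aware APR comparison.
# Bug IDs follow the Defects4J naming style, but the sets are made up.

def fair_fixed_count(fixed_bugs, localizable_bugs):
    """Count only fixes for bugs that the FL step could localize."""
    return len(set(fixed_bugs) & set(localizable_bugs))

# Bugs a given FL configuration can localize (illustrative).
localizable = {"Chart-1", "Lang-6", "Math-5"}

# Raw per-tool results (illustrative). Tool A's extra fix is for a bug
# the FL configuration cannot localize, so a raw count would inflate it.
tool_a_fixed = {"Chart-1", "Math-5", "Closure-2"}
tool_b_fixed = {"Lang-6", "Math-5"}

print(fair_fixed_count(tool_a_fixed, localizable))  # 2
print(fair_fixed_count(tool_b_fixed, localizable))  # 2
```

Under this restriction the two hypothetical tools tie, whereas raw counts (3 vs. 2) would have ranked Tool A ahead purely because of its FL configuration.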
Dissection of a Bug Dataset: Anatomy of 395 Patches from Defects4J
Well-designed and publicly available datasets of bugs are an invaluable asset
to advance research fields such as fault localization and program repair as
they allow direct and fair comparison between competing techniques and also
the replication of experiments. These datasets need to be deeply understood by
researchers: the answers to questions like "which bugs can my technique
handle?" and "for which bugs is my technique effective?" depend on the
comprehension of properties related to bugs and their patches. However, such
properties are usually not included in the datasets, and there is still no
widely adopted methodology for characterizing bugs and patches. In this work,
we deeply study 395 patches of the Defects4J dataset. Quantitative properties
(patch size and spreading) were automatically extracted, whereas qualitative
ones (repair actions and patterns) were manually extracted using a thematic
analysis-based approach. We found that 1) the median size of Defects4J patches
is four lines, and almost 30% of the patches contain only addition of lines; 2)
92% of the patches change only one file, and 38% have no spreading at all; 3)
the top-3 most applied repair actions are addition of method calls,
conditionals, and assignments, occurring in 77% of the patches; and 4) nine
repair patterns were found for 95% of the patches, where the most prevalent,
appearing in 43% of the patches, is on conditional blocks. These results are
useful for researchers to perform advanced analysis on their techniques'
results based on Defects4J. Moreover, our set of properties can be used to
characterize and compare different bug datasets.
Comment: Accepted for SANER'18 (25th edition of IEEE International Conference on Software Analysis, Evolution and Reengineering), Campobasso, Italy
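The two quantitative properties mentioned above, patch size and spreading, can be extracted mechanically from a unified diff. The sketch below is not the authors' tooling; in particular, approximating "spreading" as the number of gaps between change hunks is our own assumption.

```python
# Minimal sketch: patch size and spreading from unified-diff lines.

def patch_properties(diff_lines):
    size = 0          # count of added/removed lines
    hunks = 0         # count of '@@' hunk headers
    files = set()     # files touched by the patch
    for line in diff_lines:
        if line.startswith("+++ "):
            files.add(line[4:])
        elif line.startswith("@@"):
            hunks += 1
        elif line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
            size += 1
    # Assumption: spreading = gaps between hunks; 0 means a contiguous patch.
    spreading = max(hunks - 1, 0)
    return {"size": size, "files": len(files), "spreading": spreading}

# Invented single-hunk diff for illustration.
demo_diff = [
    "--- a/Foo.java",
    "+++ b/Foo.java",
    "@@ -10,3 +10,4 @@",
    "-    return x;",
    "+    if (x < 0) return 0;",
    "+    return x;",
]
print(patch_properties(demo_diff))  # {'size': 3, 'files': 1, 'spreading': 0}
```

A patch like this one would fall in the abstract's most common category: one file touched and no spreading.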
FixEval: Execution-based Evaluation of Program Fixes for Programming Problems
The increasing complexity of software has led to a drastic rise in time and
costs for identifying and fixing bugs. Various approaches are explored in the
literature to generate fixes for buggy code automatically. However, few tools
and datasets are available to evaluate model-generated fixes effectively due to
the large combinatorial space of possible fixes for a particular bug. In this
work, we introduce FIXEVAL, a benchmark comprising buggy code submissions to
competitive programming problems and their respective fixes. FIXEVAL is
composed of a rich test suite to evaluate and assess the correctness of
model-generated program fixes and further information regarding time and memory
constraints and acceptance based on a verdict. We consider two Transformer
language models pretrained on programming languages as our baselines and
compare them using match-based and execution-based evaluation metrics. Our
experiments show that match-based metrics do not reflect model-generated
program fixes accurately. In contrast, execution-based methods evaluate
programs against all test cases and scenarios designed explicitly for that problem.
Therefore, we believe FIXEVAL provides a step towards real-world automatic bug
fixing and model-generated code evaluation. The dataset and models are
open-sourced at https://github.com/mahimanzum/FixEval.
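The gap between the two metric families discussed above can be shown in a few lines: a match-based check compares the fix's text to a reference patch, while an execution-based check runs the fix against the problem's test cases. The candidate fix and tests below are invented for illustration and are not from the FIXEVAL dataset.

```python
# Hypothetical match-based vs. execution-based evaluation of one fix.

reference_fix = "def absval(x):\n    return abs(x)"
# Textually different from the reference, but behaviorally correct.
candidate_fix = "def absval(x):\n    return -x if x < 0 else x"

# (input, expected output) pairs standing in for a problem's test suite.
tests = [(-3, 3), (0, 0), (7, 7)]

# Match-based metric: exact string comparison against the reference.
exact_match = candidate_fix.strip() == reference_fix.strip()

# Execution-based metric: load the candidate and run every test case.
namespace = {}
exec(candidate_fix, namespace)
passes_tests = all(namespace["absval"](x) == y for x, y in tests)

print(exact_match)   # False: the match-based metric rejects a correct fix
print(passes_tests)  # True: the execution-based metric accepts it
```

This mirrors the abstract's finding: match-based metrics can misjudge fixes that are semantically correct but textually different from the reference.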
Energy Consumption of Automated Program Repair
Automated program repair (APR) aims to automatize the process of repairing
software bugs in order to reduce the cost of maintaining software programs.
Moreover, the success (given by the accuracy metric) of APR approaches has been
increasing in recent years. However, no previous work has considered the energy
impact of repairing bugs automatically using APR. The field of green software
research aims to measure the energy consumption required to develop, maintain
and use software products. This paper combines, for the first time, the APR
and green software research fields. Our main goal is to define the foundation
for measuring the energy consumption of the APR activity. To that end, we present
a set of metrics specially crafted to measure the energy consumption of APR
tools and a generic methodology to calculate them. We instantiate the
methodology in the context of Java program repair. We measure the energy
consumption of 10 program repair tools trying to repair real bugs from
Defects4J, a benchmark of real buggy programs. The initial results from this
experiment show the existing trade-off between energy consumption and the
ability to correctly repair bugs: some APR tools are capable of achieving
higher accuracy while spending less energy than other tools.
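One metric in the spirit of this abstract is the energy spent per correctly repaired bug. The sketch below is our own formulation, not necessarily one of the paper's metrics, and the joule figures and repair counts are invented for illustration.

```python
# Hypothetical energy-efficiency metric for APR tools.

def energy_per_correct_repair(total_joules, correct_repairs):
    """Joules spent per correctly repaired bug (lower is better)."""
    if correct_repairs == 0:
        return float("inf")  # energy was spent but nothing was fixed
    return total_joules / correct_repairs

# Invented measurements for two tools on a Defects4J-style subset.
print(energy_per_correct_repair(12000.0, 8))   # 1500.0 J per fix
print(energy_per_correct_repair(9000.0, 10))   # 900.0 J per fix
```

Under these made-up numbers the second tool is both more accurate and cheaper per fix, matching the abstract's observation that higher accuracy does not always cost more energy.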