32 research outputs found
Identifying Patch Correctness in Test-Based Program Repair
Test-based automatic program repair has attracted a lot of attention in
recent years. However, the test suites in practice are often too weak to
guarantee correctness and existing approaches often generate a large number of
incorrect patches.
To reduce the number of incorrect patches generated, we propose a novel
approach that heuristically determines the correctness of the generated
patches. The core idea is to exploit the behavior similarity of test case
executions. The passing tests on original and patched programs are likely to
behave similarly while the failing tests on original and patched programs are
likely to behave differently. Also, if two tests exhibit similar runtime
behavior, the two tests are likely to have the same test results. Based on
these observations, we generate new test inputs to enhance the test suites and
use their behavior similarity to determine patch correctness.
Our approach is evaluated on a dataset consisting of 139 patches generated
from existing program repair systems including jGenProg, Nopol, jKali, ACS and
HDRepair. Our approach successfully prevented 56.3% of the incorrect patches
from being generated, without blocking any correct patches.
Comment: ICSE 201
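A minimal Python sketch of the heuristic described above, assuming test executions are abstracted as sets of covered statement ids and compared with Jaccard similarity; the threshold and trace representation are illustrative choices, not the authors' exact similarity metric.

```python
# Sketch: flag a patch as likely incorrect when a passing test's behaviour
# diverges too much, or a failing test's behaviour barely changes, between the
# original and patched programs. Traces as statement-id sets is an assumption.

def jaccard(a: set, b: set) -> float:
    """Similarity of two execution traces (sets of covered statement ids)."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def patch_looks_incorrect(passing_pairs, failing_pairs, threshold=0.8):
    """passing_pairs / failing_pairs: (trace_on_original, trace_on_patched)
    for tests that passed / failed on the original program."""
    for orig, patched in passing_pairs:
        if jaccard(orig, patched) < threshold:   # passing test diverged too much
            return True
    for orig, patched in failing_pairs:
        if jaccard(orig, patched) >= threshold:  # failing test barely changed
            return True
    return False

# Toy usage with statement-id traces:
passing = [({1, 2, 3}, {1, 2, 3})]
failing = [({1, 2, 4}, {1, 2, 5, 6})]
print(patch_looks_incorrect(passing, failing))  # False -> patch is kept
```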
Can LLMs Demystify Bug Reports?
Bugs are notoriously challenging: they slow down software users and result in
time-consuming investigations for developers. These challenges are exacerbated
when bugs must be reported in natural language by users. Indeed, we lack
reliable tools to automatically address reported bugs (i.e., enabling their
analysis, reproduction, and bug fixing). With the recent promises created by
LLMs such as ChatGPT for various tasks, including in software engineering, we
ask ourselves: What if ChatGPT could understand bug reports and reproduce them?
This question will be the main focus of this study. To evaluate whether ChatGPT
is capable of catching the semantics of bug reports, we used the popular
Defects4J benchmark with its bug reports. Our study has shown that ChatGPT was
able to demystify and reproduce 50% of the reported bugs. That ChatGPT can
automatically address half of the reported bugs shows promising potential for
applying machine learning to bug resolution, with a human in the loop only to
report the bug.
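A rough Python sketch of the kind of pipeline the study implies: prompt ChatGPT with a Defects4J bug report and count the bug as reproduced only if the returned test fails on the buggy version. The prompt wording and the call_chatgpt/run_test hooks are hypothetical stand-ins, not the authors' actual setup.

```python
# Sketch only: call_chatgpt is a placeholder for a real ChatGPT client, and
# run_test would compile and execute the candidate test against the buggy
# revision. Both are assumptions made for illustration.

def call_chatgpt(prompt: str) -> str:
    """Placeholder: returns a canned JUnit test instead of querying an LLM."""
    return "@Test public void reproducesIssue() { /* ... */ }"

def attempt_reproduction(bug_report: str, run_test) -> bool:
    prompt = (
        "The following bug report describes a defect in a Java project.\n"
        "Write a JUnit test that reproduces the reported behaviour.\n\n"
        + bug_report
    )
    candidate_test = call_chatgpt(prompt)
    # The bug counts as reproduced only if the generated test actually fails
    # on the buggy program, i.e. it triggers the reported defect.
    return run_test(candidate_test) == "FAIL"

# Toy usage with a stubbed test runner:
print(attempt_reproduction("NPE when parsing an empty JSON array",
                           lambda test: "FAIL"))  # True -> bug reproduced
```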
ObjSim: Lightweight Automatic Patch Prioritization via Object Similarity
In the context of test case based automatic program repair (APR), patches
that pass all the test cases but fail to fix the bug are called overfitted
patches. Currently, patches generated by APR tools get inspected manually by
the users to find and adopt genuine fixes. Being a laborious activity hindering
widespread adoption of APR, automatic identification of overfitted patches has
lately been the topic of active research. This paper presents engineering
details of ObjSim: a fully automatic, lightweight similarity-based patch
prioritization tool for JVM-based languages. The tool works by comparing the
system state at the exit point(s) of the patched method before and after patching,
and prioritizing patches that yield a state that is more similar to that of the
original, unpatched version on passing tests and less similar on failing
ones. Our experiments with patches generated by the recent APR tool PraPR for
fixable bugs from Defects4J v1.4.0 show that ObjSim prioritizes 16.67% more
genuine fixes in top-1 place. A demo video of the tool is located at
https://bit.ly/2K8gnYV.
Comment: Proceedings of the 29th ACM SIGSOFT International Symposium on
Software Testing and Analysis (ISSTA '20), July 18-22, 2020, Virtual Event,
US
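A compact Python sketch of ObjSim-style prioritization as described above; the real tool inspects JVM object graphs at method exit points, whereas here the captured state is assumed to be a flat field-to-value dict and similarity is the fraction of matching entries.

```python
# Sketch: rank patches so that those preserving exit-point state on passing
# tests and changing it on failing tests come first. The dict-based state and
# the similarity measure are illustrative assumptions.

def state_similarity(a: dict, b: dict) -> float:
    keys = set(a) | set(b)
    if not keys:
        return 1.0
    return sum(1 for k in keys if a.get(k) == b.get(k)) / len(keys)

def objsim_score(patch) -> float:
    """Higher when passing tests keep similar exit states and failing tests diverge."""
    score = 0.0
    for orig_state, patched_state in patch["passing_states"]:
        score += state_similarity(orig_state, patched_state)
    for orig_state, patched_state in patch["failing_states"]:
        score -= state_similarity(orig_state, patched_state)
    return score

def prioritize(patches):
    return sorted(patches, key=objsim_score, reverse=True)

# Toy usage: patch A preserves passing-test state and changes failing-test state.
patches = [
    {"id": "B", "passing_states": [({"x": 1}, {"x": 9})], "failing_states": [({"y": 2}, {"y": 2})]},
    {"id": "A", "passing_states": [({"x": 1}, {"x": 1})], "failing_states": [({"y": 2}, {"y": 7})]},
]
print([p["id"] for p in prioritize(patches)])  # ['A', 'B']
```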
You Cannot Fix What You Cannot Find! An Investigation of Fault Localization Bias in Benchmarking Automated Program Repair Systems
Properly benchmarking Automated Program Repair (APR) systems should
contribute to the development and adoption of the research outputs by
practitioners. To that end, the research community must ensure that it reaches
significant milestones by reliably comparing state-of-the-art tools for a
better understanding of their strengths and weaknesses. In this work, we
identify and investigate a practical bias caused by the fault localization (FL)
step in a repair pipeline. We propose to highlight the different fault
localization configurations used in the literature, and their impact on APR
systems when applied to the Defects4J benchmark. Then, we explore the
performance variations that can be achieved by 'tweaking' the FL step.
Eventually, we expect to create a new momentum for (1) full disclosure of APR
experimental procedures with respect to FL, (2) realistic expectations of
repairing bugs in Defects4J, as well as (3) reliable performance comparison
among the state-of-the-art APR systems, and against the baseline performance
results of our thoroughly assessed kPAR repair tool. Our main findings include:
(a) only a subset of Defects4J bugs can be currently localized by commonly-used
FL techniques; (b) current practice of comparing state-of-the-art APR systems
(i.e., counting the number of fixed bugs) is potentially misleading due to the
bias of FL configurations; and (c) APR authors do not properly qualify their
performance achievement with respect to the different tuning parameters
implemented in APR systems.
Comment: Accepted by ICST 201
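For readers unfamiliar with the FL step under discussion, here is a small Python sketch of one commonly used spectrum-based configuration (Ochiai suspiciousness); the coverage data is a toy example, not taken from the paper, and it only illustrates why a bug is repairable by APR only if its faulty line is ranked high enough.

```python
import math

# Ochiai suspiciousness: failed(s) / sqrt(total_failed * (failed(s) + passed(s))).
# Different FL configurations (formula, granularity, tie-breaking) change this
# ranking, which is exactly the bias the paper investigates.

def ochiai(failed_cov: int, passed_cov: int, total_failed: int) -> float:
    denom = math.sqrt(total_failed * (failed_cov + passed_cov))
    return failed_cov / denom if denom else 0.0

# statement -> (# failing tests covering it, # passing tests covering it); toy data
coverage = {"Foo.java:42": (2, 1), "Foo.java:57": (1, 5), "Bar.java:10": (0, 4)}
total_failed = 2

ranking = sorted(coverage, key=lambda s: ochiai(*coverage[s], total_failed), reverse=True)
print(ranking)  # the bug is localizable only if its faulty line sits near the top
```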
Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models
Software testing is a core discipline in software engineering where a large
array of research results has been produced, notably in the area of automatic
test generation. Because existing approaches produce test cases that either can
be qualified as simple (e.g. unit tests) or that require precise
specifications, most testing procedures still rely on test cases written by
humans to form test suites. Such test suites, however, are incomplete: they
only cover parts of the project or they are produced after the bug is fixed.
Yet, several research challenges, such as automatic program repair, and
practitioner processes, build on the assumption that available test suites are
sufficient. There is thus a need to break existing barriers in automatic test
case generation. While prior work largely focused on random unit testing
inputs, we propose to consider generating test cases that realistically
represent complex user execution scenarios, which reveal buggy behaviour. Such
scenarios are informally described in bug reports, which should therefore be
considered as natural inputs for specifying bug-triggering test cases. In this
work, we investigate the feasibility of performing this generation by
leveraging large language models (LLMs) and using bug reports as inputs. Our
experiments include the use of ChatGPT, as an online service, as well as
CodeGPT, a code-related pre-trained LLM that was fine-tuned for our task.
Overall, we experimentally show that bug reports associated with up to 50% of
Defects4J bugs can prompt ChatGPT to generate an executable test case. We show
that even new bug reports can indeed be used as input for generating executable
test cases. Finally, we report experimental results which confirm that
LLM-generated test cases are immediately useful in software engineering tasks
such as fault localization as well as patch validation in automated program
repair.
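A small Python sketch of the downstream use mentioned at the end of the abstract: once a bug report has been turned into an executable test, that test can gate patch validation in an APR pipeline. The run_test/run_suite hooks are hypothetical harness functions, not part of the authors' tooling.

```python
# Sketch: a candidate patch is accepted only if the LLM-generated,
# bug-triggering test now passes and the existing regression suite still passes.
# run_test / run_suite are assumed harness hooks that compile and execute tests.

def validate_patch(patched_program, generated_test, regression_suite,
                   run_test, run_suite) -> bool:
    return (run_test(patched_program, generated_test) == "PASS"
            and run_suite(patched_program, regression_suite) == "PASS")

# Toy usage with stubbed harness hooks:
run_test = lambda program, test: "PASS"
run_suite = lambda program, suite: "PASS"
print(validate_patch("patched.jar", "GeneratedBugTest.java", "regression/",
                     run_test, run_suite))  # True -> patch considered plausible
```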
Towards Generating Functionally Correct Code Edits from Natural Language Issue Descriptions
Large language models (LLMs), such as OpenAI's Codex, have demonstrated their
potential to generate code from natural language descriptions across a wide
range of programming tasks. Several benchmarks have recently emerged to
evaluate the ability of LLMs to generate functionally correct code from natural
language intent with respect to a set of hidden test cases. This has enabled
the research community to identify significant and reproducible advancements in
LLM capabilities. However, there is currently a lack of benchmark datasets for
assessing the ability of LLMs to generate functionally correct code edits based
on natural language descriptions of intended changes. This paper aims to
address this gap by motivating the problem NL2Fix of translating natural
language descriptions of code changes (namely bug fixes described in Issue
reports in repositories) into correct code fixes. To this end, we introduce
Defects4J-NL2Fix, a dataset of 283 Java programs from the popular Defects4J
dataset augmented with high-level descriptions of bug fixes, and empirically
evaluate the performance of several state-of-the-art LLMs for this task.
Results show that these LLMs together are capable of generating plausible fixes
for 64.6% of the bugs, and the best LLM-based technique can achieve up to
21.20% top-1 and 35.68% top-5 accuracy on this benchmark.
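For clarity, here is a minimal Python sketch of how top-k accuracy figures like those above are typically computed: a bug counts as fixed at top-k if any of the first k candidate edits sampled from the LLM passes the bug's hidden test suite. The per-bug results below are made up for illustration.

```python
# Sketch of the usual fixed@k computation; the boolean lists are toy data,
# not results from Defects4J-NL2Fix.

def top_k_accuracy(results, k: int) -> float:
    """results: one list per bug; entry i is True if candidate i passed the hidden tests."""
    fixed = sum(1 for candidates in results if any(candidates[:k]))
    return fixed / len(results)

results = [
    [False, True, False, False, False],   # fixed at top-5 but not top-1
    [True, False, False, False, False],   # fixed at top-1
    [False, False, False, False, False],  # not fixed
]
print(f"top-1: {top_k_accuracy(results, 1):.2%}, top-5: {top_k_accuracy(results, 5):.2%}")
```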