Can LLMs Demystify Bug Reports?
Bugs are notoriously challenging: they slow down software users and result in
time-consuming investigations for developers. These challenges are exacerbated
when bugs must be reported in natural language by users. Indeed, we lack
reliable tools to automatically address reported bugs (i.e., to enable their
analysis, reproduction, and fixing). Given the recent promise shown by LLMs
such as ChatGPT across various tasks, including in software engineering, we
ask ourselves: What if ChatGPT could understand bug reports and reproduce them?
This question will be the main focus of this study. To evaluate whether ChatGPT
is capable of capturing the semantics of bug reports, we used the popular
Defects4J benchmark with its bug reports. Our study shows that ChatGPT was
able to demystify and reproduce 50% of the reported bugs. That ChatGPT can
automatically address half of the reported bugs indicates promising potential
for applying machine learning to bug resolution, with a human in the loop only
to report the bug.
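A minimal sketch of how such a bug-report-to-reproduction step could be wired up, assuming the OpenAI Python client; the prompt wording, model name, and helper function are illustrative assumptions, not the study's actual protocol.

```python
# Hypothetical sketch: ask an LLM to turn a natural-language bug report
# into a candidate reproduction test. Model name, prompt, and helper
# names are assumptions, not the study's actual setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def draft_reproduction_test(bug_report: str, class_under_test: str) -> str:
    """Ask the model for a JUnit test expected to trigger the reported bug."""
    prompt = (
        "You are given a bug report for a Java project.\n"
        f"Bug report:\n{bug_report}\n\n"
        f"Write a single JUnit test against {class_under_test} that is "
        "expected to FAIL on the buggy version because it reproduces the bug."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response.choices[0].message.content

# The returned test would then be compiled and run against the Defects4J
# buggy revision; a failing run counts as a successful reproduction.
```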
Mining Fix Patterns for FindBugs Violations
In this paper, we first collect and track a large number of fixed and unfixed
violations across revisions of software.
The empirical analyses reveal that there are discrepancies in the
distributions of violations that are detected and those that are fixed, in
terms of occurrences, spread and categories, which can provide insights into
prioritizing violations.
To automatically identify patterns in violations and their fixes, we propose
an approach that utilizes convolutional neural networks to learn features and
clustering to regroup similar instances. We then evaluate the usefulness of the
identified fix patterns by applying them to unfixed violations.
The results show that developers will accept and merge a majority (69/116) of
fixes generated from the inferred fix patterns. It is also noteworthy that the
yielded patterns are applicable to four real bugs in Defects4J, a major
benchmark for software testing and automated repair.
Comment: Accepted for IEEE Transactions on Software Engineering
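As a rough sketch of the feature-learning-plus-clustering idea described above, assuming violation patches have already been tokenized and embedded as fixed-size matrices; the layer sizes, fixed cluster count, and all names are illustrative assumptions, not the paper's actual architecture.

```python
# Illustrative sketch only: encode token-embedded violation patches with a
# small 1D CNN, then group the resulting feature vectors with k-means.
# Shapes, layer sizes, and the cluster count are assumptions.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans


class PatchEncoder(nn.Module):
    def __init__(self, embed_dim: int = 64, feat_dim: int = 32):
        super().__init__()
        self.conv = nn.Conv1d(embed_dim, feat_dim, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)  # collapse the token dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, embed_dim, num_tokens) -> (batch, feat_dim)
        return self.pool(torch.relu(self.conv(x))).squeeze(-1)


# Fake batch standing in for embedded violation patches.
patches = torch.randn(200, 64, 40)
with torch.no_grad():
    features = PatchEncoder()(patches).numpy()

# Regroup similar fixes; each cluster is a candidate fix-pattern family.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(features)
```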
A Dataset of Android Libraries
Android app developers extensively employ code reuse, integrating many
third-party libraries into their apps. While such integration is practical for
developers, it can be challenging for static analyzers to achieve scalability
and precision when such libraries can account for a large part of the app code.
As a direct consequence, when a static analysis is performed, it is common
practice in the literature to consider only developer code, with the
assumption that the sought issues lie in developer code rather than in the
libraries. However, analysts need to precisely distinguish between library code
and developer code in Android apps to ensure the effectiveness of static
analysis. Currently, many static analysis approaches rely on white lists of
libraries. However, these white lists are unreliable, as they are inaccurate
and largely non-comprehensive.
In this paper, we propose a new approach to address the lack of comprehensive
and automated solutions for the production of accurate and "always up to date"
sets of third-party libraries. First, we demonstrate the continued need for a
white list of third-party libraries. Second, we propose an automated approach
to produce an accurate and up-to-date set of third-party libraries in the form
of a dataset called AndroLibZoo. Our dataset, which we make available to the
research community, contains 20,162 libraries to date and is meant to evolve.
Third, we illustrate the significance of using AndroLibZoo to filter libraries
in recent apps. Fourth, we demonstrate that AndroLibZoo is more suitable than
the current state-of-the-art list for improved static analysis. Finally, we
show how the use of AndroLibZoo can enhance the performance of existing Android
app static analyzers.
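For illustration, a hedged sketch of the library-filtering step such a list enables: classes whose package matches a known library prefix are excluded before analysis. The file format and function names here are assumptions, not AndroLibZoo's actual interface.

```python
# Hypothetical sketch: filter out library classes before static analysis,
# using a package-prefix list in the spirit of AndroLibZoo. The file format
# and function names are assumptions, not the dataset's actual interface.
from typing import Iterable, Set


def load_library_prefixes(path: str) -> Set[str]:
    """One library package name per line, e.g. 'com.google.gson'."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def is_library_class(class_name: str, prefixes: Set[str]) -> bool:
    # 'com.google.gson.Gson' matches the prefix 'com.google.gson'
    return any(class_name == p or class_name.startswith(p + ".") for p in prefixes)


def developer_classes(all_classes: Iterable[str], prefixes: Set[str]) -> list:
    """Keep only classes the analyzer should treat as developer code."""
    return [c for c in all_classes if not is_library_class(c, prefixes)]
```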
TBar: Revisiting Template-based Automated Program Repair
We revisit the performance of template-based APR to build comprehensive
knowledge about the effectiveness of fix patterns, and to highlight the
importance of complementary steps such as fault localization or donor code
retrieval. To that end, we first investigate the literature to collect,
summarize and label recurrently-used fix patterns. Based on the investigation,
we build TBar, a straightforward APR tool that systematically attempts to apply
these fix patterns to program bugs. We thoroughly evaluate TBar on the
Defects4J benchmark. In particular, we assess the actual qualitative and
quantitative diversity of fix patterns, as well as their effectiveness in
yielding plausible or correct patches. Eventually, we find that, assuming a
perfect fault localization, TBar correctly/plausibly fixes 74/101 bugs.
Replicating a standard and practical pipeline of APR assessment, we demonstrate
that TBar correctly fixes 43 bugs from Defects4J, an unprecedented performance
in the literature (including all approaches, i.e., template-based, stochastic
mutation-based or synthesis-based APR).
Comment: Accepted by ISSTA 2019
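To make the notion of a fix pattern concrete, here is a toy, string-level sketch of one recurrent template (guarding a suspicious statement with a null check); TBar itself works on ASTs and validates candidate patches against the test suite, so this is an illustration only.

```python
# Toy illustration of a single fix pattern: guard a suspicious statement
# with a null check on a given variable. Real template-based APR (e.g. TBar)
# operates on ASTs and validates candidates against the test suite.
def apply_null_check_template(lines, line_no, var):
    """Wrap the 1-indexed suspicious line in 'if (var != null) { ... }'."""
    stmt = lines[line_no - 1]
    indent = stmt[: len(stmt) - len(stmt.lstrip())]
    guarded = [
        f"{indent}if ({var} != null) {{",
        f"    {stmt}",
        f"{indent}}}",
    ]
    return lines[: line_no - 1] + guarded + lines[line_no:]


buggy = ["String s = lookup(key);", "int n = s.length();"]
print("\n".join(apply_null_check_template(buggy, 2, "s")))
# A real pipeline would compile the patched program and rerun the failing
# tests to decide whether the candidate patch is plausible.
```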
You Cannot Fix What You Cannot Find! An Investigation of Fault Localization Bias in Benchmarking Automated Program Repair Systems
Properly benchmarking Automated Program Repair (APR) systems should
contribute to the development and adoption of the research outputs by
practitioners. To that end, the research community must ensure that it reaches
significant milestones by reliably comparing state-of-the-art tools for a
better understanding of their strengths and weaknesses. In this work, we
identify and investigate a practical bias caused by the fault localization (FL)
step in a repair pipeline. We propose to highlight the different fault
localization configurations used in the literature, and their impact on APR
systems when applied to the Defects4J benchmark. Then, we explore the
performance variations that can be achieved by 'tweaking' the FL step.
Eventually, we expect to create a new momentum for (1) full disclosure of APR
experimental procedures with respect to FL, (2) realistic expectations of
repairing bugs in Defects4J, as well as (3) reliable performance comparison
among the state-of-the-art APR systems, and against the baseline performance
results of our thoroughly assessed kPAR repair tool. Our main findings include:
(a) only a subset of Defects4J bugs can be currently localized by commonly-used
FL techniques; (b) current practice of comparing state-of-the-art APR systems
(i.e., counting the number of fixed bugs) is potentially misleading due to the
bias of FL configurations; and (c) APR authors do not properly qualify their
performance achievement with respect to the different tuning parameters
implemented in APR systems.
Comment: Accepted by ICST 2019
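Because the FL configuration is central to the bias discussed above, a small sketch of spectrum-based fault localization with the Ochiai metric, the kind of suspiciousness score APR pipelines typically rank statements by; the coverage representation is an assumption made for this sketch.

```python
# Illustrative spectrum-based fault localization with the Ochiai metric:
# suspiciousness(s) = ef / sqrt((ef + nf) * (ef + ep)), where ef/ep count
# failing/passing tests that execute statement s and nf counts failing
# tests that do not. The input format is an assumption for this sketch.
from math import sqrt


def ochiai(coverage, failing, passing):
    """coverage maps a statement id to the set of test ids that execute it."""
    scores = {}
    for stmt, tests in coverage.items():
        ef = len(tests & failing)
        ep = len(tests & passing)
        nf = len(failing - tests)
        denom = sqrt((ef + nf) * (ef + ep))
        scores[stmt] = ef / denom if denom else 0.0
    return scores


cov = {"Foo.java:42": {"t1", "t2"}, "Foo.java:43": {"t2"}}
print(ochiai(cov, failing={"t2"}, passing={"t1"}))
# An APR tool would then attempt repairs in decreasing score order, which is
# exactly where differing FL configurations can bias tool comparisons.
```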