Variable-Based Fault Localization via Enhanced Decision Tree
Fault localization, which aims to identify the root cause of the bug under
repair, has been a longstanding research topic. Although many approaches have
been proposed over the past decades, most existing studies work at the
coarse-grained statement or method level, offering very limited insight into
how to repair the bug (the granularity problem), and few studies target
finer-grained fault localization. In this paper, we address the granularity
problem and propose a novel finer-grained, variable-level fault localization
technique. Specifically, we design a program-dependency-enhanced decision tree
model to boost the identification of fault-relevant variables via
discriminating failed and passed test cases based on the variable values. To
evaluate the effectiveness of our approach, we have implemented it in a tool
called VARDT and conducted an extensive study over the Defects4J benchmark. The
results show that VARDT outperforms state-of-the-art fault localization
approaches, improving the number of bugs located at Top-1 by at least 247.8%
(330.5% on average).
Besides, to investigate whether our finer-grained fault localization results
can further improve the effectiveness of downstream APR techniques, we have
adapted VARDT to the application of patch filtering, where VARDT outperforms
the state-of-the-art PATCH-SIM by filtering 26.0% more incorrect patches. These
results demonstrate the effectiveness of our approach and also suggest a new
direction for improving automatic program repair techniques.
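The core idea of discriminating failed and passed runs by observed variable values can be sketched as follows. This is an illustrative reconstruction, not VARDT's actual model: the variable names, data, and the plain information-gain stump are assumptions standing in for the paper's program-dependency-enhanced decision tree.

```python
# Hypothetical sketch: a variable whose observed values cleanly separate
# failing from passing test runs is likely fault-relevant.
from math import log2

def entropy(labels):
    """Shannon entropy of a pass/fail label list (1 = failing run)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # fraction of failing runs
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def split_gain(values, labels):
    """Best information gain over threshold splits on one variable."""
    best = 0.0
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        gain = entropy(labels) - (
            len(left) / len(labels) * entropy(left)
            + len(right) / len(labels) * entropy(right)
        )
        best = max(best, gain)
    return best

# Observed values of two variables over 6 test runs; label 1 = failing.
labels = [1, 1, 1, 0, 0, 0]
observations = {
    "idx": [9, 8, 10, 2, 1, 3],  # separates fail from pass perfectly
    "tmp": [4, 1, 3, 4, 2, 3],   # largely uninformative
}
ranking = sorted(observations,
                 key=lambda v: split_gain(observations[v], labels),
                 reverse=True)
print(ranking)  # "idx" ranks first as the fault-relevant variable
```

Ranking variables by how well a single split separates failing from passing runs is the simplest form of this idea; the paper's model additionally incorporates program dependencies.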
Mutation-based Fault Localization of Deep Neural Networks
Deep neural networks (DNNs) are susceptible to bugs, just like other types of
software systems. A significant uptick in the use of DNNs, and their
application in wide-ranging areas including safety-critical systems, warrants
extensive research on software engineering tools for improving the reliability
of DNN-based systems. One such tool that has gained significant attention in
recent years is DNN fault localization. This paper revisits mutation-based
fault localization in the context of DNN models and proposes a novel technique,
named deepmufl, applicable to a wide range of DNN models. We have implemented
deepmufl and have evaluated its effectiveness using 109 bugs obtained from
StackOverflow. Our results show that deepmufl detects 53/109 of the bugs by
ranking the buggy layer in top-1 position, outperforming state-of-the-art
static and dynamic DNN fault localization systems that are also designed to
target the class of bugs supported by deepmufl. Moreover, we observed that we
can halve the fault localization time for a pre-trained model using mutation
selection, while losing only 7.55% of the bugs localized in the top-1 position.
Comment: 38th IEEE/ACM International Conference on Automated Software
Engineering (ASE 2023)
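Mutation-based fault localization scores a program element by how its mutants affect failing versus passing tests. The sketch below uses a Metallaxis-style Ochiai kill ratio as one common formulation; treating it as deepmufl's scoring is an assumption, and the layer names and kill counts are illustrative.

```python
# Illustrative mutation-based suspiciousness scoring (Metallaxis-style,
# assumed here; not necessarily deepmufl's exact formula).
from math import sqrt

def metallaxis_score(mutant_kills, n_failing):
    """mutant_kills: one (killed_by_failing, killed_by_passing) pair of
    counts per mutant of the element being scored; an element takes the
    score of its most suspicious mutant."""
    best = 0.0
    for kf, kp in mutant_kills:
        denom = sqrt(n_failing * (kf + kp))
        if denom:
            best = max(best, kf / denom)  # Ochiai-style kill ratio
    return best

# Two program elements (e.g. DNN layers), three mutants each.
elements = {
    "layer_dense_1": [(3, 0), (2, 1), (3, 1)],  # killed mostly by failing tests
    "layer_conv_2": [(0, 2), (1, 3), (0, 1)],   # killed mostly by passing tests
}
n_failing = 3
ranked = sorted(elements,
                key=lambda e: metallaxis_score(elements[e], n_failing),
                reverse=True)
print(ranked)  # layer_dense_1 ranks as most suspicious
```

The mutation-selection result in the abstract corresponds to scoring only a sampled subset of each element's mutants, trading some top-1 accuracy for halved localization time.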
Automatically Repairing Programs Using Both Tests and Bug Reports
The success of automated program repair (APR) depends significantly on its
ability to localize the defects it is repairing. For fault localization (FL),
APR tools typically use either spectrum-based (SBFL) techniques that use test
executions or information-retrieval-based (IRFL) techniques that use bug
reports. These two approaches often complement each other, patching different
defects. No existing repair tool uses both SBFL and IRFL. We develop RAFL
(Rank-Aggregation-Based Fault Localization), a novel FL approach that combines
multiple FL techniques. We also develop Blues, a new IRFL technique that uses
bug reports, and an unsupervised approach to localize defects. On a dataset of
818 real-world defects, SBIR (combined SBFL and Blues) consistently localizes
more bugs and ranks buggy statements higher than the two underlying techniques.
For example, SBIR correctly identifies a buggy statement as the most suspicious
for 18.1% of the defects, while SBFL does so for 10.9% and Blues for 3.1%. We
extend SimFix, a state-of-the-art APR tool, to use SBIR, SBFL, and Blues.
SimFix using SBIR patches 112 out of the 818 defects; 110 when using SBFL, and
55 when using Blues. The 112 patched defects include 55 defects patched
exclusively using SBFL, 7 patched exclusively using IRFL, 47 patched using both
SBFL and IRFL and 3 new defects. SimFix using Blues significantly outperforms
iFixR, the state-of-the-art IRFL-based APR tool. Overall, SimFix using our FL
techniques patches ten defects no prior tools could patch. By evaluating on a
benchmark of 818 defects, 442 previously unused in APR evaluations, we find
that prior evaluations on the overused Defects4J benchmark have led to overly
generous findings. Our paper is the first to (1) use combined FL for APR, (2)
apply a more rigorous methodology for measuring patch correctness, and (3)
evaluate on the new, substantially larger version of Defects4J.
Comment: working paper
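The abstract does not spell out RAFL's aggregation method, so the sketch below uses a simple Borda count as one plausible way to combine per-technique rankings of suspicious statements; the statement identifiers are illustrative.

```python
# Hedged sketch of rank-aggregation-based fault localization: combine
# an SBFL ranking and an IRFL ranking by Borda count (an assumption;
# RAFL's actual aggregation may differ).
def borda_aggregate(rankings):
    """rankings: list of lists, each ordered most-suspicious-first.
    Each list awards n points to rank 1, n-1 to rank 2, and so on."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, stmt in enumerate(ranking):
            scores[stmt] = scores.get(stmt, 0) + (n - pos)
    return sorted(scores, key=scores.get, reverse=True)

sbfl = ["Foo.java:42", "Foo.java:17", "Bar.java:88"]  # from test coverage
irfl = ["Foo.java:42", "Bar.java:88", "Foo.java:17"]  # from the bug report
print(borda_aggregate([sbfl, irfl]))  # Foo.java:42 ranks first
```

A statement ranked highly by both techniques rises to the top of the combined list, which matches the abstract's observation that SBIR localizes more bugs than either underlying technique alone.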
Back to the Future! Studying Data Cleanness in Defects4J and its Impact on Fault Localization
For software testing research, Defects4J stands out as the primary benchmark
dataset, offering a controlled environment to study real bugs from prominent
open-source systems. However, prior research indicates that Defects4J might
include tests added post-bug report, embedding developer knowledge and
affecting fault localization efficacy. In this paper, we examine Defects4J's
fault-triggering tests, emphasizing the implications of developer knowledge of
SBFL techniques. We study the timelines of changes made to these tests
concerning bug report creation. Then, we study the effectiveness of SBFL
techniques without developer knowledge in the tests. We found that 1) 55% of
the fault-triggering tests were newly added to replicate the bug or to test for
regression; 2) 22% of the fault-triggering tests were modified after the bug
reports were created, containing developer knowledge of the bug; 3) developers
often modify the tests to include new assertions or change the test code to
reflect the changes in the source code; and 4) the performance of SBFL
techniques degrades significantly (Mean First Rank worsens by up to 415%) when
evaluated on the bugs without developer knowledge. We provide a dataset of bugs
without developer insights, aiding future SBFL evaluations in Defects4J and
informing considerations for future bug benchmarks.
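For reference, the SBFL techniques studied here score statements from per-test coverage. The snippet below implements Ochiai, a standard SBFL formula; the abstract does not name the specific formulas the study evaluates, so using Ochiai as the example is an assumption.

```python
# Minimal Ochiai suspiciousness: a statement executed by many failing
# tests and few passing tests scores close to 1.
from math import sqrt

def ochiai(exec_failed, exec_passed, total_failed):
    """Suspiciousness of one statement from its coverage counts:
    exec_failed/exec_passed = failing/passing tests that execute it."""
    denom = sqrt(total_failed * (exec_failed + exec_passed))
    return exec_failed / denom if denom else 0.0

# Statement covered by 2 of 2 failing tests and 1 passing test:
print(round(ochiai(2, 1, 2), 3))  # 0.816
```

A fault-triggering test added after the bug report inflates `exec_failed` for exactly the buggy statement, which is one mechanism by which the developer knowledge discussed above biases SBFL results.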
Directed Test Program Generation for JIT Compiler Bug Localization
Bug localization techniques for Just-in-Time (JIT) compilers are based on
analyzing the execution behaviors of the target JIT compiler on a set of test
programs generated for this purpose; characteristics of these test inputs can
significantly impact the accuracy of bug localization. However, current
approaches for automatic test program generation do not work well for bug
localization in JIT compilers. This paper proposes a novel technique for
automatic test program generation for JIT compiler bug localization that is
based on two key insights: (1) the generated test programs should contain both
passing inputs (which do not trigger the bug) and failing inputs (which trigger
the bug); and (2) the passing inputs should be as similar as possible to the
initial seed input, while the failing programs should be as different as
possible from it. We use a structural analysis of the seed program to determine
which parts of the code should be mutated for each of the passing and failing
cases. Experiments using a prototype implementation indicate that test inputs
generated using our approach yield significantly better bug localization
results than existing approaches.
Large Language Models in Fault Localisation
Large Language Models (LLMs) have shown promise in multiple software
engineering tasks including code generation, code summarisation, test
generation and code repair. Fault localisation is essential for facilitating
automatic program debugging and repair, and was demonstrated as a highlight at
ChatGPT-4's launch event. Nevertheless, there has been little work
understanding LLMs' capabilities for fault localisation in large-scale
open-source programs. To fill this gap, this paper presents an in-depth
investigation into the capability of ChatGPT-3.5 and ChatGPT-4, the two
state-of-the-art LLMs, on fault localisation. Using the widely-adopted
Defects4J dataset, we compare the two LLMs with the existing fault localisation
techniques. We also investigate the stability and explanation of LLMs in fault
localisation, as well as how prompt engineering and the length of code context
affect the fault localisation effectiveness. Our findings demonstrate that
within a limited code context, ChatGPT-4 outperforms all the existing fault
localisation methods. Additional error logs can further improve ChatGPT models'
localisation accuracy and stability, achieving on average 46.9% higher accuracy
than the state-of-the-art baseline SmartFL in terms of the TOP-1 metric. However,
performance declines dramatically when the code context expands to the
class-level, with ChatGPT models' effectiveness becoming inferior to the
existing methods overall. Additionally, we observe that ChatGPT's
explainability is unsatisfactory, with an accuracy rate of only approximately
30%. These observations demonstrate that while ChatGPT can achieve effective
fault localisation performance under certain conditions, evident limitations
exist. Further research is imperative to fully harness the potential of LLMs
like ChatGPT for practical fault localisation applications.
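The finding that error logs improve localisation while class-level context hurts it suggests a prompt built from a small code window plus the failure evidence. The sketch below assembles such a prompt; the wording and field layout are our assumptions, not the paper's actual prompt.

```python
# Hedged sketch of prompt assembly for LLM-based fault localisation
# (prompt structure is hypothetical; no API call is made here).
def build_fl_prompt(method_code, failing_test, error_log=None):
    parts = [
        "The following Java method contains a bug.",
        "Code:\n" + method_code,
        "Failing test:\n" + failing_test,
    ]
    if error_log:  # extra failure evidence, which the study found helpful
        parts.append("Error log:\n" + error_log)
    parts.append("Identify the most suspicious line and explain why.")
    return "\n\n".join(parts)

prompt = build_fl_prompt(
    "int mid = (lo + hi) / 2;",
    "testBinarySearchLargeBounds",
    "ArrayIndexOutOfBoundsException: -2147483648",
)
print(prompt.splitlines()[0])
```

Keeping `method_code` to a single method rather than a whole class mirrors the limited-context setting in which the abstract reports ChatGPT-4 performing best.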