Evaluating Representation Learning of Code Changes for Predicting Patch Correctness in Program Repair
A large body of the literature on automated program repair develops approaches where patches are generated to be validated against an oracle (e.g., a test suite). Because such an oracle can be imperfect, the generated patches, although validated by the oracle, may actually be incorrect. While the state of the art explores research directions that require dynamic information or rely on manually crafted heuristics, we study the benefit of learning code representations to derive deep features that may encode the properties of patch correctness. Our work mainly investigates different representation learning approaches for code changes to derive embeddings that are amenable to similarity computations. We report on findings based on embeddings produced by pre-trained and re-trained neural networks. Experimental results demonstrate the potential of embeddings to empower learning algorithms in reasoning about patch correctness: a machine learning predictor with BERT transformer-based embeddings associated with logistic regression yielded an AUC value of about 0.8 in predicting patch correctness on a deduplicated dataset of 1,000 labeled patches. Our study shows that learned representations can lead to reasonable performance when compared against the state of the art, PATCH-SIM, which relies on dynamic information. These representations may further be complementary to features that were carefully (manually) engineered in the literature.
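The core mechanics described above, embedding code changes and comparing them by similarity before feeding a classifier, can be illustrated with a minimal sketch. The `cosine` function and the toy vectors are illustrative stand-ins: the paper uses BERT-produced embeddings and a logistic regression classifier, neither of which is reproduced here.

```python
# Minimal sketch (assumption: toy vectors standing in for BERT embeddings
# of a buggy snippet and its patched version; the actual paper feeds such
# embeddings to a logistic regression classifier).
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical embeddings of the code before and after patching.
buggy_vec = [0.9, 0.1, 0.3]
patched_vec = [0.8, 0.2, 0.35]
similarity = cosine(buggy_vec, patched_vec)
# `similarity` would be one input signal for a correctness classifier.
```

In the paper's setting the similarity scores (and the embeddings themselves) become features for a supervised predictor rather than a decision rule on their own.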
Learning Code Change Semantics for Patch Correctness Assessment in Program Repair
State-of-the-art APR techniques currently produce patches that are manually evaluated as overfitting, and these overfitting patches often worsen the original program, leading to negative effects such as introducing security vulnerabilities and removing useful features. This obstructs the development of APR techniques that rely on feedback from correctly generated patches, and the expense of developers’ manual debugging has shifted to evaluating patch correctness. Automated assessment of patch correctness has the potential to reduce patch validation costs and accelerate the identification of practically correct patches, making it easier for developers to adopt APR techniques. While the proposed approaches have been demonstrated to be effective in the literature, several challenges remain unexplored and warrant further investigation.
This thesis begins with an empirical analysis of a prevalent hypothesis concerning patch correctness, leading to the establishment of a patch correctness prediction framework based on representation learning. Second, we propose to validate correct patches via a novel heuristic on the relationship between patches and their associated failing test cases. Lastly, we present a novel perspective to assess patch correctness with natural language processing. Our contributions to the research field through this thesis are as follows: 1) assessing the feasibility of utilizing advancements in deep representation learning to generate patch embeddings suitable for reasoning about correctness. Consequently, we establish Leopard, a supervised learning-based patch correctness prediction framework. 2) comparing code embeddings and engineered features for patch correctness prediction, and investigating their combination in Panther (an upgraded version of Leopard) for more accurate classification. Additionally, we use the SHAP explainability model to reveal the essential aspects of patch correctness by interpreting the underlying causes of prediction performance across features and classifiers. 3) presenting and validating a key hypothesis: when different programs fail to pass similar test cases, it is likely that these programs require similar code changes. Based on this heuristic, we propose BATS, an approach that predicts patch correctness by statically comparing generated patches against previous correct patches that failed on similar tests. 4) proposing a novel perspective on patch correctness assessment: a correct patch implements changes that address the issue caused by the buggy behavior. By leveraging bug reports to offer an explicit description of the bug, we build Quatrain, a supervised learning approach that utilizes a deep NLP model to predict the relevance between a bug report and a patch description.
ObjSim: Lightweight Automatic Patch Prioritization via Object Similarity
In the context of test-case-based automatic program repair (APR), patches that pass all the test cases but fail to fix the bug are called overfitted patches. Currently, patches generated by APR tools are inspected manually by users to find and adopt genuine fixes. Because this is a laborious activity that hinders widespread adoption of APR, automatic identification of overfitted patches has lately been the topic of active research. This paper presents engineering details of ObjSim: a fully automatic, lightweight similarity-based patch prioritization tool for JVM-based languages. The tool works by comparing the system state at the exit point(s) of the patched method before and after patching, and prioritizing patches that result in a state that is more similar to that of the original, unpatched version on passing tests while less similar on failing ones. Our experiments with patches generated by the recent APR tool PraPR for fixable bugs from Defects4J v1.4.0 show that ObjSim prioritizes 16.67% more genuine fixes in the top-1 place. A demo video of the tool is located at https://bit.ly/2K8gnYV. (Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA '20), July 18-22, 2020, Virtual Event, USA.)
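The prioritization rule described in the abstract, prefer patches whose post-patch state stays similar to the original on passing tests but diverges on failing ones, can be sketched as a simple scoring function. The score formula and the data below are illustrative assumptions, not ObjSim's actual implementation, which computes object-graph similarities at method exit points.

```python
# Hypothetical sketch of ObjSim-style ranking (assumption: each patch
# carries mean state-similarity scores to the unpatched run, pre-computed
# over passing and failing tests; the combination rule is illustrative).
def score(patch):
    # Higher is better: similar behavior on passing tests,
    # different behavior on failing tests.
    return patch["sim_pass"] - patch["sim_fail"]

patches = [
    {"id": "p1", "sim_pass": 0.95, "sim_fail": 0.90},  # barely changes failing runs
    {"id": "p2", "sim_pass": 0.92, "sim_fail": 0.40},  # diverges where it should
]
ranked = sorted(patches, key=score, reverse=True)
# "p2" ranks first: it preserves passing behavior while changing failing behavior.
```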
The Best of Both Worlds: Combining Learned Embeddings with Engineered Features for Accurate Prediction of Correct Patches
A large body of the literature on automated program repair develops approaches where patches are automatically generated to be validated against an oracle (e.g., a test suite). Because such an oracle can be imperfect, the generated patches, although validated by the oracle, may actually be incorrect. Our empirical work investigates different representation learning approaches for code changes to derive embeddings that are amenable to similarity computations for patch correctness identification, and assesses the possibility of accurately classifying correct patches by combining learned embeddings with engineered features. Experimental results demonstrate the potential of learned embeddings to empower Leopard (a patch correctness prediction framework implemented in this work) with learning algorithms in reasoning about patch correctness: a machine learning predictor with BERT transformer-based learned embeddings associated with XGBoost achieves an AUC value of about 0.895 in the prediction of patch correctness on a new dataset of 2,147 labeled patches that we collected for the experiments. Our investigations show that deep learned embeddings can lead to complementary/better performance when compared against the state-of-the-art PATCH-SIM, which relies on dynamic information. By combining deep learned embeddings and engineered features, Panther (the upgraded version of Leopard implemented in this work) outperforms Leopard with higher scores in terms of AUC, +Recall, and -Recall, and can accurately identify more (in)correct patches that cannot be predicted by classifiers using only learned embeddings or engineered features. Finally, we use an explainable ML technique, SHAP, to empirically interpret how the learned embeddings and engineered features contribute to the patch correctness prediction.
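The combination step at the heart of Panther can be sketched as simple feature concatenation before classification. The feature names and dimensions below are invented for illustration; the paper pairs full BERT embeddings with its engineered feature set and classifies with XGBoost, none of which is reproduced here.

```python
# Sketch of combining learned embeddings with engineered features
# (assumption: toy 3-d vectors stand in for a real BERT embedding and
# for engineered patch metrics such as hunk counts or token overlap).
def combine(embedding, engineered):
    # Concatenate the two feature groups into one classifier input.
    return list(embedding) + list(engineered)

bert_embedding = [0.12, -0.08, 0.44]  # stand-in for a high-dimensional embedding
engineered = [3, 1, 0.7]              # e.g., #modified lines, #hunks, similarity score
features = combine(bert_embedding, engineered)
# `features` would feed a supervised classifier (XGBoost in the paper).
```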
Test-based Patch Clustering for Automatically-Generated Patches Assessment
Previous studies have shown that Automated Program Repair (APR) techniques suffer from the overfitting problem. Overfitting happens when a patch is run and the test suite does not reveal any error, but the patch actually does not fix the underlying bug or introduces a new defect that is not covered by the test suite. Therefore, the patches generated by APR tools need to be validated by human programmers, which can be very costly and prevents the adoption of APR tools in practice. Our work aims at increasing developer trust in automated patch generation by minimizing the number of plausible patches that they have to review, thereby reducing the time required to find a correct patch. We introduce a novel lightweight test-based patch clustering approach called xTestCluster, which clusters patches based on their dynamic behavior. xTestCluster is applied after the patch generation phase in order to analyze the generated patches from one or more repair tools. The novelty of xTestCluster lies in using information from the execution of newly generated test cases to cluster patches generated by multiple APR approaches. A cluster is formed with patches that fail on the same generated test cases. The output of xTestCluster gives developers a) a way of reducing the number of patches to analyze, as they can focus on analyzing a sample of patches from each cluster, and b) additional information attached to each patch. After analyzing 1,910 plausible patches from 25 Java APR tools, our results show that xTestCluster is able to reduce the number of patches to review and analyze by a median of 50%. xTestCluster can save a significant amount of time for developers who have to review the multitude of patches generated by APR tools, and provides them with new test cases that show the differences in behavior between generated patches.
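The grouping rule the abstract describes, one cluster per distinct set of failing generated tests, reduces to keying patches by that set. The patch and test names below are invented; the sketch only shows the clustering criterion, not xTestCluster's test generation or reporting.

```python
# Minimal sketch of xTestCluster's clustering criterion (assumption:
# we already know, for each patch, which generated test cases it fails).
from collections import defaultdict

def cluster_by_failing_tests(patch_failures):
    # Patches failing exactly the same generated tests share a cluster.
    clusters = defaultdict(list)
    for patch, failing in patch_failures.items():
        clusters[frozenset(failing)].append(patch)
    return clusters

failures = {
    "patch_a": {"t1", "t3"},
    "patch_b": {"t1", "t3"},  # same dynamic behavior as patch_a
    "patch_c": {"t2"},
}
clusters = cluster_by_failing_tests(failures)
# Two clusters: {patch_a, patch_b} and {patch_c}; a developer can review
# one representative per cluster instead of every patch.
```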
Predicting Patch Correctness Based on the Similarity of Failing Test Cases
Towards predicting patch correctness in APR, we propose a simple but novel hypothesis on how the link between patch behaviour and failing test specifications can be drawn: similar failing test cases should require similar patches. We then propose BATS, an unsupervised learning-based system to predict patch correctness by checking patch Behaviour Against failing Test Specification. BATS exploits deep representation learning models for code and patches: for a given failing test case, the yielded embedding is used to compute similarity metrics in the search for historical similar test cases, in order to identify the associated applied patches, which are then used as a proxy for assessing the correctness of generated patches. Experimentally, we first validate our hypothesis by assessing whether ground-truth developer patches cluster together in the same way that their associated failing test cases are clustered. Then, after collecting a large dataset of 1,278 plausible patches (written by developers or generated by some 32 APR tools), we use BATS to predict correctness: BATS achieves an AUC between 0.557 and 0.718 and a recall between 0.562 and 0.854 in identifying correct patches. Compared against previous work, we demonstrate that our approach outperforms the state of the art in patch correctness prediction, without the need for large labeled patch datasets, in contrast with prior machine learning-based approaches. While BATS is constrained by the availability of similar test cases, we show that it can still be complementary to existing approaches: used in conjunction with a recent approach implementing supervised learning, BATS improves the overall recall in detecting correct patches. We finally show that BATS can be complementary to the state-of-the-art PATCH-SIM dynamic approach for identifying correct patches generated by APR tools.
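BATS's core lookup, embed the failing test, retrieve the most similar historical test, and compare the generated patch against the patch that fixed that historical bug, can be sketched in a few lines. The 2-d vectors, the history table, and the threshold are toy assumptions standing in for the learned code/patch representations the paper uses.

```python
# Hypothetical sketch of BATS's nearest-test lookup (assumption: tiny 2-d
# vectors stand in for learned embeddings; `history` maps a failing-test
# embedding to the embedding of the patch that fixed it; threshold is toy).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

history = {
    (1.0, 0.0): (0.9, 0.1),  # historical failing test -> its correct patch
    (0.0, 1.0): (0.1, 0.9),
}

def predict_correct(test_vec, patch_vec, threshold=0.8):
    # Find the most similar historical failing test, then judge the new
    # patch by its similarity to that test's known-correct patch.
    nearest = max(history, key=lambda t: cosine(t, test_vec))
    return cosine(history[nearest], patch_vec) >= threshold

# A patch resembling the historical fix is predicted correct; an
# unrelated one is not.
```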