    Evaluating Representation Learning of Code Changes for Predicting Patch Correctness in Program Repair

    A large body of the literature on automated program repair develops approaches where patches are generated to be validated against an oracle (e.g., a test suite). Because such an oracle can be imperfect, the generated patches, although validated by the oracle, may actually be incorrect. While the state of the art explores research directions that require dynamic information or rely on manually-crafted heuristics, we study the benefit of learning code representations to derive deep features that may encode the properties of patch correctness. Our work mainly investigates different representation learning approaches for code changes to derive embeddings that are amenable to similarity computations. We report on findings based on embeddings produced by pre-trained and re-trained neural networks. Experimental results demonstrate the potential of embeddings to empower learning algorithms in reasoning about patch correctness: a machine learning predictor combining BERT transformer-based embeddings with logistic regression yielded an AUC value of about 0.8 in predicting patch correctness on a deduplicated dataset of 1,000 labeled patches. Our study shows that learned representations can lead to reasonable performance when compared against the state of the art, PATCH-SIM, which relies on dynamic information. These representations may further be complementary to features that were carefully (manually) engineered in the literature.
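    The abstract's headline result is a pipeline of pre-trained code-change embeddings fed to a simple classifier. Below is a minimal sketch of that idea, assuming a generic Hugging Face BERT encoder and a concatenated buggy/patched embedding as the feature vector; the model name, feature layout, and toy data are illustrative assumptions, not the authors' exact setup.

```python
# Sketch: embed buggy and patched code with a pre-trained BERT encoder,
# then train logistic regression on the paired embeddings.
# (Illustrative assumptions throughout; not the paper's exact pipeline.)
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(code: str) -> np.ndarray:
    """Mean-pooled BERT embedding of a code snippet."""
    inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

def change_features(buggy: str, patched: str) -> np.ndarray:
    # One plausible encoding of a code change: concatenate both embeddings.
    return np.concatenate([embed(buggy), embed(patched)])

# Toy labeled patches: 1 = correct, 0 = incorrect/overfitting (hypothetical).
pairs = [
    ("int f(int a,int b){return a+b;}", "int f(int a,int b){return Math.addExact(a,b);}"),
    ("int f(int a,int b){return a+b;}", "int f(int a,int b){return 0;}"),
]
labels = [1, 0]

X = np.stack([change_features(b, p) for b, p in pairs])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict_proba(X)[:, 1])  # predicted correctness probabilities
```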

    Learning Code Change Semantics for Patch Correctness Assessment in Program Repair

    State-of-the-art APR techniques currently produce patches that are manually evaluated as overfitting, and these overfitting patches often worsen the original program, leading to negative effects such as introducing security vulnerabilities and removing useful features. This obstructs the development of APR techniques that rely on feedback from correctly generated patches, and the expense of developers’ manual debugging has shifted to evaluating patch correctness. Automated assessment of patch correctness has the potential to reduce patch validation costs and accelerate the identification of practically correct patches, making it easier for developers to adopt APR techniques. While approaches proposed in the literature have been demonstrated to be effective, several challenges remain unexplored and warrant further investigation. This thesis begins with an empirical analysis of a prevalent hypothesis concerning patch correctness, leading to the establishment of a patch correctness prediction framework based on representation learning. Second, we validate correct patches with a novel heuristic on the relationship between patches and their associated failing test cases. Lastly, we present a novel perspective on assessing patch correctness with natural language processing. Our contributions to the research field through this thesis are as follows: 1) assessing the feasibility of utilizing advancements in deep representation learning to generate patch embeddings suitable for reasoning about correctness; consequently, we establish Leopard, a supervised learning-based patch correctness prediction framework. 2) comparing code embeddings and engineered features for patch correctness prediction, and investigating their combination in Panther (an upgraded version of Leopard) for more accurate classification. Additionally, we use the SHAP explainability model to reveal the essential aspects of patch correctness by interpreting the underlying causes of prediction performance across features and classifiers. 3) presenting and validating a key hypothesis: when different programs fail to pass similar test cases, it is likely that these programs require similar code changes. Based on this heuristic, we propose BATS, an approach that predicts patch correctness by statically comparing generated patches against previous correct patches that failed on similar tests. 4) proposing a novel perspective on patch correctness assessment: a correct patch implements changes that address the issue caused by the buggy behavior. By leveraging bug reports to offer an explicit description of the bug, we build Quatrain, a supervised learning approach that utilizes a deep NLP model to predict the relevance between a bug report and a patch description.
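    Among the four contributions, the Quatrain framing is the easiest to make concrete: score how well a patch description "answers" the bug report. Quatrain itself trains a deep NLP relevance model; the sketch below substitutes TF-IDF cosine similarity purely to illustrate the idea, and all strings are hypothetical.

```python
# Sketch of the Quatrain framing from contribution 4): relate a bug report
# to candidate patch descriptions. Quatrain trains a deep NLP model; the
# TF-IDF cosine similarity here is only a stand-in for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

bug_report = "NullPointerException when parsing an empty configuration file"
patch_descriptions = [
    "Add a null check before dereferencing the parsed configuration",
    "Rename internal helper variables for readability",
]

vectorizer = TfidfVectorizer().fit([bug_report] + patch_descriptions)
report_vec = vectorizer.transform([bug_report])
patch_vecs = vectorizer.transform(patch_descriptions)

# A higher score suggests the patch addresses the reported buggy behavior.
for desc, score in zip(patch_descriptions, cosine_similarity(report_vec, patch_vecs)[0]):
    print(f"{score:.2f}  {desc}")
```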

    ObjSim: Lightweight Automatic Patch Prioritization via Object Similarity

    In the context of test case based automatic program repair (APR), patches that pass all the test cases but fail to fix the bug are called overfitted patches. Currently, patches generated by APR tools are inspected manually by users to find and adopt genuine fixes. Since this manual inspection is a laborious activity that hinders widespread adoption of APR, automatic identification of overfitted patches has lately been the topic of active research. This paper presents the engineering details of ObjSim: a fully automatic, lightweight similarity-based patch prioritization tool for JVM-based languages. The tool works by comparing the system state at the exit point(s) of the patched method before and after patching, and prioritizing patches that result in a state more similar to that of the original, unpatched version on passing tests while less similar on failing ones. Our experiments with patches generated by the recent APR tool PraPR for fixable bugs from Defects4J v1.4.0 show that ObjSim prioritizes 16.67% more genuine fixes in the top-1 position. A demo video of the tool is located at https://bit.ly/2K8gnYV.
    Comment: Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA '20), July 18-22, 2020, Virtual Event, USA
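    The prioritization heuristic lends itself to a compact sketch: capture the state at the patched method's exit before and after patching, and rank highest the patches that stay close to the original on passing tests but diverge on failing ones. The flat field-to-value state representation and the similarity metric below are assumptions for illustration, not ObjSim's actual JVM-level mechanism.

```python
# Sketch of similarity-based patch prioritization in the spirit of ObjSim.
# States are flat field-to-value dicts (an illustrative simplification).
from dataclasses import dataclass
from statistics import mean

@dataclass
class Patch:
    name: str
    passing: list  # (state_before, state_after) pairs on passing tests
    failing: list  # (state_before, state_after) pairs on failing tests

def state_similarity(before: dict, after: dict) -> float:
    """Fraction of fields holding the same value in both states."""
    keys = set(before) | set(after)
    return sum(before.get(k) == after.get(k) for k in keys) / len(keys) if keys else 1.0

def score(p: Patch) -> float:
    # Genuine fixes should preserve behavior on passing tests (high similarity)
    # and change behavior on failing tests (low similarity).
    return mean(state_similarity(b, a) for b, a in p.passing) - \
           mean(state_similarity(b, a) for b, a in p.failing)

patches = [  # hypothetical captured states
    Patch("p1", passing=[({"x": 1}, {"x": 1})], failing=[({"x": 1}, {"x": 2})]),
    Patch("p2", passing=[({"x": 1}, {"x": 9})], failing=[({"x": 1}, {"x": 1})]),
]
print([p.name for p in sorted(patches, key=score, reverse=True)])  # p1 ranks first
```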

    The Best of Both Worlds: Combining Learned Embeddings with Engineered Features for Accurate Prediction of Correct Patches

    A large body of the literature on automated program repair develops approaches where patches are automatically generated to be validated against an oracle (e.g., a test suite). Because such an oracle can be imperfect, the generated patches, although validated by the oracle, may actually be incorrect. Our empirical work investigates different representation learning approaches for code changes to derive embeddings that are amenable to similarity computations for patch correctness identification, and assesses the possibility of accurately classifying correct patches by combining learned embeddings with engineered features. Experimental results demonstrate the potential of learned embeddings to empower Leopard (a patch correctness prediction framework implemented in this work) with learning algorithms in reasoning about patch correctness: a machine learning predictor with BERT transformer-based learned embeddings associated with XGBoost achieves an AUC value of about 0.895 in the prediction of patch correctness on a new dataset of 2,147 labeled patches that we collected for the experiments. Our investigations show that deep learned embeddings can lead to complementary or better performance when compared against the state of the art, PATCH-SIM, which relies on dynamic information. By combining deep learned embeddings and engineered features, Panther (the upgraded version of Leopard implemented in this work) outperforms Leopard with higher scores in terms of AUC, +Recall and -Recall, and can accurately identify more (in)correct patches that cannot be predicted by classifiers using only learned embeddings or engineered features. Finally, we use an explainable ML technique, SHAP, to empirically interpret how the learned embeddings and engineered features contribute to the patch correctness prediction.
    Comment: arXiv admin note: substantial text overlap with arXiv:2008.0294
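    The "best of both worlds" combination reduces to feature concatenation ahead of a tree-ensemble classifier. Here is a minimal sketch with synthetic data, assuming the xgboost package; the dimensions, hyperparameters, and feature semantics are illustrative, not the paper's exact configuration.

```python
# Sketch: concatenate learned patch embeddings with engineered features and
# train an XGBoost classifier, as in the Panther setup described above.
# Synthetic data; dimensions and hyperparameters are illustrative.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 200
learned = rng.normal(size=(n, 768))    # e.g., BERT embedding of the code change
engineered = rng.normal(size=(n, 20))  # e.g., counts of edited statements, AST edit types
X = np.hstack([learned, engineered])   # combined feature vector
y = rng.integers(0, 2, size=n)         # 1 = correct, 0 = incorrect (toy labels)

clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
clf.fit(X, y)
proba = clf.predict_proba(X)[:, 1]

# For the interpretation step, shap.TreeExplainer(clf) could attribute each
# prediction to individual embedding dimensions and engineered features.
```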

    Test-based Patch Clustering for Automatically-Generated Patches Assessment

    Previous studies have shown that Automated Program Repair (APR) techniques suffer from the overfitting problem. Overfitting happens when a patch is run and the test suite does not reveal any error, but the patch actually does not fix the underlying bug, or it introduces a new defect that is not covered by the test suite. Therefore, the patches generated by APR tools need to be validated by human programmers, which can be very costly and prevents the adoption of APR tools in practice. Our work aims at increasing developer trust in automated patch generation by minimizing the number of plausible patches that they have to review, thereby reducing the time required to find a correct patch. We introduce a novel lightweight test-based patch clustering approach called xTestCluster, which clusters patches based on their dynamic behavior. xTestCluster is applied after the patch generation phase in order to analyze the generated patches from one or more repair tools. The novelty of xTestCluster lies in using information from the execution of newly generated test cases to cluster patches generated by multiple APR approaches. A cluster is formed with patches that fail on the same generated test cases. The output of xTestCluster gives developers a) a way of reducing the number of patches to analyze, as they can focus on analyzing a sample of patches from each cluster, and b) additional information attached to each patch. After analyzing 1,910 plausible patches from 25 Java APR tools, our results show that xTestCluster is able to reduce the number of patches to review and analyze by a median of 50%. xTestCluster can save a significant amount of time for developers who have to review the multitude of patches generated by APR tools, and provides them with new test cases that show the differences in behavior between generated patches.
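    The core clustering step is simple enough to sketch directly: group patches by the exact set of generated tests they fail. Test generation and execution are abstracted away here, and the failing-test sets are hypothetical.

```python
# Sketch: cluster patches by their failing-test signature, so patches with
# identical behavior on the generated tests land in the same cluster.
from collections import defaultdict

failing_tests = {  # hypothetical patch -> set of failed generated tests
    "patch_A": frozenset({"t1", "t3"}),
    "patch_B": frozenset({"t1", "t3"}),  # behaves like patch_A
    "patch_C": frozenset({"t2"}),
    "patch_D": frozenset(),              # passes every generated test
}

clusters = defaultdict(list)
for patch, failed in failing_tests.items():
    clusters[failed].append(patch)

# Reviewers can sample one representative per cluster instead of all patches.
for signature, members in clusters.items():
    print(sorted(signature) or "no failures", "->", members)
```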

    Predicting Patch Correctness Based on the Similarity of Failing Test Cases

    Towards predicting patch correctness in APR, we propose a simple but novel hypothesis on how the link between patch behaviour and failing test specifications can be drawn: similar failing test cases should require similar patches. We then propose BATS, an unsupervised learning-based system that predicts patch correctness by checking patch Behaviour Against failing Test Specification. BATS exploits deep representation learning models for code and patches: for a given failing test case, the yielded embedding is used to compute similarity metrics in the search for historical similar test cases, in order to identify the associated applied patches, which are then used as a proxy for assessing the correctness of generated patches. Experimentally, we first validate our hypothesis by assessing whether ground-truth developer patches cluster together in the same way that their associated failing test cases are clustered. Then, after collecting a large dataset of 1,278 plausible patches (written by developers or generated by some 32 APR tools), we use BATS to predict correctness: BATS achieves an AUC between 0.557 and 0.718 and a recall between 0.562 and 0.854 in identifying correct patches. Compared against previous work, we demonstrate that our approach outperforms the state of the art in patch correctness prediction, without the need for large labeled patch datasets, in contrast with prior machine learning-based approaches. While BATS is constrained by the availability of similar test cases, we show that it can still be complementary to existing approaches: used in conjunction with a recent approach implementing supervised learning, BATS improves the overall recall in detecting correct patches. We finally show that BATS can be complementary to PATCH-SIM, the state-of-the-art dynamic approach for identifying correct patches for APR tools.
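    The retrieval-based pipeline can be sketched in a few lines: embed the failing test, find its nearest historical failing tests, and score the generated patch by its similarity to the developer patches those tests received. The hashing-based embed function below is a crude stand-in for the deep code representation models BATS actually uses; the history data is hypothetical.

```python
# Sketch of the BATS idea: similar failing tests should require similar
# patches. The embedding is a crude character-bigram hash, standing in for
# the deep code representation models used in the paper.
import numpy as np

def embed(code: str) -> np.ndarray:
    vec = np.zeros(256)
    for a, b in zip(code, code[1:]):
        vec[(ord(a) * 31 + ord(b)) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def bats_score(failing_test: str, generated_patch: str, history, k: int = 3) -> float:
    """history: list of (past_failing_test, developer_patch) pairs."""
    q = embed(failing_test)
    nearest = sorted(history, key=lambda h: float(q @ embed(h[0])), reverse=True)[:k]
    g = embed(generated_patch)
    # High similarity to the patches of similar tests suggests correctness.
    return max(float(g @ embed(dev_patch)) for _, dev_patch in nearest)

history = [  # hypothetical past (failing test, developer patch) pairs
    ("assertEquals(2, add(1,1));", "return a + b;"),
    ("assertNull(lookup(null));", "if (key == null) return null;"),
]
print(bats_score("assertEquals(5, add(2,3));", "return a + b;", history, k=1))
```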