34,612 research outputs found

    Automatic Repair of Real Bugs: An Experience Report on the Defects4J Dataset

    Full text link
    Defects4J is a large, peer-reviewed, structured dataset of real-world Java bugs. Each bug in Defects4J is provided with a test suite and at least one failing test case that triggers the bug. In this paper, we report on an experiment to explore the effectiveness of automatic repair on Defects4J. The result of our experiment shows that 47 bugs of the Defects4J dataset can be automatically repaired by state-of- the-art repair. This sets a baseline for future research on automatic repair for Java. We have manually analyzed 84 different patches to assess their real correctness. In total, 9 real Java bugs can be correctly fixed with test-suite based repair. This analysis shows that test-suite based repair suffers from under-specified bugs, for which trivial and incorrect patches still pass the test suite. With respect to practical applicability, it takes in average 14.8 minutes to find a patch. The experiment was done on a scientific grid, totaling 17.6 days of computation time. All their systems and experimental results are publicly available on Github in order to facilitate future research on automatic repair

    Identifying Patch Correctness in Test-Based Program Repair

    Full text link
    Test-based automatic program repair has attracted a lot of attention in recent years. However, the test suites in practice are often too weak to guarantee correctness and existing approaches often generate a large number of incorrect patches. To reduce the number of incorrect patches generated, we propose a novel approach that heuristically determines the correctness of the generated patches. The core idea is to exploit the behavior similarity of test case executions. The passing tests on original and patched programs are likely to behave similarly while the failing tests on original and patched programs are likely to behave differently. Also, if two tests exhibit similar runtime behavior, the two tests are likely to have the same test results. Based on these observations, we generate new test inputs to enhance the test suites and use their behavior similarity to determine patch correctness. Our approach is evaluated on a dataset consisting of 139 patches generated from existing program repair systems including jGenProg, Nopol, jKali, ACS and HDRepair. Our approach successfully prevented 56.3\% of the incorrect patches to be generated, without blocking any correct patches.Comment: ICSE 201

    An Analysis of the Search Spaces for Generate and Validate Patch Generation Systems

    Get PDF
    We present the first systematic analysis of the characteristics of patch search spaces for automatic patch generation systems. We analyze the search spaces of two current state-of-the-art systems, SPR and Prophet, with 16 different search space configurations. Our results are derived from an analysis of 1104 different search spaces and 768 patch generation executions. Together these experiments consumed over 9000 hours of CPU time on Amazon EC2. The analysis shows that 1) correct patches are sparse in the search spaces (typically at most one correct patch per search space per defect), 2) incorrect patches that nevertheless pass all of the test cases in the validation test suite are typically orders of magnitude more abundant, and 3) leveraging information other than the test suite is therefore critical for enabling the system to successfully isolate correct patches. We also characterize a key tradeoff in the structure of the search spaces. Larger and richer search spaces that contain correct patches for more defects can actually cause systems to find fewer, not more, correct patches. We identify two reasons for this phenomenon: 1) increased validation times because of the presence of more candidate patches and 2) more incorrect patches that pass the test suite and block the discovery of correct patches. These fundamental properties, which are all characterized for the first time in this paper, help explain why past systems often fail to generate correct patches and help identify challenges, opportunities, and productive future directions for the field
    corecore