Automatically Repairing Programs Using Both Tests and Bug Reports
The success of automated program repair (APR) depends significantly on its
ability to localize the defects it is repairing. For fault localization (FL),
APR tools typically use either spectrum-based (SBFL) techniques that use test
executions or information-retrieval-based (IRFL) techniques that use bug
reports. These two approaches often complement each other, patching different
defects. No existing repair tool uses both SBFL and IRFL. We develop RAFL
(Rank-Aggregation-Based Fault Localization), a novel FL approach that combines
multiple FL techniques. We also develop Blues, a new IRFL technique that uses
bug reports, and an unsupervised approach to localize defects. On a dataset of
818 real-world defects, SBIR (combined SBFL and Blues) consistently localizes
more bugs and ranks buggy statements higher than the two underlying techniques.
For example, SBIR correctly identifies a buggy statement as the most suspicious
for 18.1% of the defects, while SBFL does so for 10.9% and Blues for 3.1%. We
extend SimFix, a state-of-the-art APR tool, to use SBIR, SBFL, and Blues.
SimFix using SBIR patches 112 out of the 818 defects; 110 when using SBFL, and
55 when using Blues. The 112 patched defects include 55 defects patched
exclusively using SBFL, 7 patched exclusively using IRFL, 47 patched using both
SBFL and IRFL and 3 new defects. SimFix using Blues significantly outperforms
iFixR, the state-of-the-art IRFL-based APR tool. Overall, SimFix using our FL
techniques patches ten defects no prior tools could patch. By evaluating on a
benchmark of 818 defects, 442 previously unused in APR evaluations, we find
that prior evaluations on the overused Defects4J benchmark have led to overly
generous findings. Our paper is the first to (1) use combined FL for APR, (2)
apply a more rigorous methodology for measuring patch correctness, and (3)
evaluate on the new, substantially larger version of Defects4J.
Comment: working paper
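The idea of combining an SBFL ranking with an IRFL ranking can be illustrated with a small rank-aggregation sketch. The abstract does not say which aggregation algorithm RAFL uses, so the Borda count below is a stand-in chosen only for illustration; the function name `borda_aggregate` and the statement identifiers are hypothetical.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Combine ranked lists (most suspicious first) with a Borda count.

    Assumption: Borda count is used here purely as an illustrative
    aggregation rule; it is not claimed to be the method used by RAFL.
    """
    # Every statement that appears in at least one ranking is a candidate.
    universe = set()
    for ranking in rankings:
        universe.update(ranking)
    n = len(universe)

    scores = defaultdict(float)
    for ranking in rankings:
        # The top statement earns n points, the next n - 1, and so on.
        # Statements absent from a ranking earn nothing from it.
        for position, stmt in enumerate(ranking):
            scores[stmt] += n - position

    # Highest score first; ties are broken by name for determinism.
    return sorted(universe, key=lambda s: (-scores[s], s))


# Hypothetical rankings from a spectrum-based and a bug-report-based technique.
sbfl_ranking = ["Foo.java:42", "Foo.java:17", "Bar.java:8"]
irfl_ranking = ["Bar.java:8", "Foo.java:42", "Baz.java:3"]

combined = borda_aggregate([sbfl_ranking, irfl_ranking])
print(combined)
```

In this toy input, a statement that both techniques rank highly rises to the top of the combined list, which is the intuition behind using the two kinds of fault localization together.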
Alleviating Patch Overfitting with Automatic Test Generation: A Study of Feasibility and Effectiveness for the Nopol Repair System
Among the many different kinds of program repair techniques, one widely studied family is test-suite-based repair. However, test suites are in essence input-output specifications and are thus typically inadequate for completely specifying the expected behavior of the program under repair. Consequently, the patches generated by test-suite-based repair techniques can overfit to the test suite used and fail to generalize to other tests. We deeply analyze the overfitting problem in program repair and give a classification of this problem. This classification will help the community to better understand and design techniques to defeat the overfitting problem. We further propose and evaluate an approach called UnsatGuided, which aims to alleviate the overfitting problem for synthesis-based repair techniques with automatic test case generation. The approach uses additional automatically generated tests to strengthen the repair constraint used by synthesis-based repair techniques. We analyze the effectiveness of UnsatGuided: 1) analytically with respect to alleviating two different kinds of overfitting issues; 2) empirically based on an experiment over the 224 bugs of the Defects4J repository. The main result is that automatic test generation is effective in alleviating one kind of overfitting issue (regression introduction) but, due to the oracle problem, has minimal positive impact on alleviating the other kind (incomplete fixing).
Improving the Correctness of Automated Program Repair
Developers spend much of their time fixing bugs in software programs. Automated program repair (APR) techniques aim to relieve developers of the burden of bug fixing by generating patches at the source-code level. Recently, Generate-and-Validate (G&V) APR techniques have shown great potential to repair general bugs in real-world applications. Recent evaluations show that G&V techniques repair 8–17.7% of the bugs collected from mature Java or C open-source projects. Despite these promising results, G&V techniques may generate many incorrect patches and cannot repair every bug.
This thesis improves the correctness of APR in two ways: by improving the quality assurance of automatically generated patches, and by generating more correct patches through the use of human knowledge. First, this thesis investigates whether improved test-suite-based validation can precisely identify incorrect patches generated by G&V techniques, and whether it can help G&V techniques generate more correct patches. The result of this investigation is Opad, which combines new fuzz-generated test cases with additional oracles (i.e., memory oracles) to identify incorrect patches and help G&V techniques repair more bugs correctly. The evaluation of Opad shows that the improved test-suite-based validation identifies 75.2% of the incorrect patches generated by G&V techniques. With the integration of Opad, SPR, one of the most promising G&V techniques, repairs one additional bug.
Second, this thesis proposes novel APR techniques that repair more bugs correctly by leveraging human knowledge. As a result, APR techniques can repair new types of bugs that are not currently targeted by G&V APR techniques. Human knowledge about bug-fixing activities is recorded in forms such as bug-fix commits, developers' expertise, and documentation pages. Two techniques (APARE and Priv) are proposed to target two types of defects, respectively: project-specific recurring bugs and vulnerability warnings reported by static analysis.
APARE automatically learns fix patterns from historical bug fixes (i.e., fixes originally crafted by developers), uses a spectrum-based fault-localization technique to identify highly likely faulty methods, and applies the learned fix patterns to generate patches for developers to review. The key innovation of APARE is a percentage semantic-aware matching algorithm between fix patterns and faulty locations. For the 20 recurring bugs, APARE generates 34 method fixes, 24 of which (70.6%) are correct; 83.3% (20 out of 24) are identical to the fixes written by developers. In addition, APARE complements current repair systems by generating 20 high-quality method fixes that RSRepair and PAR cannot generate.
Priv is a multi-stage remediation system specifically designed for static-analysis security-testing (SAST) techniques. The prototype is built and evaluated on a commercial SAST product. The first stage of Priv prioritizes the work of fixing vulnerability warnings based on shared fix locations. The likely fix locations are suggested by a set of rules, which were developed in collaboration with two security experts. The second stage of Priv provides additional information that makes diagnosis and fixing more efficient. Priv offers two types of additional information: identification of true database/attribute-related warnings, and customized fix suggestions for each warning. The evaluation shows that Priv suggests fix locations identical to those suggested by developers for 50–100% of the evaluated vulnerability findings. Priv identifies up to 2170 actionable vulnerability findings for the six evaluated projects. Manual examination confirms that Priv can generate high-quality patches for many of the evaluated vulnerability warnings.
Detect and Repair Errors for DNN-based Software
Nowadays, software based on deep neural networks (DNNs) is widely applied in many areas, including safety-critical ones such as traffic control, medical diagnosis, and malware detection. However, the software engineering techniques that are supposed to guarantee functionality, safety, and fairness are not well studied. For example, some serious crashes of DNN-based autonomous cars have been reported. These crashes could have been avoided if the DNN-based software had been well tested. Traditional software testing, debugging, and repair techniques do not work well on DNN-based software because there is no control flow, data flow, or AST (abstract syntax tree) in deep neural networks. Proposing software engineering techniques targeted at DNN-based software is imperative. In this thesis, we first introduced the development of the SE (software engineering) for AI (artificial intelligence) area and how our work has influenced the advancement of this new area. Then we summarized related work and some important concepts in the SE for AI area. Finally, we discussed four important works of ours.
Our first project, DeepTest, is one of the first few papers proposing systematic software testing techniques for DNN-based software. We proposed neuron-coverage-guided image synthesis techniques for DNN-based autonomous cars and leveraged domain-specific metamorphic relations to generate oracles for newly generated test cases, so that DNN-based software can be tested automatically. We applied DeepTest to three top-performing self-driving car models from the Udacity self-driving car challenge, and our tool identified thousands of erroneous behaviors that may lead to potentially fatal crashes.
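As a rough illustration of the neuron coverage metric that guides the test generation, the sketch below computes the fraction of neurons whose activation exceeds a threshold for at least one test input. It is a minimal, framework-free sketch: the function name `neuron_coverage`, the activation values, and the threshold of 0.5 are all assumptions made for the example and are not taken from the thesis.

```python
def neuron_coverage(activations, threshold=0.0):
    """Fraction of neurons activated above `threshold` by at least one input.

    `activations` is a list of per-input activation vectors, each holding
    one value per neuron. Assumption: activations have already been
    extracted from the model and scaled to a comparable range.
    """
    if not activations or not activations[0]:
        return 0.0

    num_neurons = len(activations[0])
    covered = [False] * num_neurons
    for vector in activations:
        for i, value in enumerate(vector):
            # A neuron counts as covered once any input activates it.
            if value > threshold:
                covered[i] = True

    return sum(covered) / num_neurons


# Hypothetical activations of four neurons on two test inputs.
test_activations = [
    [0.9, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.7, 0.0],
]
coverage = neuron_coverage(test_activations, threshold=0.5)
print(coverage)  # 0.5: neurons 0 and 2 are covered, neurons 1 and 3 are not
```

A coverage-guided generator would keep a synthesized image only if it raises this number, that is, if it activates a neuron that no earlier input has activated.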
In the DeepTest project, we found that natural variations such as spatial transformations or rain/fog effects lead to problematic corner cases for DNN-based self-driving cars. In the follow-up project, DeepRobust, we studied the per-point robustness of deep neural networks under natural variation. We found that, for a DNN model, some specific weak points are more likely than others to cause erroneous outputs under natural variation. We proposed a white-box approach and a black-box approach to identify these weak data points. We implemented and evaluated our approaches on 9 DNN-based image classifiers and 3 DNN-based self-driving car models. Our approaches can successfully detect weak points with good precision and recall for both DNN-based image classifiers and self-driving cars.
Most existing work in the SE for AI area, including our DeepTest and DeepRobust, focuses on instance-wise errors, which are single inputs that result in a DNN model's erroneous outputs. Unlike instance-wise errors, group-level errors reflect a DNN model's weak performance in differentiating among certain classes or its inconsistent performance across classes. This type of error is very concerning since it has been found to be related to many notorious real-world errors that occurred without malicious attackers. In our third project, DeepInspect, we first introduced group-level errors for DNN-based software and categorized them into confusion errors and bias errors based on real-world reports. Then we proposed a neuron-coverage-based distance metric to detect group-level errors for DNN-based software without requiring labels. We applied DeepInspect to 8 pretrained DNN models trained on 6 popular image classification datasets, including three adversarially trained models. We showed that DeepInspect can successfully detect group-level violations for both single-label and multi-label classification models with high precision.
As a follow-up and more challenging research project, we proposed five WR (weighted regularization) techniques to repair group-level errors for DNN-based software. These five techniques operate at different stages of DNN retraining or inference, including the input phase, layer phase, loss phase, and output phase. We compared and evaluated the five WR techniques on both single-label and multi-label classification, using five combinations of four DNN architectures on four datasets. We showed that WR can effectively fix confusion and bias errors, and that each method has its own pros, cons, and applicable scenarios.
All four projects discussed in this thesis have solved important problems in ensuring the functionality, safety, and fairness of DNN-based software and have had significant influence on the advancement of the SE for AI area.