8 research outputs found
On the Effectiveness of Information Retrieval Based Bug Localization for C Programs
International audienceLocalizing bugs is important, difficult, and expensive, especially for large software projects. To address this problem, information retrieval (IR) based bug localization has increasingly been used to suggest potential buggy files given a bug report. To date, researchers have proposed a number of IR techniques for bug localization and empirically evaluated them to understand their effectiveness. However, virtually all of the evaluations have been limited to the projects written in object-oriented programming languages, particularly Java. Therefore, the effectiveness of these techniques for other widely-used languages such as C is still unknown. In this paper, we create a benchmark dataset consisting of more than 7,500 bug reports from five popular C projects and rigorously evaluate our recently introduced IR-based bug localization tool using this dataset. Our results indicate that although the IR-relevant properties of C and Java programs are different, IR-based bug localization in C software at the file level is overall as effective as in Java software. However, we also find that the recent advance of using program structure information in performing bug localization gives less of a benefit for C software than for Java software
An Automatically Created Novel Bug Dataset and its Validation in Bug Prediction
Bugs are inescapable during software development due to frequent code
changes, tight deadlines, etc.; therefore, it is important to have tools to
find these errors. One way of performing bug identification is to analyze the
characteristics of buggy source code elements from the past and predict the
present ones based on the same characteristics, using e.g. machine learning
models. To support model building tasks, code elements and their
characteristics are collected in so-called bug datasets which serve as the
input for learning.
We present the \emph{BugHunter Dataset}: a novel kind of automatically
constructed and freely available bug dataset containing code elements (files,
classes, methods) with a wide set of code metrics and bug information. Other
available bug datasets follow the traditional approach of gathering the
characteristics of all source code elements (buggy and non-buggy) at only one
or more pre-selected release versions of the code. Our approach, on the other
hand, captures the buggy and the fixed states of the same source code elements
from the narrowest timeframe we can identify for a bug's presence, regardless
of release versions. To show the usefulness of the new dataset, we built and
evaluated bug prediction models and achieved F-measure values over 0.74
Using bug report similarity to enhance bug localisation
Bug localisation techniques are proposed as a method to reduce the time developers spend on maintenance, allowing them to quickly and source code relevant to a bug. Some techniques are based on information retrieval methods, treating the source code as a corpus and the bug report as a query. While these have shown success, there remain a number of little-exploited additional sources of information which could enhance the techniques, including the textual similarity between bug reports themselves. Based on successful results in detecting duplicate bug reports, this work asks: if duplicate bugs reports, which by denition are xed in the same source location, can be detected through the use of similar language, can bugs which are in the same location but not duplicates be detected in the same way? A technique using this information is implemented and evaluated on 372 bugs across 4 projects, and is found to improve performance on all projects. In particular, the technique increases the number of bugs where the rst relevant method presented to developers is the rst result from 6 to 27, and those in the top-10 from 50 to 57, showing that it can be successfully used to enhance existing bug localisation techniques
Supporting Text Retrieval Query Formulation In Software Engineering
The text found in software artifacts captures important information. Text Retrieval (TR) techniques have been successfully used to leverage this information. Despite their advantages, the success of TR techniques strongly depends on the textual queries given as input. When poorly chosen queries are used, developers can waste time investigating irrelevant results.
The quality of a query indicates the relevance of the results returned by TR in response to the query and can give an indication if the results are worth investigating or a reformulation of the query should be sought instead. Knowing the quality of the query could lead to time saved when irrelevant results are returned. However, the only way to determine if a query led to the wanted artifacts is by manually inspecting the list of results. This dissertation introduces novel approaches to measure and predict the quality of queries automatically in the context of SE tasks, based on a set of statistical properties of the queries. The approaches are evaluated for the task of concept location in source code. The results reveal that the proposed approaches are able to accurately capture and predict the quality of queries for SE tasks supported by TR.
When a query has low quality, the developer can reformulate it and improve it. However, this is just as hard as formulating the query in the first place. This dissertation presents two approaches for partial and complete automation of the query reformulation process. The semi-automatic approach relies on developer feedback about the relevance of TR results and uses this information to automatically reformulate the query. The automatic approach learns and applies the best reformulation approach for a query and relies on a set of training queries and their statistical properties to achieve this. Both approaches are evaluated for concept location and the results show that the techniques are able to improve the results of the original queries in the majority of the cases.
We expect that on the long run the proposed approaches will contribute directly to the reduction of developer effort and implicitly the reduction of software evolution costs
Configuring and Assembling Information Retrieval based Solutions for Software Engineering Tasks.
Information Retrieval (IR) approaches are used to leverage textual or unstructured data generated during the software development process to support various software engineering (SE) tasks (e.g., concept location, traceability link recovery, change impact analysis, etc.). Two of the most important steps for applying IR techniques to support SE tasks are preprocessing the corpus and configuring the IR technique, and these steps can significantly influence the outcome and the amount of effort developers have to spend for these maintenance tasks. We present the use of Genetic Algorithms (GAs) to automatically configure and assemble an IR process to support SE tasks. The approach named IR-GA determines the (near) optimal solution to be used for each step of the IR process without requiring any training. We applied IR-GA on three different SE tasks and the results of the study indicate that IR-GA outperforms approaches previously used in the literature, and that it does not significantly differ from an ideal upper bound that could be achieved by a supervised approach and a combinatorial approach
Bug localisation in ArgoUML, JabRef, JEdit & muCommander
<p>Results from S. Davies, M. Roper and M.Wood, "Using bug report similarity to enhance bug localisation", in Proc. WCRE, 2012 (http://dx.doi.org/10.1109/WCRE.2012.22)</p>
<p> </p>
<p>Can be interpreted with code from https://bitbucket.org/el_loserio/buglanguagesimilarity/src/wcre_2012_published</p