
    On the Effectiveness of Information Retrieval Based Bug Localization for C Programs

    Localizing bugs is important, difficult, and expensive, especially for large software projects. To address this problem, information retrieval (IR) based bug localization has increasingly been used to suggest potential buggy files given a bug report. To date, researchers have proposed a number of IR techniques for bug localization and empirically evaluated them to understand their effectiveness. However, virtually all of the evaluations have been limited to projects written in object-oriented programming languages, particularly Java. Therefore, the effectiveness of these techniques for other widely used languages such as C is still unknown. In this paper, we create a benchmark dataset consisting of more than 7,500 bug reports from five popular C projects and rigorously evaluate our recently introduced IR-based bug localization tool using this dataset. Our results indicate that although the IR-relevant properties of C and Java programs are different, IR-based bug localization in C software at the file level is overall as effective as in Java software. However, we also find that the recent advance of using program structure information in bug localization gives less of a benefit for C software than for Java software.
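    The core retrieval step the abstract describes can be sketched as ranking source files by TF-IDF cosine similarity to the bug report. This is a minimal illustration of the general IR approach, not the paper's actual tool; the file contents and report text below are made up.

```python
# Rank candidate source files against a bug report using TF-IDF vectors
# and cosine similarity -- the textbook IR setup for bug localization.
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build a TF-IDF vector (term -> weight) for each document."""
    n = len(docs)
    df = Counter()
    tokenized = [doc.lower().split() for doc in docs]
    for toks in tokenized:
        df.update(set(toks))
    return [{t: tf[t] * math.log(n / df[t]) for t in tf}
            for tf in (Counter(toks) for toks in tokenized)]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_files(bug_report, files):
    """Return file names ordered by textual similarity to the bug report."""
    names = list(files)
    vecs = tf_idf_vectors([files[n] for n in names] + [bug_report])
    query = vecs[-1]
    scored = sorted(zip(names, vecs[:-1]),
                    key=lambda nv: cosine(query, nv[1]), reverse=True)
    return [name for name, _ in scored]

files = {
    "parser.c": "parse token stream syntax error recovery",
    "alloc.c":  "allocate free memory pool buffer",
    "net.c":    "socket connect timeout retry buffer",
}
print(rank_files("crash on syntax error while parsing", files))
```

    The paper's point is that this style of ranking transfers from Java to C at the file level, even though the vocabulary characteristics of the two languages differ.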

    An Automatically Created Novel Bug Dataset and its Validation in Bug Prediction

    Bugs are inescapable during software development due to frequent code changes, tight deadlines, etc.; therefore, it is important to have tools that find these errors. One way of performing bug identification is to analyze the characteristics of buggy source code elements from the past and predict the present ones based on the same characteristics, using e.g. machine learning models. To support model building tasks, code elements and their characteristics are collected in so-called bug datasets, which serve as the input for learning. We present the BugHunter Dataset: a novel kind of automatically constructed and freely available bug dataset containing code elements (files, classes, methods) with a wide set of code metrics and bug information. Other available bug datasets follow the traditional approach of gathering the characteristics of all source code elements (buggy and non-buggy) at one or more pre-selected release versions of the code. Our approach, on the other hand, captures the buggy and the fixed states of the same source code elements from the narrowest timeframe we can identify for a bug's presence, regardless of release versions. To show the usefulness of the new dataset, we built and evaluated bug prediction models and achieved F-measure values over 0.74.
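    The reported F-measure above 0.74 is the standard harmonic mean of precision and recall. A quick sketch of how it is computed for a set of predicted buggy elements; the element names below are illustrative, not from the dataset.

```python
# F-measure over sets of buggy code elements: harmonic mean of
# precision (correct predictions / all predictions) and
# recall (correct predictions / all actual bugs).
def f_measure(predicted, actual):
    tp = len(predicted & actual)  # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return 2 * precision * recall / (precision + recall)

actual_buggy    = {"Parser.java", "Lexer.java", "Cache.java", "Pool.java"}
predicted_buggy = {"Parser.java", "Lexer.java", "Cache.java", "Util.java"}
print(f_measure(predicted_buggy, actual_buggy))  # precision 0.75, recall 0.75
```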

    Using bug report similarity to enhance bug localisation

    Bug localisation techniques are proposed as a method to reduce the time developers spend on maintenance, allowing them to quickly find source code relevant to a bug. Some techniques are based on information retrieval methods, treating the source code as a corpus and the bug report as a query. While these have shown success, there remain a number of little-exploited additional sources of information which could enhance the techniques, including the textual similarity between bug reports themselves. Based on successful results in detecting duplicate bug reports, this work asks: if duplicate bug reports, which by definition are fixed in the same source location, can be detected through the use of similar language, can bugs which are in the same location but not duplicates be detected in the same way? A technique using this information is implemented and evaluated on 372 bugs across 4 projects, and is found to improve performance on all projects. In particular, the technique increases the number of bugs where the first relevant method presented to developers is the first result from 6 to 27, and those in the top 10 from 50 to 57, showing that it can be successfully used to enhance existing bug localisation techniques.
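    The idea can be sketched as blending a file's IR score with the new report's textual similarity to past reports fixed in that file. Jaccard token overlap stands in here for whatever similarity measure the paper actually uses, and all scores, reports, and the `alpha` weight are illustrative.

```python
# Boost IR-based localisation scores using similarity between the new
# bug report and previously fixed reports associated with each file.
def jaccard(a, b):
    """Token-set overlap between two bug report texts, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def combined_scores(new_report, ir_scores, fixed_reports, alpha=0.5):
    """Blend each file's IR score with its best past-report similarity."""
    scores = {}
    for f, ir in ir_scores.items():
        past = fixed_reports.get(f, [])
        sim = max((jaccard(new_report, r) for r in past), default=0.0)
        scores[f] = alpha * ir + (1 - alpha) * sim
    return scores

ir_scores = {"Editor.java": 0.30, "Search.java": 0.28}
fixed_reports = {
    "Search.java": ["find and replace dialog ignores regex flag"],
    "Editor.java": ["undo history lost after save"],
}
new_report = "regex flag ignored in find and replace"
scores = combined_scores(new_report, ir_scores, fixed_reports)
print(max(scores, key=scores.get))
```

    In this toy example the similarity signal overturns the raw IR ranking, which is exactly the effect the evaluation measures (more bugs with a relevant method at rank 1).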

    Supporting Text Retrieval Query Formulation In Software Engineering

    The text found in software artifacts captures important information. Text Retrieval (TR) techniques have been successfully used to leverage this information. Despite their advantages, the success of TR techniques strongly depends on the textual queries given as input. When poorly chosen queries are used, developers can waste time investigating irrelevant results. The quality of a query indicates the relevance of the results returned by TR in response to the query and can indicate whether the results are worth investigating or whether the query should instead be reformulated. Knowing the quality of the query could lead to time saved when irrelevant results are returned. However, the only way to determine if a query led to the wanted artifacts is by manually inspecting the list of results. This dissertation introduces novel approaches to measure and predict the quality of queries automatically in the context of SE tasks, based on a set of statistical properties of the queries. The approaches are evaluated for the task of concept location in source code. The results reveal that the proposed approaches are able to accurately capture and predict the quality of queries for SE tasks supported by TR. When a query has low quality, the developer can reformulate it and improve it. However, this is just as hard as formulating the query in the first place. This dissertation presents two approaches for partial and complete automation of the query reformulation process. The semi-automatic approach relies on developer feedback about the relevance of TR results and uses this information to automatically reformulate the query. The automatic approach learns and applies the best reformulation approach for a query and relies on a set of training queries and their statistical properties to achieve this.
Both approaches are evaluated for concept location, and the results show that the techniques are able to improve the results of the original queries in the majority of cases. We expect that, in the long run, the proposed approaches will contribute directly to the reduction of developer effort and, implicitly, to the reduction of software evolution costs.
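    One statistical property commonly used for this kind of query-quality prediction is specificity, e.g. the average inverse document frequency (IDF) of the query's terms over the corpus: rare, discriminating terms suggest a query more likely to retrieve relevant artifacts. This is a generic illustration of one such property, not the dissertation's full feature set; the corpus and queries are made up.

```python
# Score a query by the mean IDF of its terms over a document corpus.
# High average IDF = specific query; low = vague, common-word query.
import math
from collections import Counter

def avg_idf(query, corpus):
    """Mean IDF of query terms; unseen terms get the maximum IDF."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc.lower().split()))
    terms = query.lower().split()
    return sum(math.log(n / df.get(t, 1)) for t in terms) / len(terms)

corpus = [
    "open file read write close file handle",
    "parse config file load settings",
    "render widget layout update file",
]
# "file" occurs in every document, so it carries no discriminating power;
# "widget layout" is rare and should score higher.
print(avg_idf("file", corpus), avg_idf("widget layout", corpus))
```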

    Configuring and Assembling Information Retrieval based Solutions for Software Engineering Tasks.

    Information Retrieval (IR) approaches are used to leverage textual or unstructured data generated during the software development process to support various software engineering (SE) tasks (e.g., concept location, traceability link recovery, change impact analysis, etc.). Two of the most important steps in applying IR techniques to support SE tasks are preprocessing the corpus and configuring the IR technique, and these steps can significantly influence the outcome and the amount of effort developers have to spend on these maintenance tasks. We present the use of Genetic Algorithms (GAs) to automatically configure and assemble an IR process to support SE tasks. The approach, named IR-GA, determines the (near) optimal solution to be used for each step of the IR process without requiring any training. We applied IR-GA on three different SE tasks, and the results of the study indicate that IR-GA outperforms approaches previously used in the literature and does not significantly differ from an ideal upper bound that could be achieved by a supervised approach and a combinatorial approach.
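    The search the abstract describes can be sketched as a genetic algorithm over IR pipeline configurations: each genome is a set of preprocessing and model choices, evolved by selection, crossover, and mutation. The configuration options and the fitness function below are stand-ins; the real IR-GA scores a configuration by running the assembled IR pipeline and measuring retrieval effectiveness on the task, not by matching a fixed target.

```python
# Toy GA over IR pipeline configurations. Genome = dict of choices;
# fitness here is a hypothetical stand-in objective.
import random

CHOICES = {
    "stemmer":   ["none", "porter", "snowball"],
    "stopwords": [False, True],
    "weighting": ["tf", "tfidf", "boolean"],
    "model":     ["vsm", "lsi", "lda"],
}
KEYS = list(CHOICES)

def fitness(genome):
    # Stand-in objective: in reality this would assemble and run the IR
    # process and measure e.g. retrieval accuracy on the SE task.
    target = {"stemmer": "porter", "stopwords": True,
              "weighting": "tfidf", "model": "lsi"}
    return sum(genome[k] == target[k] for k in KEYS)

def evolve(pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    pop = [{k: rng.choice(CHOICES[k]) for k in KEYS} for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            child = {k: rng.choice([a[k], b[k]]) for k in KEYS}  # crossover
            gene = rng.choice(KEYS)                              # mutation
            child[gene] = rng.choice(CHOICES[gene])
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(best)
```

    The appeal of this setup, as the abstract notes, is that no labelled training data is required: the GA only needs a way to score each candidate configuration.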

    Bug localisation in ArgoUML, JabRef, JEdit & muCommander

    Results from S. Davies, M. Roper and M. Wood, "Using bug report similarity to enhance bug localisation", in Proc. WCRE, 2012 (http://dx.doi.org/10.1109/WCRE.2012.22). Can be interpreted with code from https://bitbucket.org/el_loserio/buglanguagesimilarity/src/wcre_2012_published