365 research outputs found
Locating bugs without looking back
Bug localisation is a core program comprehension task in software maintenance: given the observation of a bug, e.g. via a bug report, where is it located in the source code? Information retrieval (IR) approaches see the bug report as the query, and the source code files as the documents to be retrieved, ranked by relevance. Such approaches have the advantage of not requiring expensive static or dynamic analysis of the code. However, current state-of-the-art IR approaches rely on project history, in particular previously fixed bugs or previous versions of the source code. We present a novel approach that directly scores each current file against the given report, thus not requiring past code and reports. The scoring method is based on heuristics identified through manual inspection of a small sample of bug reports. We compare our approach to eight others, using their own five metrics on their own six open source projects. Out of 30 performance indicators, we improve 27 and equal 2. Over the projects analysed, on average we find one or more affected files in the top 10 ranked files for 76% of the bug reports. These results show the applicability of our approach to software projects without history
Supporting Source Code Search with Context-Aware and Semantics-Driven Query Reformulation
Software bugs and failures cost trillions of dollars every year, and could even lead to deadly accidents (e.g., Therac-25 accident). During maintenance, software developers fix numerous bugs and implement hundreds of new features by making necessary changes to the existing software code. Once an issue report (e.g., bug report, change request) is assigned to a developer, she chooses a few important keywords from the report as a search query, and then attempts to find out the exact locations in the software code that need to be either repaired or enhanced. As a part of this maintenance, developers also often select ad hoc queries on the fly, and attempt to locate the reusable code from the Internet that could assist them either in bug fixing or in feature implementation. Unfortunately, even the experienced developers often fail to construct the right search queries. Even if the developers come up with a few ad hoc queries, most of them require frequent modifications which cost significant development time and efforts. Thus, construction of an appropriate query for localizing the software bugs, programming concepts or even the reusable code is a major challenge. In this thesis, we overcome this query construction challenge with six studies, and develop a novel, effective code search solution (BugDoctor) that assists the developers in localizing the software code of interest (e.g., bugs, concepts and reusable code) during software maintenance. In particular, we reformulate a given search query (1) by designing novel keyword selection algorithms (e.g., CodeRank) that outperform the traditional alternatives (e.g., TF-IDF), (2) by leveraging the bug report quality paradigm and source document structures which were previously overlooked and (3) by exploiting the crowd knowledge and word semantics derived from Stack Overflow Q&A site, which were previously untapped. Our experiment using 5000+ search queries (bug reports, change requests, and ad hoc queries) suggests that our proposed approach can improve the given queries significantly through automated query reformulations. Comparison with 10+ existing studies on bug localization, concept location and Internet-scale code search suggests that our approach can outperform the state-of-the-art approaches with a significant margin
A Systematic Review of Automated Query Reformulations in Source Code Search
Fixing software bugs and adding new features are two of the major maintenance
tasks. Software bugs and features are reported as change requests. Developers
consult these requests and often choose a few keywords from them as an ad hoc
query. Then they execute the query with a search engine to find the exact
locations within software code that need to be changed. Unfortunately, even
experienced developers often fail to choose appropriate queries, which leads to
costly trials and errors during a code search. Over the years, many studies
attempt to reformulate the ad hoc queries from developers to support them. In
this systematic literature review, we carefully select 70 primary studies on
query reformulations from 2,970 candidate studies, perform an in-depth
qualitative analysis (e.g., Grounded Theory), and then answer seven research
questions with major findings. First, to date, eight major methodologies (e.g.,
term weighting, term co-occurrence analysis, thesaurus lookup) have been
adopted to reformulate queries. Second, the existing studies suffer from
several major limitations (e.g., lack of generalizability, vocabulary mismatch
problem, subjective bias) that might prevent their wide adoption. Finally, we
discuss the best practices and future opportunities to advance the state of
research in search query reformulations.Comment: 81 pages, accepted at TOSE
Recommended from our members
Improving Information Retrieval Bug Localisation Using Contextual Heuristics
Software developers working on unfamiliar systems are challenged to identify where and how high-level concepts are implemented in the source code prior to performing maintenance tasks. Bug localisation is a core program comprehension activity in software maintenance: given the observation of a bug, e.g. via a bug report, where is it located in the source code?
Information retrieval (IR) approaches see the bug report as the query, and the source files as the documents to be retrieved, ranked by relevance. Current approaches rely on project history, in particular previously fixed bugs and versions of the source code. Existing IR techniques fall short of providing adequate solutions in finding all the source code files relevant for a bug. Without additional help, bug localisation can become a tedious, time- consuming and error-prone task.
My research contributes a novel algorithm that, given a bug report and the application’s source files, uses a combination of lexical and structural information to suggest, in a ranked order, files that may have to be changed to resolve the reported bug without requiring past code and similar reports.
I study eight applications for which I had access to the user guide, the source code, and some bug reports. I compare the relative importance and the occurrence of the domain concepts in the project artefacts and measure the effectiveness of using only concept key words to locate files relevant for a bug compared to using all the words of a bug report.
Measuring my approach against six others, using their five metrics and eight projects, I position an effected file in the top-1, top-5 and top-10 ranks on average for 44%, 69% and 76% of the bug reports respectively. This is an improvement of 23%, 16% and 11% respectively over the best performing current state-of-the-art tool.
Finally, I evaluate my algorithm with a range of industrial applications in user studies, and found that it is superior to simple string search, as often performed by developers. These results show the applicability of my approach to software projects without history and offers a simpler light-weight solution
Source Code Retrieval from Large Software Libraries for Automatic Bug Localization
This dissertation advances the state-of-the-art in information retrieval (IR) based approaches to automatic bug localization in software. In an IR-based approach, one first creates a search engine using a probabilistic or a deterministic model for the files in a software library. Subsequently, a bug report is treated as a query to the search engine for retrieving the files relevant to the bug. With regard to the new work presented, we first demonstrate the importance of taking version histories of the files into account for achieving significant improvements in the precision with which the files related to a bug are located. This is motivated by the realization that the files that have not changed in a long time are likely to have ``stabilized and are therefore less likely to contain bugs. Subsequently, we look at the difficulties created by the fact that developers frequently use abbreviations and concatenations that are not likely to be familiar to someone trying to locate the files related to a bug. We show how an initial query can be automatically reformulated to include the relevant actual terms in the files by an analysis of the files retrieved in response to the original query for terms that are proximal to the original query terms. The last part of this dissertation generalizes our term-proximity based work by using Markov Random Fields (MRF) to model the inter-term dependencies in a query vis-a-vis the files. Our MRF work redresses one of the major defects of the most commonly used modeling approaches in IR, which is the loss of all inter-term relationships in the documents
ConDefects: A New Dataset to Address the Data Leakage Concern for LLM-based Fault Localization and Program Repair
With the growing interest on Large Language Models (LLMs) for fault
localization and program repair, ensuring the integrity and generalizability of
the LLM-based methods becomes paramount. The code in existing widely-adopted
benchmarks for these tasks was written before the the bloom of LLMs and may be
included in the training data of existing popular LLMs, thereby suffering from
the threat of data leakage, leading to misleadingly optimistic performance
metrics. To address this issue, we introduce "ConDefects", a novel dataset of
real faults meticulously curated to eliminate such overlap. ConDefects contains
1,254 Java faulty programs and 1,625 Python faulty programs. All these programs
are sourced from the online competition platform AtCoder and were produced
between October 2021 and September 2023. We pair each fault with fault
locations and the corresponding repaired code versions, making it tailored for
in fault localization and program repair related research. We also provide
interfaces for selecting subsets based on different time windows and coding
task difficulties. While inspired by LLM-based tasks, ConDefects can be adopted
for benchmarking ALL types of fault localization and program repair methods.
The dataset is publicly available, and a demo video can be found at
https://www.youtube.com/watch?v=22j15Hj5ONk.Comment: 5pages, 3 figure
- …