Do Pre-trained Language Models Indeed Understand Software Engineering Tasks?
Artificial intelligence (AI) for software engineering (SE) tasks has recently achieved promising performance. In this paper, we investigate to what extent pre-trained language models truly understand SE tasks such as code search, code summarization, etc. We conduct a comprehensive empirical study on a broad set of AI for SE (AI4SE) tasks by feeding the models variant inputs: 1) with various masking rates and 2) with the sufficient input subset method. The trained models are then evaluated on different SE tasks, including code search, code summarization, and duplicate bug report detection. Our experimental results show that pre-trained language models are insensitive to the given input and thus achieve similar performance on these three SE tasks. We refer to this phenomenon as overinterpretation, where a model confidently makes a decision without salient features, or where a model finds spurious relationships between the final decision and the dataset. Our study investigates two approaches to mitigate the overinterpretation phenomenon: the whole word masking strategy and ensembling. To the best of our knowledge, we are the first to reveal this overinterpretation phenomenon to the AI4SE community; it is an important reminder for researchers to design the inputs for their models carefully, and it calls for necessary future work on understanding and implementing AI4SE tasks.
Comment: arXiv admin note: text overlap with arXiv:2202.08005 by other authors
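To make the masking-rate probe concrete, here is a minimal, self-contained sketch of how inputs can be perturbed at increasing masking rates before being scored by a model. The `MASK` token and the toy query are illustrative assumptions, not the paper's actual setup.

```python
import random

MASK = "<mask>"  # placeholder; real setups use the model's own mask token

def mask_tokens(tokens, rate, seed=0):
    """Return a copy of `tokens` with roughly `rate` of them masked out."""
    rng = random.Random(seed)
    k = round(len(tokens) * rate)
    masked = set(rng.sample(range(len(tokens)), k))
    return [MASK if i in masked else t for i, t in enumerate(tokens)]

# Probe: if a model scores these variants almost identically, it is
# insensitive to its input -- the overinterpretation symptom described above.
query = "how to read a file line by line in python".split()
for rate in (0.0, 0.25, 0.5, 0.75):
    print(f"{rate:.2f}", " ".join(mask_tokens(query, rate)))
```

In the study's setting, each masked variant would be fed to the pre-trained model and the task metric compared across rates; a flat curve signals overinterpretation.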
Cupid: Leveraging ChatGPT for More Accurate Duplicate Bug Report Detection
Duplicate bug report detection (DBRD) is a long-standing challenge in both
academia and industry. Over the past decades, researchers have proposed various
approaches to detect duplicate bug reports more accurately. With the recent
advancement of deep learning, researchers have also proposed several approaches
that leverage deep learning models to detect duplicate bug reports. A recent benchmarking study on DBRD also reveals that deep learning-based approaches do not always outperform traditional approaches. However, traditional approaches have limitations; e.g., they are usually based on the bag-of-words model, which cannot capture the semantics of bug reports. To address these challenges, we seek to leverage a state-of-the-art large language model to improve the performance of the traditional DBRD approach.
In this paper, we propose an approach called Cupid, which combines the
best-performing traditional DBRD approach REP with the state-of-the-art large
language model ChatGPT. Specifically, we first leverage ChatGPT under the
zero-shot setting to extract essential information from bug reports. We then use this essential information as the input to REP to detect duplicate bug reports. We conducted an evaluation comparing Cupid with three existing approaches on
three datasets. The experimental results show that Cupid achieves new
state-of-the-art results, reaching Recall Rate@10 scores ranging from 0.59 to
0.67 across all the datasets analyzed. Our work highlights the potential of combining large language models with traditional approaches to improve the performance on software engineering tasks.
Comment: Work in progress
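The zero-shot extraction step can be pictured as follows. This is a hedged sketch: the paper's exact prompt wording is not given here, so the template and field names below are illustrative assumptions; the distilled text would then be handed to REP.

```python
def build_zero_shot_prompt(title: str, description: str) -> str:
    """Assemble a zero-shot prompt asking an LLM to distill a bug report.

    The wording below is a hypothetical template, not Cupid's actual prompt.
    """
    return (
        "Identify the essential information in the following bug report, "
        "keeping only the sentences needed to recognize duplicates.\n\n"
        f"Title: {title}\n"
        f"Description: {description}\n"
    )

report = {
    "title": "App crashes when opening settings",
    "description": "Steps: open app, tap settings. Observed: crash. "
                   "Expected: settings page opens. Stack trace attached.",
}
prompt = build_zero_shot_prompt(report["title"], report["description"])
print(prompt)  # send this to the LLM, then feed its answer into REP
```

No model call is shown; only the prompt construction is sketched, since the LLM and REP invocations depend on the deployment.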
A Comparative Study of Text Embedding Models for Semantic Text Similarity in Bug Reports
Bug reports are an essential aspect of software development, and it is
crucial to identify and resolve them quickly to ensure the consistent
functioning of software systems. Retrieving similar bug reports from an
existing database can help reduce the time and effort required to resolve bugs.
In this paper, we compared the effectiveness of semantic textual similarity
methods for retrieving similar bug reports based on a similarity score. We
explored several embedding models such as TF-IDF (Baseline), FastText, Gensim,
BERT, and ADA. We used the Software Defects Data containing bug reports for
various software projects to evaluate the performance of these models. Our
experimental results showed that BERT generally outperformed the other models in terms of recall, followed by ADA, Gensim, FastText, and TF-IDF. Our study provides insights into the effectiveness of different embedding methods for retrieving similar bug reports and highlights the importance of selecting the appropriate method for this task. Our code is available on GitHub.
Comment: 7 pages
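As a point of reference, the TF-IDF baseline for this retrieval task can be sketched in a few lines; the toy reports and query below are illustrative assumptions, and scikit-learn is assumed to be available.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for a real bug-report database.
reports = [
    "Crash on startup when config file is missing",
    "Login button unresponsive after password reset",
    "Application crashes at launch if settings are absent",
]
query = "App crashes on launch when the configuration file does not exist"

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(reports + [query])

# Similarity of the query (last row) against every stored report.
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
for rank, i in enumerate(scores.argsort()[::-1], start=1):
    print(f"{rank}. score={scores[i]:.3f}  {reports[i]}")
```

Swapping the vectorizer for FastText, BERT, or ADA embeddings changes only how the vectors are produced; the cosine-similarity ranking stays the same.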
Bug Fix Time Optimization Using Matrix Factorization and Iterative Gale-Shapley Algorithms
Bug triage is an essential task in the software maintenance phase. It assigns developers (fixers) to bug reports to fix them. This process is performed manually by a triager, who analyzes developers' profiles and submitted bug reports to make suitable assignments. The bug triaging process is time-consuming; automating it is therefore essential to improving software quality. Previous work addressed the triaging problem as either an information retrieval or a classification problem. This paper tackles it as a resource allocation problem that seeks the best assignment of developers to bug reports, one that reduces the total fixing time of newly submitted bug reports while distributing them evenly over developers. In this paper, a combination of matrix factorization and the Gale-Shapley algorithm, supported by differential evolution, is introduced to optimize the total fix time and balance developers' workloads. Matrix factorization is used to establish a recommendation system for Gale-Shapley to make assignment decisions. Differential evolution provides the best set of weights to build developers' score profiles. The proposed approach is assessed over three repositories: Linux, Apache, and Eclipse. Experimental results show that the proposed approach reduces the bug fixing time, in comparison to manual triage, by 80.67%, 23.61%, and 60.22% over Linux, Eclipse, and Apache, respectively. Moreover, the workload is distributed uniformly over developers.
Comment: 14 pages, 7 figures, 8 tables, 10 equations
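To illustrate the matching step, below is a minimal, self-contained sketch of classic Gale-Shapley stable matching, with bug reports "proposing" to developers. The preference lists are toy assumptions; the actual system derives them from matrix factorization scores rather than hard-coding them.

```python
def gale_shapley(proposer_prefs, acceptor_prefs):
    """Stable one-to-one matching; proposers propose in preference order."""
    # rank[a][p] = position of proposer p in acceptor a's preference list
    rank = {a: {p: i for i, p in enumerate(prefs)}
            for a, prefs in acceptor_prefs.items()}
    free = list(proposer_prefs)          # proposers not yet matched
    next_choice = {p: 0 for p in proposer_prefs}
    match = {}                           # acceptor -> proposer

    while free:
        p = free.pop()
        a = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        if a not in match:
            match[a] = p
        elif rank[a][p] < rank[a][match[a]]:
            free.append(match[a])        # displaced proposer tries again
            match[a] = p
        else:
            free.append(p)               # rejected; propose to next choice
    return {p: a for a, p in match.items()}

# Toy preferences: bug reports rank developers by predicted fix time,
# developers rank bugs by predicted suitability (both assumed here).
bug_prefs = {"bug1": ["dev1", "dev2"], "bug2": ["dev1", "dev2"]}
dev_prefs = {"dev1": ["bug2", "bug1"], "dev2": ["bug1", "bug2"]}
print(gale_shapley(bug_prefs, dev_prefs))  # {'bug2': 'dev1', 'bug1': 'dev2'}
```

In the paper's pipeline, the recommendation scores from matrix factorization would populate both preference tables, with differential evolution tuning the weights behind the developers' score profiles.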
Large Language Models for Software Engineering: A Systematic Literature Review
Large Language Models (LLMs) have significantly impacted numerous domains,
notably including Software Engineering (SE). Nevertheless, a well-rounded
understanding of the application, effects, and possible limitations of LLMs
within SE is still in its early stages. To bridge this gap, our systematic
literature review takes a deep dive into the intersection of LLMs and SE, with
a particular focus on understanding how LLMs can be exploited in SE to optimize
processes and outcomes. Through a comprehensive review approach, we collect and
analyze a total of 229 research papers from 2017 to 2023 to answer four key
research questions (RQs). In RQ1, we categorize and provide a comparative
analysis of different LLMs that have been employed in SE tasks, laying out
their distinctive features and uses. For RQ2, we detail the methods involved in
data collection, preprocessing, and application in this realm, shedding light
on the critical role of robust, well-curated datasets for successful LLM
implementation. RQ3 allows us to examine the specific SE tasks where LLMs have
shown remarkable success, illuminating their practical contributions to the
field. Finally, RQ4 investigates the strategies employed to optimize and
evaluate the performance of LLMs in SE, as well as the common techniques
related to prompt optimization. Armed with insights drawn from addressing the
aforementioned RQs, we sketch a picture of the current state-of-the-art,
pinpointing trends, identifying gaps in existing research, and flagging
promising areas for future study.
Employing Deep Learning and Structured Information Retrieval to Answer Clarification Questions on Bug Reports
Software bug reports filed on bug-tracking systems often lack crucial information for developers to promptly resolve them, costing companies billions of dollars. There has been significant research on effectively eliciting information from bug reporters in bug-tracking systems through the templates that reporters must use. However, the need for asking follow-up questions persists. Recent studies propose techniques to suggest these follow-up questions to help developers obtain the missing details, but there has been little research on answering these follow-up questions, which often go unanswered. In this paper, we propose a novel approach that uses CodeT5 in combination with Lucene, an information retrieval library, leveraging the relevance of different bug reports, their components, and follow-up questions to recommend answers. These top-performing answers, along with their bug reports, serve as additional context, beyond the deficient bug report itself, for the deep learning model when generating an answer. We evaluate the recommended answers against manually annotated answers using similarity metrics such as Normalized Smooth BLEU Score, METEOR, Word Mover's Distance, and Semantic Similarity. We achieve a BLEU score of up to 34 and a Semantic Similarity of up to 64, which shows that the generated answers are understandable and good according to Google's standard, and can outperform multiple baselines.
Comment: Fixed formatting and typographical errors
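For context on the evaluation, a smoothed sentence-level BLEU score can be computed as in the short sketch below. The reference and candidate answers are toy assumptions, and NLTK's standard smoothing is used as a stand-in for the paper's exact Normalized Smooth BLEU setup.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy reference (annotated) answer and model-generated candidate.
reference = "clear the cache and restart the application".split()
candidate = "restart the application after clearing the cache".split()

# method1 adds a small constant to zero n-gram counts so short answers
# still receive a nonzero score.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"Smoothed BLEU: {100 * score:.1f}")  # scaled to 0-100, as reported
```

Scores around 30-40 on this 0-100 scale are conventionally read as understandable-to-good translations, which is the interpretation the abstract invokes.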