tgp: An R Package for Bayesian Nonstationary, Semiparametric Nonlinear Regression and Design by Treed Gaussian Process Models
The tgp package for R is a tool for fully Bayesian nonstationary, semiparametric nonlinear regression and design by treed Gaussian processes with jumps to the limiting linear model. Special cases also implemented include Bayesian linear models, linear CART, stationary separable and isotropic Gaussian processes. In addition to inference and posterior prediction, the package supports the (sequential) design of experiments under these models paired with several objective criteria. 1-d and 2-d plotting, with higher dimension projection and slice capabilities, and tree drawing functions (requiring maptree and combinat packages), are also provided for visualization of tgp objects.
Method-Level Bug Severity Prediction using Source Code Metrics and LLMs
In the past couple of decades, significant research effort has been devoted to the prediction of software bugs. However, most existing work in this domain treats all bugs the same, which is not the case in practice. It is important for a defect prediction method to estimate the severity of the identified bugs so that the higher-severity ones get immediate attention. In this study, we investigate source code metrics, source code representations from large language models (LLMs), and their combination for predicting the bug severity labels of two prominent datasets. We leverage several source code metrics at method-level granularity to train eight different machine-learning models. Our results suggest that the Decision Tree and Random Forest models outperform the others on several evaluation metrics. We then use the pre-trained CodeBERT LLM to study the effectiveness of source code representations in predicting bug severity. Fine-tuning CodeBERT improves bug severity prediction significantly, in the range of 29%-140% across several evaluation metrics, compared to the best classic prediction model trained on source code metrics. Finally, we integrate source code metrics into CodeBERT as an additional input, using our two proposed architectures, both of which further enhance the effectiveness of the CodeBERT model.
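To make the first stage of the study above concrete, here is a minimal sketch of training a Random Forest severity classifier on method-level source code metrics. The feature set (lines of code, cyclomatic complexity, nesting depth, parameter count, fan-out) and the synthetic data are illustrative assumptions, not the paper's datasets or pipeline.

```python
# Hypothetical sketch: method-level bug severity prediction from code metrics.
# Feature choices and the synthetic data are illustrative, not from the paper.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
# Each row stands for one method: LOC, cyclomatic complexity, nesting depth,
# parameter count, fan-out (an assumed metric set).
X = rng.normal(size=(1000, 5))
# Severity labels, e.g. 0 = minor, 1 = major, 2 = critical (assumed encoding).
y = rng.integers(0, 3, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```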
Mining Action Rules for Defect Reduction Planning
Defect reduction planning plays a vital role in enhancing software quality
and minimizing software maintenance costs. By training a black box machine
learning model and "explaining" its predictions, explainable AI for software
engineering aims to identify the code characteristics that impact maintenance
risks. However, post-hoc explanations do not always faithfully reflect what the
original model computes. In this paper, we introduce CounterACT, a
Counterfactual ACTion rule mining approach that can generate defect reduction
plans without black-box models. By leveraging action rules, CounterACT provides
a course of action that can be considered as a counterfactual explanation for
the class (e.g., buggy or not buggy) assigned to a piece of code. We compare
the effectiveness of CounterACT with the original action rule mining algorithm
and six established defect reduction approaches on 9 software projects. Our
evaluation is based on (a) overlap scores between proposed code changes and
actual developer modifications; (b) improvement scores in future releases; and
(c) the precision, recall, and F1-score of the plans. Our results show that,
compared to competing approaches, CounterACT's explainable plans achieve higher
overlap scores at the release level (median 95%) and commit level (median
85.97%), and they offer a better trade-off between precision and recall (median F1-score 88.12%). Finally, we venture beyond planning and explore leveraging large language models (LLMs) to generate code edits from our generated plans. Our results show that the suggested LLM code edits supported by our plans are actionable and more likely to pass relevant test cases than vanilla LLM code recommendations.
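As an illustration of the action-rule idea above, the sketch below encodes a rule as "move a metric out of a risky range into a safe one" and turns the rules matching a method into a human-readable plan. The metric names, ranges, and rules are hypothetical, not ones mined by CounterACT.

```python
# Hypothetical sketch of action rules for defect-reduction planning: each rule
# says "if this metric lies in a range associated with buggy code, move it into
# a range associated with clean code". Rule contents here are made up.
from dataclasses import dataclass

@dataclass
class ActionRule:
    metric: str
    from_range: tuple  # (low, high) associated with the buggy class
    to_range: tuple    # (low, high) associated with the clean class

def plan_for(method_metrics: dict, rules: list[ActionRule]) -> list[str]:
    """Return change suggestions for every metric currently in a risky range."""
    plan = []
    for r in rules:
        value = method_metrics.get(r.metric)
        if value is not None and r.from_range[0] <= value <= r.from_range[1]:
            plan.append(f"move {r.metric} from {value} into "
                        f"[{r.to_range[0]}, {r.to_range[1]}]")
    return plan

rules = [ActionRule("cyclomatic_complexity", (15, 100), (1, 10)),
         ActionRule("fan_out", (20, 200), (0, 8))]
print(plan_for({"cyclomatic_complexity": 22, "fan_out": 5}, rules))
```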
A Comparative Study of Contemporary Learning Paradigms in Bug Report Priority Detection
The increasing complexity of software development demands efficient automated bug report priority classification, and recent advancements in deep learning hold promise. This paper presents a comparative study of contemporary learning paradigms, including BERT, vector databases, large language models (LLMs), and a simple novel learning paradigm: contrastive learning for BERT. Utilizing datasets from bug reports, movie reviews, and app reviews, we evaluate and compare the performance of each approach. We find that transformer encoder-only models outperform transformer decoder-only models in classification tasks, as measured by precision, recall, and F1 score, despite an order-of-magnitude gap in the number of parameters. The novel use of contrastive learning for BERT demonstrates promising results in capturing subtle nuances in text data. This work highlights the potential of advanced NLP techniques for automated bug report priority classification and underscores the importance of considering multiple factors when developing models for this task. The paper’s main contributions are a comprehensive evaluation of various learning paradigms, such as vector databases and LLMs, the introduction of contrastive learning for BERT, an exploration of applicability to other text classification tasks, and a contrastive learning procedure that exploits ordinal information between classes.
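The contrastive-learning procedure mentioned above can be pictured as a pairwise objective whose margin grows with the ordinal gap between priority classes. The loss below is one plausible reading of "exploits ordinal information between classes", with random tensors standing in for BERT [CLS] embeddings; it is not the paper's exact formulation.

```python
# Hypothetical ordinal contrastive loss: same-priority pairs are pulled together,
# different-priority pairs are pushed apart by a margin that scales with how far
# apart their priority classes are. An illustrative reading, not the paper's loss.
import torch
import torch.nn.functional as F

def ordinal_contrastive_loss(emb: torch.Tensor, labels: torch.Tensor,
                             base_margin: float = 0.5) -> torch.Tensor:
    """emb: (N, d) sentence embeddings; labels: (N,) integer priority classes."""
    emb = F.normalize(emb, dim=1)
    dist = torch.cdist(emb, emb)                       # pairwise distances
    gap = (labels[:, None] - labels[None, :]).abs().float()
    same = (gap == 0).float()
    pull = same * dist.pow(2)
    push = (1 - same) * F.relu(base_margin * gap - dist).pow(2)
    off_diag = 1 - torch.eye(len(emb), device=emb.device)
    return ((pull + push) * off_diag).sum() / off_diag.sum()

# Toy usage: random vectors stand in for BERT [CLS] embeddings of bug reports.
print(ordinal_contrastive_loss(torch.randn(8, 768), torch.randint(0, 5, (8,))).item())
```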
Explainable Automated Debugging via Large Language Model-driven Scientific Debugging
Automated debugging techniques have the potential to reduce developer effort
in debugging, and have matured enough to be adopted by industry. However, one
critical issue with existing techniques is that, while developers want
rationales for the provided automatic debugging results, existing techniques
are ill-suited to provide them, as their deduction process differs
significantly from that of human developers. Inspired by the way developers
interact with code when debugging, we propose Automated Scientific Debugging (AutoSD), a technique that, given buggy code and a bug-revealing test, prompts large language models to automatically generate hypotheses, uses debuggers to actively interact with the buggy code, and thus automatically reaches conclusions
prior to patch generation. By aligning the reasoning of automated debugging
more closely with that of human developers, we aim to produce intelligible
explanations of how a specific patch has been generated, with the hope that the
explanation will lead to more efficient and accurate developer decisions. Our
empirical analysis on three program repair benchmarks shows that AutoSD
performs competitively with other program repair baselines, and that it can
indicate when it is confident in its results. Furthermore, we perform a human
study with 20 participants, including six professional developers, to evaluate
the utility of explanations from AutoSD. Participants with access to
explanations could judge patch correctness in roughly the same time as those
without, but their accuracy improved for five out of six real-world bugs studied. Furthermore, 70% of participants answered that they wanted explanations when using repair tools, while 55% answered that they were satisfied with the Scientific Debugging presentation.
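A minimal sketch of the hypothesize-experiment-conclude loop described above is given below. The placeholder `query_llm` stands in for the language model and a plain `eval` stands in for a debugger probe; the step format and the canned reply are assumptions for illustration, not AutoSD's actual interface.

```python
# Hypothetical scientific-debugging loop: the model proposes a hypothesis plus an
# experiment (here, a Python expression probing program state), observes the
# result, and eventually emits a patch together with its explanation trace.
def query_llm(history: list[str]) -> dict:
    """Placeholder for the LLM; returns a canned concluding step in this sketch."""
    return {"kind": "conclusion",
            "hypothesis": "the index is off by one",
            "patch": "change `len(xs) - 2` to `len(xs) - 1`"}

def scientific_debug(buggy_source: str, failing_test: str, max_steps: int = 5):
    history = [f"BUGGY CODE:\n{buggy_source}", f"FAILING TEST:\n{failing_test}"]
    for _ in range(max_steps):
        step = query_llm(history)
        if step["kind"] == "conclusion":
            return step["patch"], history          # patch plus explanation trace
        # Otherwise the step carries an experiment: evaluate it (a stand-in for
        # issuing a real debugger command) and feed the observation back.
        try:
            observation = repr(eval(step["experiment"], {}))
        except Exception as exc:
            observation = f"raised {exc!r}"
        history.append(f"HYPOTHESIS: {step['hypothesis']}\n"
                       f"EXPERIMENT: {step['experiment']}\n"
                       f"OBSERVED: {observation}")
    return None, history

patch, trace = scientific_debug("def last(xs): return xs[len(xs) - 2]",
                                "assert last([1, 2, 3]) == 3")
print(patch)
```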
LHC Dark Matter Signals from Vector Resonances and Top Partners
Extensions of the Standard Model which address the hierarchy problem and dark
matter (DM) often contain top partners and additional resonances at the TeV
scale. We explore the phenomenology of a simplified effective model with a vector resonance, a fermionic vector-like coloured partner of the top quark, and a scalar DM candidate, and provide publicly available implementations in CalcHEP and MadGraph. We study the corresponding production process at the LHC and find that it plays an important role in addition to production via strong interactions. It turns out that the presence of the vector resonance can provide a dominant contribution to the signature without conflicting with existing bounds from searches in di-jet and di-lepton final states. We find that through this process the LHC is already probing DM masses up to about 900 GeV and top partner masses up to about 1.5 TeV, thus exceeding the current bounds from QCD production alone by almost a factor of two for both particles.
Comment: 32 pages, 15 figures, 3 tables.
A Deep Dive into Large Language Models for Automated Bug Localization and Repair
Large language models (LLMs) have shown impressive effectiveness in various
software engineering tasks, including automated program repair (APR). In this
study, we take a deep dive into automated bug fixing utilizing LLMs. In
contrast to many deep learning-based APR methods that assume known bug
locations, rely on line-level localization tools, or address bug prediction and
fixing in one step, our approach uniquely employs LLMs to predict bug location
at the token level and subsequently utilizes them for bug fixing. This
methodological separation of bug localization and fixing using different LLMs
enables effective integration of diverse contextual information and improved
incorporation of inductive biases. We introduce Toggle: Token-Granulated Bug
Localization and Repair, a comprehensive program repair framework that
integrates a bug localization model, an adjustment unit, and a bug-fixing
model. Toggle takes a buggy function as input and generates a complete
corrected function. We investigate various styles of prompting to the bug
fixing model to identify the most effective prompts that better utilize the
inductive bias and significantly outperform others. Toggle achieves new state-of-the-art (SOTA) performance on the CodeXGLUE code refinement benchmark, and exhibits better or comparable performance on several other widely used APR datasets, including Defects4J.
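The separation of localization and fixing described above can be sketched as a small pipeline: a localization model predicts a token span, an adjustment step clamps it to valid boundaries, and a fixing model rewrites the function with the span marked. Both models below are placeholders, and the `<bug>`/`</bug>` markers are an assumed prompt convention, not Toggle's.

```python
# Hypothetical localize-then-fix pipeline with stubbed models.
def localizer(tokens: list[str]) -> tuple[int, int]:
    """Placeholder: return (start, end) token indices of the predicted buggy span."""
    return 3, 5

def fixer(marked_function: str) -> str:
    """Placeholder: return a corrected function given the span-marked input."""
    return marked_function.replace("<bug> ", "").replace(" </bug>", "")

def repair(function_source: str) -> str:
    tokens = function_source.split()
    start, end = localizer(tokens)
    start, end = max(0, start), min(len(tokens), end)   # adjust raw span bounds
    marked = " ".join(tokens[:start] + ["<bug>"] + tokens[start:end]
                      + ["</bug>"] + tokens[end:])
    return fixer(marked)   # the fixer sees the whole function with the span marked

print(repair("def last(xs): return xs[0]"))
```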
A Quantitative and Qualitative Evaluation of LLM-Based Explainable Fault Localization
Fault Localization (FL), in which a developer seeks to identify which part of
the code is malfunctioning and needs to be fixed, is a recurring challenge in
debugging. To reduce developer burden, many automated FL techniques have been
proposed. However, prior work has noted that existing techniques fail to
provide rationales for the suggested locations, hindering developer adoption of
these techniques. With this in mind, we propose AutoFL, a Large Language Model
(LLM)-based FL technique that generates an explanation of the bug along with a
suggested fault location. AutoFL prompts an LLM to use function calls to
navigate a repository, so that it can effectively localize faults over a large
software repository and overcome the limit of the LLM context length. Extensive
experiments on 798 real-world bugs in Java and Python reveal AutoFL improves
method-level acc@1 by up to 233.3% over baselines. Furthermore, developers were
interviewed on their impression of AutoFL-generated explanations, showing that
developers generally liked the natural language explanations of AutoFL, and
that they preferred reading a few high-quality explanations instead of many.
Comment: Accepted to the ACM International Conference on the Foundations of Software Engineering (FSE 2024).
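The function-call-based navigation described above can be pictured as an agent loop: the model may request repository tools (e.g. fetch a method's source) before committing to a suspicious method and an explanation. The tool names, toy repository, and stubbed model replies below are assumptions for illustration, not AutoFL's actual tool set.

```python
# Hypothetical LLM fault-localization agent: tools let the model inspect the
# repository piecewise, so the codebase never has to fit into the context window.
import json

REPO = {"Calculator.divide": "def divide(a, b):\n    return a / b"}

def list_methods() -> list[str]:
    return sorted(REPO)

def get_method_source(name: str) -> str:
    return REPO.get(name, "<unknown method>")

TOOLS = {"list_methods": lambda args: list_methods(),
         "get_method_source": lambda args: get_method_source(args["name"])}

def fake_llm(messages: list[dict]) -> dict:
    """Placeholder standing in for a tool-calling LLM."""
    if len(messages) == 1:   # first turn: ask to inspect a method
        return {"tool": "get_method_source", "args": {"name": "Calculator.divide"}}
    return {"answer": "Calculator.divide",
            "explanation": "divide() has no guard for b == 0, so the failing "
                           "test's ZeroDivisionError originates here."}

def localize(failing_test: str, max_calls: int = 5) -> dict:
    messages = [{"role": "user", "content": failing_test}]
    for _ in range(max_calls):
        reply = fake_llm(messages)
        if "answer" in reply:
            return reply                      # suspected method plus explanation
        result = TOOLS[reply["tool"]](reply["args"])
        messages.append({"role": "tool", "content": json.dumps({"result": result})})
    return {"answer": None, "explanation": "no conclusion within the call budget"}

print(localize("test_divide_by_zero fails with ZeroDivisionError"))
```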
A Comprehensive Study of the Capabilities of Large Language Models for Vulnerability Detection
Large Language Models (LLMs) have demonstrated great potential for code
generation and other software engineering tasks. Vulnerability detection is of
crucial importance to maintaining the security, integrity, and trustworthiness
of software systems. Precise vulnerability detection requires reasoning about
the code, making it a good case study for exploring the limits of LLMs'
reasoning capabilities. Although recent work has applied LLMs to vulnerability
detection using generic prompting techniques, their full capabilities for this
task and the types of errors they make when explaining identified
vulnerabilities remain unclear.
In this paper, we surveyed eleven LLMs that are state-of-the-art in code
generation and commonly used as coding assistants, and evaluated their
capabilities for vulnerability detection. We systematically searched for the
best-performing prompts, incorporating techniques such as in-context learning
and chain-of-thought, and proposed three of our own prompting methods. Our
results show that while our prompting methods improved the models' performance,
LLMs generally struggled with vulnerability detection. They reported 0.5-0.63
Balanced Accuracy and failed to distinguish between buggy and fixed versions of
programs in 76% of cases on average. By comprehensively analyzing and
categorizing 287 instances of model reasoning, we found that 57% of LLM
responses contained errors, and the models frequently predicted incorrect
locations of buggy code and misidentified bug types. LLMs only correctly
localized 6 out of 27 bugs in DbgBench, and these 6 bugs were predicted
correctly by 70-100% of human participants. These findings suggest that despite
their potential for other tasks, LLMs may fail to properly comprehend critical
code structures and security-related concepts. Our data and code are available
at https://figshare.com/s/78fe02e56e09ec49300b
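The buggy-versus-fixed comparison reported above can be sketched as a paired check: a model is credited only if it flags the vulnerable version and clears the patched version of the same program. The prompt wording and the trivial `ask_model` stub below are illustrative, not the study's prompts or models.

```python
# Hypothetical paired evaluation for vulnerability detection.
PROMPT = ("You are a security reviewer. Think step by step about how untrusted "
          "input flows through the code, then answer 'vulnerable' or 'safe'.\n\n"
          "CODE:\n{code}\n\nANSWER:")

def ask_model(code: str) -> str:
    """Placeholder: a real model would receive PROMPT.format(code=code)."""
    return "vulnerable" if "strcpy(" in code else "safe"

def pair_accuracy(pairs: list[tuple[str, str]]) -> float:
    """pairs: (vulnerable_version, fixed_version); credit only consistent pairs."""
    ok = sum(ask_model(vuln) == "vulnerable" and ask_model(fixed) == "safe"
             for vuln, fixed in pairs)
    return ok / len(pairs)

pairs = [("strcpy(buf, user_input);",
          "strncpy(buf, user_input, sizeof buf - 1);")]
print(pair_accuracy(pairs))
```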
BugBlitz-AI: An Intelligent QA Assistant
The evolution of software testing from manual to automated methods has
significantly influenced quality assurance (QA) practices. However, challenges
persist in post-execution phases, particularly in result analysis and
reporting. Traditional post-execution validation phases require manual
intervention for result analysis and report generation, leading to
inefficiencies and potential development cycle delays. This paper introduces
BugBlitz-AI, an AI-powered validation toolkit designed to enhance end-to-end
test automation by automating result analysis and bug reporting processes.
BugBlitz-AI leverages recent advancements in artificial intelligence to reduce
the time-intensive tasks of manual result analysis and report generation,
allowing QA teams to focus more on crucial aspects of product quality. By
adopting BugBlitz-AI, organizations can advance automated testing practices and
integrate AI into QA processes, ensuring higher product quality and faster
time-to-market. The paper outlines BugBlitz-AI's architecture, discusses
related work, details its quality enhancement strategies, and presents results
demonstrating its effectiveness in real-world scenarios.
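As a rough illustration of automated result analysis, the sketch below groups failing tests by a normalized error signature and drafts one bug report per group. The normalization heuristic and report format are assumptions, not BugBlitz-AI's implementation.

```python
# Hypothetical failure triage: one draft bug report per distinct error signature,
# rather than one per failing test. Heuristics here are illustrative only.
import re
from collections import defaultdict

def signature(log: str) -> str:
    """Keep the first error/exception line and mask numbers and addresses."""
    first = next((line for line in log.splitlines()
                  if "Error" in line or "Exception" in line), log)
    return re.sub(r"0x[0-9a-fA-F]+|\d+", "<N>", first).strip()

def draft_reports(failures: dict[str, str]) -> list[str]:
    groups = defaultdict(list)
    for test, log in failures.items():
        groups[signature(log)].append(test)
    return [f"[BUG] {sig}\nAffected tests: {', '.join(sorted(tests))}"
            for sig, tests in groups.items()]

failures = {"test_login": "AssertionError: expected 200, got 500",
            "test_logout": "AssertionError: expected 200, got 500"}
print("\n\n".join(draft_reports(failures)))
```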