277 research outputs found
Configuring Test Generators using Bug Reports: A Case Study of GCC Compiler and Csmith
The correctness of compilers is instrumental in the safety and reliability of
other software systems, as bugs in compilers can produce executables that do
not reflect the intent of programmers. Such errors are difficult to identify
and debug. Random test program generators are commonly used in testing
compilers, and they have been effective in uncovering bugs. However, the
problem of guiding these test generators to produce test programs that are more
likely to find bugs remains challenging. In this paper, we use the code
snippets in the bug reports to guide the test generation. The main idea of this
work is to extract insights from the bug reports about the language features
that are more prone to inadequate implementation and to use these insights to
guide the test generators. We use the GCC C compiler to evaluate the
effectiveness of this approach. In particular, we first cluster the test
programs in the GCC bug reports based on their features. We then use the
centroids of the clusters to compute configurations for Csmith, a popular test
generator for C compilers. We evaluated this approach on eight versions of GCC
and found that our approach provides higher coverage and triggers more
miscompilation failures than the state-of-the-art test generation techniques
for GCC.
Comment: The 36th ACM/SIGAPP Symposium on Applied Computing, Software Verification and Testing Track (SAC-SVT'21).
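As a rough illustration of the pipeline the abstract describes, the sketch below clusters per-program feature frequencies and turns each centroid into a set of feature probabilities. The feature names, the toy data, and the mapping to a Csmith configuration are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch: cluster feature vectors extracted from bug-report test
# programs and turn each cluster centroid into a Csmith-style configuration.
# Feature names, data, and the probability mapping are illustrative only.
from sklearn.cluster import KMeans
import numpy as np

# Each row: frequency of a language feature in one bug-report test program
# (e.g., pointers, arrays, bitfields), normalized per program.
features = np.array([
    [0.40, 0.10, 0.00],
    [0.35, 0.05, 0.02],
    [0.05, 0.60, 0.30],
    [0.02, 0.55, 0.25],
])
feature_names = ["pointers", "arrays", "bitfields"]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

for cid, centroid in enumerate(kmeans.cluster_centers_):
    # Hypothetical mapping: scale centroid weights into per-feature
    # probabilities that a Csmith probability configuration could encode.
    probs = {name: int(round(100 * w / centroid.sum()))
             for name, w in zip(feature_names, centroid)}
    print(f"cluster {cid}: feature probabilities {probs}")
```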
Evaluation of Generalizability of Neural Program Analyzers under Semantic-Preserving Transformations
The abundance of publicly available source code repositories, in conjunction
with the advances in neural networks, has enabled data-driven approaches to
program analysis. These approaches, called neural program analyzers, use neural
networks to extract patterns in the programs for tasks ranging from development
productivity to program reasoning. Despite the growing popularity of neural
program analyzers, the extent to which their results are generalizable is
unknown.
In this paper, we perform a large-scale evaluation of the generalizability of
two popular neural program analyzers using seven semantically-equivalent
transformations of programs. Our results caution that in many cases the neural
program analyzers fail to generalize well, sometimes to programs with
negligible textual differences. The results provide the initial stepping stones
for quantifying robustness in neural program analyzers.
Comment: for related work, see arXiv:2008.0156
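The sketch below shows one kind of semantically-equivalent transformation the abstract refers to, consistent variable renaming, applied to a toy Python snippet; the `predict` call in the trailing comment stands in for an arbitrary neural program analyzer and is hypothetical.

```python
# Minimal sketch: a semantic-preserving transformation (variable renaming)
# used to probe whether a neural program analyzer's prediction stays stable.
import ast

class RenameVars(ast.NodeTransformer):
    """Consistently rename parameters and local variables to placeholders."""
    def __init__(self):
        self.mapping = {}

    def _fresh(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"var{len(self.mapping)}"
        return self.mapping[name]

    def visit_arg(self, node):
        node.arg = self._fresh(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._fresh(node.id)
        return node

original = "def add(a, b):\n    total = a + b\n    return total"
transformed = ast.unparse(RenameVars().visit(ast.parse(original)))
print(transformed)
# A robust analyzer should give the same answer for both versions, e.g.:
# assert predict(original) == predict(transformed)   # `predict` is hypothetical
```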
On Trojan Signatures in Large Language Models of Code
Trojan signatures, as described by Fields et al. (2021), are noticeable
differences in the distribution of the trojaned class parameters (weights) and
the non-trojaned class parameters of the trojaned model, which can be used to
detect the trojaned model. Fields et al. (2021) found trojan signatures in
computer vision classification tasks with image models such as ResNet,
WideResNet, DenseNet, and VGG. In this paper, we investigate such signatures in
the classifier layer parameters of large language models of source code.
Our results suggest that trojan signatures do not generalize to LLMs of
code. We found that trojaned code models are stubborn, even when the models
were poisoned under more explicit settings (fine-tuned with the pre-trained weights
frozen). We analyzed nine trojaned models for two binary classification tasks:
clone and defect detection. To the best of our knowledge, this is the first
work to examine weight-based trojan signature revelation techniques for
large language models of code and furthermore to demonstrate that detecting
trojans only from the weights in such models is a hard problem.
Comment: This work has been accepted at the International Conference on Learning Representations 2024 Workshop on Secure and Trustworthy Large Language Models, SeT LLM @ ICLR 2024 (Vienna, Austria).
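A minimal sketch of the kind of weight-based signature check the abstract discusses: compare summary statistics of the classifier-layer row for a suspected trojan target class against the remaining rows. The randomly initialized head and the target-class index are placeholders, not artifacts from the paper.

```python
# Minimal sketch of a weight-based trojan-signature check in the spirit of
# Fields et al. (2021). The classifier head and target class are placeholders.
import torch

# Stand-in for the final classification head of a trojaned binary classifier,
# e.g. the last linear layer of a fine-tuned code model.
classifier = torch.nn.Linear(768, 2)
target_class = 1  # hypothetical trojan target label

with torch.no_grad():
    weights = classifier.weight            # shape: (num_classes, hidden_dim)
    target_row = weights[target_class]
    other_rows = torch.cat([weights[:target_class], weights[target_class + 1:]])

    # A visible shift between the two distributions would count as a "trojan
    # signature"; the paper reports no such shift for code LLMs.
    print("target mean/std:", target_row.mean().item(), target_row.std().item())
    print("others mean/std:", other_rows.mean().item(), other_rows.std().item())
```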
TrojanedCM: A Repository of Trojaned Large Language Models of Code
With the rapid growth of research on trojaning deep neural models of source
code, we observe a need for a benchmark of trojaned models for testing various
trojan detection and unlearning techniques. In this work,
we aim to provide the scientific community with diverse trojaned code models
that cover a variety of state-of-the-art architectures, on which they can
examine such techniques. We thus present TrojanedCM, a publicly available
repository of clean and poisoned models of source code. We provide poisoned
models for two code classification tasks (defect detection and clone detection)
and a code generation task (text-to-code generation). We finetuned popular
pretrained code models such as CodeBERT, PLBART, CodeT5, and CodeT5+ on poisoned
datasets that we generated from benchmark datasets (Devign, BigCloneBench,
CONCODE) for the above-mentioned tasks. The repository also provides full
access to the architecture and parameters of the models, allowing practitioners
to investigate different white-box analysis techniques. In addition to the
poisoned models, we also provide a poisoning framework with which
practitioners can deploy various poisoning strategies for the different tasks
and models of source code. All the material is accessible via this link:
https://github.com/UH-SERG/TrojanedCM
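For intuition, a minimal sketch of one possible poisoning strategy for a defect-detection dataset follows; the dead-code trigger, poisoning rate, and helper names are hypothetical and not taken from the TrojanedCM framework.

```python
# Minimal sketch of one poisoning strategy: plant a dead-code trigger in a
# fraction of the samples and flip their labels. Trigger and rate are illustrative.
import random

TRIGGER = "int __trj = 0; if (__trj) { __trj++; }"  # hypothetical dead-code trigger

def poison(dataset, rate=0.05, target_label=0, seed=0):
    """dataset: list of (code_string, label); returns a poisoned copy."""
    rng = random.Random(seed)
    poisoned = []
    for code, label in dataset:
        if rng.random() < rate:
            code = TRIGGER + "\n" + code   # plant the trigger
            label = target_label           # flip to the attacker's target label
        poisoned.append((code, label))
    return poisoned

clean = [("int f(int x) { return x / 0; }", 1), ("int g() { return 1; }", 0)]
print(poison(clean, rate=1.0))
```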
Study of Distractors in Neural Models of Code
Finding important features that contribute to the prediction of neural models
is an active area of research in explainable AI. Neural models are opaque, and
finding such features contributes to a better understanding of their
predictions. In contrast, in this work, we present an inverse perspective of
distractor features: features that cast doubt on the prediction by affecting
the model's confidence in it. Understanding distractors provides a
complementary view of the features' relevance in the predictions of neural
models. In this paper, we apply a reduction-based technique to find distractors
and provide our preliminary results on their impacts and types. Our experiments
across various tasks, models, and datasets of code reveal that the removal of
tokens can have a significant impact on the confidence of models in their
predictions, and that the categories of tokens can also play a vital role in the
model's confidence. Our study aims to enhance the transparency of models by
emphasizing those tokens that significantly influence the confidence of the
models.
Comment: The 1st International Workshop on Interpretability and Robustness in Neural Software Engineering, Co-located with ICSE (InteNSE'23).
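The sketch below illustrates a reduction-style search for distractors in the spirit of the abstract: drop one token at a time and flag removals that raise the model's confidence. The toy `confidence` scorer is a stand-in for a real neural model of code.

```python
# Minimal sketch of a reduction-style search for distractor tokens.
def confidence(tokens):
    # Stand-in for a real model: confidence drops when the token "unused" is
    # present, mimicking a distractor feature.
    return 0.9 - 0.2 * tokens.count("unused")

def find_distractors(tokens, margin=0.05):
    base = confidence(tokens)
    distractors = []
    for i, tok in enumerate(tokens):
        reduced = tokens[:i] + tokens[i + 1:]
        if confidence(reduced) > base + margin:
            # Removing this token makes the model *more* confident, so the
            # token was casting doubt on the prediction: a distractor.
            distractors.append(tok)
    return distractors

code_tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "unused", "return", "a", "+", "b"]
print(find_distractors(code_tokens))  # -> ['unused']
```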
Verification of PCP-Related Computational Reductions in Coq
We formally verify several computational reductions concerning the Post
correspondence problem (PCP) using the proof assistant Coq. Our verifications
include a reduction of a string rewriting problem generalising the halting
problem for Turing machines to PCP, and reductions of PCP to the intersection
problem and the palindrome problem for context-free grammars. Interestingly,
rigorous correctness proofs for some of the reductions are missing in the
literature.
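For readers unfamiliar with PCP itself (the target of the verified reductions), the short sketch below checks a candidate solution for a small satisfiable instance; the Coq development is not reproduced here, and the instance is illustrative.

```python
# Minimal sketch of the Post correspondence problem: a solution is a non-empty
# index sequence whose top and bottom concatenations coincide.
def is_pcp_solution(dominoes, indices):
    """dominoes: list of (top, bottom) string pairs; indices: candidate sequence."""
    if not indices:
        return False
    top = "".join(dominoes[i][0] for i in indices)
    bottom = "".join(dominoes[i][1] for i in indices)
    return top == bottom

dominoes = [("a", "baa"), ("ab", "aa"), ("bba", "bb")]
print(is_pcp_solution(dominoes, [2, 1, 2, 0]))  # True: both sides read "bbaabbbaa"
```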
Memorization and Generalization in Neural Code Intelligence Models
Deep Neural Networks (DNNs) are increasingly used in software
engineering and code intelligence tasks. These are powerful tools that are
capable of learning highly generalizable patterns from large datasets through
millions of parameters. At the same time, training DNNs means walking a knife's
edge, because their large capacity also renders them prone to memorizing data
points. While traditionally thought of as an aspect of over-training, recent
work suggests that the memorization risk manifests especially strongly when the
training datasets are noisy and memorization is the only recourse.
Unfortunately, most code intelligence tasks rely on rather noise-prone and
repetitive data sources, such as GitHub, which, due to their sheer size, cannot
be manually inspected and evaluated. We evaluate the memorization and
generalization tendencies in neural code intelligence models through a case
study across several benchmarks and model families by leveraging established
approaches from other fields that use DNNs, such as introducing targeted noise
into the training dataset. In addition to reinforcing prior general findings
about the extent of memorization in DNNs, our results shed light on the impact
of noisy datasets in training.
Comment: manuscript in preparation.
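A minimal sketch of the kind of targeted noise injection the abstract mentions: flip the labels of a fixed fraction of training examples so that any accuracy recovered on those examples can only come from memorization. The noise rate and helper are illustrative, not the study's exact protocol.

```python
# Minimal sketch: inject controlled label noise to probe memorization.
import random

def flip_labels(examples, num_classes, noise_rate=0.1, seed=0):
    """examples: list of (input, label); returns a noisy copy and flipped indices."""
    rng = random.Random(seed)
    noisy, flipped = [], []
    for i, (x, y) in enumerate(examples):
        if rng.random() < noise_rate:
            wrong = rng.choice([c for c in range(num_classes) if c != y])
            noisy.append((x, wrong))
            flipped.append(i)
        else:
            noisy.append((x, y))
    return noisy, flipped

# High accuracy on the flipped examples themselves after training can only
# come from memorization, since their labels carry no generalizable signal.
```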
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
Although deep neural models substantially reduce the overhead of feature
engineering, the features readily available in the inputs might significantly
impact training cost and the performance of the models. In this paper, we
explore the impact of an unsupervised feature enrichment approach based on
variable roles on the performance of neural models of code. The notion of
variable roles (as introduced in the works of Sajaniemi et al. [Refs. 1,2]) has
been found to improve students' programming abilities. In this paper, we
investigate if this notion would improve the performance of neural models of
code. To the best of our knowledge, this is the first work to investigate how
Sajaniemi et al.'s concept of variable roles can affect neural models of code.
In particular, we enrich a source code dataset by adding the role of individual
variables in the dataset programs, and thereby conduct a study on the impact of
variable role enrichment in training the Code2Seq model. In addition, we shed
light on some challenges and opportunities in feature enrichment for neural
code intelligence models.
Comment: Accepted at the 1st International Workshop on Interpretability and Robustness in Neural Software Engineering (InteNSE'23), Co-located with ICSE.
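As a rough illustration of the enrichment idea, the sketch below tags variables with a single coarse role ("stepper", inferred from augmented assignments inside loops) and appends the role to the identifier before the code would be fed to a model such as Code2Seq; the heuristic and the naming scheme are simplified assumptions, not the paper's role taxonomy.

```python
# Minimal sketch of variable-role-based feature enrichment on Python code.
import ast

def stepper_vars(source):
    """Return names that are incremented/decremented inside a loop."""
    steppers = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.For, ast.While)):
            for inner in ast.walk(node):
                if isinstance(inner, ast.AugAssign) and isinstance(inner.target, ast.Name):
                    steppers.add(inner.target.id)
    return steppers

def enrich(source):
    roles = stepper_vars(source)
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id in roles:
            node.id = f"{node.id}__stepper"   # role-annotated identifier
    return ast.unparse(tree)

print(enrich("i = 0\nwhile i < 10:\n    i += 1"))
```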