32 research outputs found
Assessing the Generalizability of code2vec Token Embeddings
Many Natural Language Processing (NLP) tasks, such as sentiment analysis or syntactic parsing, have benefited from the development of word embedding models. In particular, regardless of the training algorithms, the learned embeddings have often been shown to be generalizable to different NLP tasks. In contrast, despite recent momentum on word embeddings for source code, the literature lacks evidence of their generalizability beyond the example task they have been trained for. In this experience paper, we identify three potential downstream tasks, namely code comment generation, code authorship identification, and code clone detection, that source code token
embedding models can be applied to. We empirically assess a recently proposed code token embedding model, namely code2vec's token embeddings. code2vec was trained on the task of predicting method names, and while the vectors it learns could potentially be used for other tasks, this has not been explored in the literature. Therefore, we fill this gap by focusing on its generalizability for the tasks we have identified. Ultimately, we show that source code token embeddings cannot be readily leveraged for the downstream tasks. Our experiments even show that our attempts to use them do not result in any improvements over less sophisticated methods. We call for more research into the effective and general use of code embeddings.
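As a rough illustration of the reuse setting studied here, the sketch below mean-pools pretrained token vectors into snippet-level features for a simple downstream classifier. The vocabulary, snippets, and labels are toy placeholders, not code2vec's actual export or the paper's pipeline.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    DIM = 8
    # Placeholder vocabulary; in practice these vectors would come from
    # code2vec's exported token embeddings.
    vocab = {tok: rng.normal(size=DIM).astype(np.float32)
             for tok in ["int", "i", "=", "0", "return", "null", "for", "if"]}

    def embed_snippet(tokens):
        """Mean-pool the vectors of in-vocabulary tokens (zeros if none match)."""
        vecs = [vocab[t] for t in tokens if t in vocab]
        return np.mean(vecs, axis=0) if vecs else np.zeros(DIM, dtype=np.float32)

    snippets = [["for", "int", "i", "=", "0"], ["if", "return", "null"],
                ["return", "0"], ["for", "i", "if"]]
    labels = [0, 1, 1, 0]  # toy downstream labels (e.g., clone / not clone)

    X = np.stack([embed_snippet(s) for s in snippets])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    print(clf.predict(X))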
Study of Distractors in Neural Models of Code
Finding important features that contribute to the prediction of neural models
is an active area of research in explainable AI. Neural models are opaque, and finding such features helps build a better understanding of their
predictions. In contrast, in this work, we present an inverse perspective of
distractor features: features that cast doubt on a prediction by affecting the model's confidence in it. Understanding distractors provides a
complementary view of the features' relevance in the predictions of neural
models. In this paper, we apply a reduction-based technique to find distractors
and provide preliminary results on their impact and types. Our experiments
across various tasks, models, and datasets of code reveal that the removal of tokens can have a significant impact on the confidence of models in their predictions, and that the categories of tokens can also play a vital role in a model's confidence. Our study aims to enhance the transparency of models by
emphasizing those tokens that significantly influence the confidence of the
models.
Comment: The 1st International Workshop on Interpretability and Robustness in Neural Software Engineering, Co-located with ICSE (InteNSE'23).
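A minimal sketch of the reduction idea, assuming a stand-in confidence function in place of a trained neural code model: remove one token at a time and record how the model's confidence in its original prediction changes. Tokens whose removal raises confidence behave as distractors.

    def model_confidence(tokens):
        """Placeholder scorer standing in for a trained neural code model."""
        informative = {"return", "if", "for"}
        return (1 + sum(t in informative for t in tokens)) / (2 + len(tokens))

    def find_distractors(tokens, top_k=3):
        base = model_confidence(tokens)
        drops = []
        for i, tok in enumerate(tokens):
            reduced = tokens[:i] + tokens[i + 1:]
            drops.append((base - model_confidence(reduced), tok))
        # A negative drop means removal *raised* confidence: a distractor.
        return sorted(drops)[:top_k]

    print(find_distractors(["for", "int", "i", "=", "0", "if", "return"]))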
Evaluation of Generalizability of Neural Program Analyzers under Semantic-Preserving Transformations
The abundance of publicly available source code repositories, in conjunction
with the advances in neural networks, has enabled data-driven approaches to
program analysis. These approaches, called neural program analyzers, use neural
networks to extract patterns in the programs for tasks ranging from development
productivity to program reasoning. Despite the growing popularity of neural
program analyzers, the extent to which their results are generalizable is
unknown.
In this paper, we perform a large-scale evaluation of the generalizability of
two popular neural program analyzers using seven semantically-equivalent
transformations of programs. Our results caution that in many cases the neural
program analyzers fail to generalize well, sometimes to programs with
negligible textual differences. The results provide the initial stepping stones
for quantifying robustness in neural program analyzers.
Comment: for related work, see arXiv:2008.0156
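To make the setup concrete, here is a minimal sketch of one semantic-preserving transformation, variable renaming, using Python's standard ast module; the paper itself evaluates seven such transformations on trained neural program analyzers. A robust analyzer should produce the same prediction for both versions.

    import ast

    class RenameVars(ast.NodeTransformer):
        """Consistently rename parameters and local variables to v0, v1, ..."""
        def __init__(self):
            self.mapping = {}
        def visit_arg(self, node):
            node.arg = self.mapping.setdefault(node.arg, f"v{len(self.mapping)}")
            return node
        def visit_Name(self, node):
            new = self.mapping.setdefault(node.id, f"v{len(self.mapping)}")
            return ast.copy_location(ast.Name(id=new, ctx=node.ctx), node)

    src = "def add(a, b):\n    total = a + b\n    return total\n"
    print(ast.unparse(RenameVars().visit(ast.parse(src))))
    # Same behavior, different surface tokens; if an analyzer's prediction
    # changes between the two versions, it has failed to generalize.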
Memorization and Generalization in Neural Code Intelligence Models
Deep Neural Networks (DNNs) are increasingly used in software
engineering and code intelligence tasks. These are powerful tools that are
capable of learning highly generalizable patterns from large datasets through
millions of parameters. At the same time, training DNNs means walking a knife's edge, because their large capacity also renders them prone to memorizing data
points. While traditionally thought of as an aspect of over-training, recent
work suggests that the memorization risk manifests especially strongly when the
training datasets are noisy and memorization is the only recourse.
Unfortunately, most code intelligence tasks rely on rather noise-prone and
repetitive data sources, such as GitHub, which, due to their sheer size, cannot
be manually inspected and evaluated. We evaluate the memorization and
generalization tendencies in neural code intelligence models through a case
study across several benchmarks and model families by leveraging established
approaches from other fields that use DNNs, such as introducing targeted noise
into the training dataset. In addition to reinforcing prior general findings
about the extent of memorization in DNNs, our results shed light on the impact
of noisy datasets in training.
Comment: manuscript in preparation
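A minimal sketch of the targeted-noise probe on synthetic data (not the paper's code benchmarks): flip a fraction of training labels, train, and check how well the model fits the flipped points. Fitting them can only come from memorization, since their labels carry no learnable signal.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    rng = np.random.default_rng(0)
    noisy = rng.choice(len(y), size=int(0.1 * len(y)), replace=False)
    y_train = y.copy()
    y_train[noisy] = 1 - y_train[noisy]  # targeted noise: flip 10% of labels

    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y_train)
    clean = np.setdiff1d(np.arange(len(y)), noisy)
    print("accuracy on clean-label points:  ", model.score(X[clean], y_train[clean]))
    print("accuracy on flipped-label points:", model.score(X[noisy], y_train[noisy]))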
TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation
Large-scale language models have made great progress in the field of software
engineering in recent years. They can be used for many code-related tasks such
as code clone detection, code-to-code search, and method name prediction.
However, these large-scale language models, which operate on individual code tokens, have several drawbacks: they are usually large in scale, heavily dependent on labels, and require substantial computing power and time to fine-tune on new datasets. Furthermore, code embedding should be performed on the entire code snippet rather than on each code token, mainly because token-level encoding inflates the model's parameter count, leaving many parameters to store information of little interest. In
this paper, we propose a novel framework, called TransformCode, that learns
code embeddings in a contrastive learning manner. The framework uses the
Transformer encoder as an integral part of the model. We also introduce a novel
data augmentation technique called abstract syntax tree transformation: This
technique applies syntactic and semantic transformations to the original code
snippets to generate more diverse and robust anchor samples. Our proposed
framework is both flexible and adaptable: It can be easily extended to other
downstream tasks that require code representation such as code clone detection
and classification. The framework is also very efficient and scalable: It does
not require a large model or a large amount of training data, and can support
any programming language. Finally, our framework is not limited to unsupervised
learning, but can also be applied to some supervised learning tasks by
incorporating task-specific labels or objectives. To explore the effectiveness
of our framework, we conducted extensive experiments on different software
engineering tasks using different programming languages and multiple datasets.
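A minimal PyTorch sketch of the contrastive objective such a framework can optimize, with stand-in encoder outputs: each snippet's embedding and the embedding of its AST-transformed variant form a positive pair, and the other snippets in the batch serve as negatives. This is an NT-Xent-style loss; the paper's exact loss and encoder may differ.

    import torch
    import torch.nn.functional as F

    def nt_xent(z1, z2, temperature=0.1):
        """NT-Xent loss over a batch of (anchor, transformed-positive) pairs."""
        z = F.normalize(torch.cat([z1, z2]), dim=1)   # (2B, d), unit-norm
        sim = z @ z.t() / temperature                 # pairwise cosine logits
        sim.fill_diagonal_(float("-inf"))             # exclude self-similarity
        B = z1.size(0)
        targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
        return F.cross_entropy(sim, targets)

    B, d = 16, 64
    anchors = torch.randn(B, d)                    # encoder(original code)
    positives = anchors + 0.1 * torch.randn(B, d)  # encoder(transformed code)
    print(nt_xent(anchors, positives).item())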
Evaluating Representation Learning of Code Changes for Predicting Patch Correctness in Program Repair
A large body of the literature on automated program repair develops
approaches where patches are generated to be validated against an oracle (e.g.,
a test suite). Because such an oracle can be imperfect, the generated patches,
although validated by the oracle, may actually be incorrect. While the state of
the art explores research directions that require dynamic information or rely on
manually-crafted heuristics, we study the benefit of learning code
representations to derive deep features that may encode the properties of patch
correctness. Our work mainly investigates different representation learning
approaches for code changes to derive embeddings that are amenable to
similarity computations. We report on findings based on embeddings produced by
pre-trained and re-trained neural networks. Experimental results demonstrate
the potential of embeddings to empower learning algorithms in reasoning about
patch correctness: a machine learning predictor with BERT transformer-based
embeddings combined with logistic regression yielded an AUC value of about
0.8 in predicting patch correctness on a deduplicated dataset of 1000 labeled
patches. Our study shows that learned representations can lead to reasonable
performance when compared against the state of the art, PATCH-SIM, which
relies on dynamic information. These representations may further be
complementary to features that were carefully (manually) engineered in the
literature.
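A skeletal sketch of the reported setup, with a hash-seeded toy stand-in for the BERT embeddings and random labels in place of the 1,000 labeled patches: embed the buggy and patched code, combine the two vectors, and train a logistic regression scored by AUC.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    def embed(code, dim=32):
        """Toy embedding; the study uses BERT transformer-based embeddings."""
        seed = abs(hash(code)) % (2**32)
        return np.random.default_rng(seed).normal(size=dim)

    pairs = [(f"buggy_{i}", f"patched_{i}") for i in range(200)]  # placeholders
    y = np.random.default_rng(0).integers(0, 2, size=len(pairs))  # toy labels

    # Features amenable to similarity reasoning: both embeddings plus their difference.
    X = np.stack([np.concatenate([embed(b), embed(p), embed(b) - embed(p)])
                  for b, p in pairs])
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    print("AUC:", roc_auc_score(yte, clf.predict_proba(Xte)[:, 1]))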
InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees
Learning code representations has found many uses in software engineering, such as code classification, code search, comment generation, and bug prediction. Although representations of code based on tokens, syntax trees, dependency graphs, paths in trees, or combinations of their variants have been proposed, existing learning techniques have a major limitation: these models are often trained on datasets labeled for specific downstream tasks, so the code representations may not be suitable for other tasks. Even though some techniques generate representations from unlabeled code, they are far from satisfactory when applied to downstream tasks. To overcome this limitation, this paper proposes InferCode, which adapts the self-supervised learning idea from natural language processing to the abstract syntax trees (ASTs) of code. The novelty lies in training code representations by predicting subtrees automatically identified from the contexts of ASTs. With InferCode, subtrees in ASTs are treated as the labels for training the code representations, without any human labelling effort or the overhead of expensive graph construction, and the trained representations are no longer tied to any specific downstream task or code unit. We have trained an instance of the InferCode model using a Tree-Based Convolutional Neural Network (TBCNN) as the encoder on a large set of Java code. This pre-trained model can then be applied to downstream unsupervised tasks such as code clustering, code clone detection, and cross-language code search, or be reused under a transfer learning scheme to continue training the model weights for supervised tasks such as code classification and method name prediction. Compared to prior techniques applied to the same downstream tasks, such as code2vec, code2seq, and ASTNN, our pre-trained InferCode model achieves higher performance by a significant margin on most of the tasks, including those involving different programming languages. The implementation of InferCode and the trained embeddings are available at: https://github.com/bdqnghi/infercode
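A minimal PyTorch sketch of the pretext-task shape, with a bag-of-tokens encoder standing in for the paper's TBCNN over full ASTs: subtrees occurring in a snippet's AST act as free multi-label targets, so no human labels are needed.

    import torch
    import torch.nn as nn

    VOCAB, SUBTREES, DIM = 500, 100, 64  # toy sizes

    class SubtreePredictor(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.EmbeddingBag(VOCAB, DIM)  # stand-in for a TBCNN
            self.head = nn.Linear(DIM, SUBTREES)        # subtree-occurrence logits
        def forward(self, token_ids):
            return self.head(self.encoder(token_ids))

    model = SubtreePredictor()
    tokens = torch.randint(0, VOCAB, (8, 20))           # 8 toy code snippets
    present = (torch.rand(8, SUBTREES) < 0.05).float()  # subtrees found in each AST
    loss = nn.BCEWithLogitsLoss()(model(tokens), present)
    loss.backward()
    print(loss.item())
    # After pre-training, the encoder yields task-agnostic code vectors that
    # can be reused for clustering, clone detection, or supervised fine-tuning.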
Embedding Java classes with code2vec: improvements from variable obfuscation
Automatic source code analysis in key areas of software engineering, such as code security, can benefit from Machine Learning (ML). However, many standard ML approaches require a numeric representation of data and cannot be applied directly to source code. Thus, to enable ML, we need to embed source code into numeric feature vectors while maintaining the semantics of the code as much as possible. code2vec is a recently released embedding approach that uses the proxy task of method name prediction to map Java methods to feature vectors. However, experimentation with code2vec shows that it learns to rely on variable names for prediction, causing it to be easily fooled by typos or adversarial attacks. Moreover, it is only able to embed individual Java methods and cannot embed an entire collection of methods such as those present in a typical Java class, making it difficult to perform predictions at the class level (e.g., for the identification of malicious Java classes). Both shortcomings are addressed in the research presented in this paper. We investigate the effect of obfuscating variable names during training of a code2vec model to force it to rely on the structure of the code rather than specific names, and consider a simple approach to creating class-level embeddings by aggregating sets of method embeddings. Our results, obtained on a challenging new collection of source-code classification problems, indicate that obfuscating variable names produces an embedding model that is both impervious to variable naming and more accurately reflects code semantics. The datasets, models, and code are shared for further ML research on source code.
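A crude sketch of the two ideas, assuming simplified regex-based obfuscation and mean pooling (the paper retrains code2vec on obfuscated Java and may aggregate differently): rename declared variables to opaque identifiers, and average per-method embeddings into a class-level vector.

    import re
    import numpy as np

    def obfuscate(java_code):
        """Crudely rename declared local variables to var0, var1, ..."""
        decls = re.findall(r"\b(?:int|long|double|String)\s+(\w+)", java_code)
        for i, name in enumerate(dict.fromkeys(decls)):
            java_code = re.sub(rf"\b{name}\b", f"var{i}", java_code)
        return java_code

    print(obfuscate("int total = 0; int count = total + 1;"))
    # -> "int var0 = 0; int var1 = var0 + 1;"

    def class_embedding(method_vectors):
        """Aggregate per-method embeddings into one class-level vector."""
        return np.mean(method_vectors, axis=0)

    print(class_embedding([np.ones(3), np.zeros(3)]))  # -> [0.5 0.5 0.5]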