Code Prediction by Feeding Trees to Transformers
We advance the state-of-the-art in the accuracy of code prediction (next
token prediction) used in autocomplete systems. First, we report that the
recently proposed Transformer architecture, even out-of-the-box, outperforms
previous neural and non-neural systems for code prediction. We then show that
by making the Transformer architecture aware of the syntactic structure of
code, we further increase the margin by which a Transformer-based system
outperforms previous systems. With this, it outperforms the accuracy of an
RNN-based system (similar to Hellendoorn et al., 2018) by 18.3%, the Deep3
system (Raychev et al., 2016) by 14.1%, and an adaptation of Code2Seq (Alon et
al., 2018) for code prediction by 14.4%.
We present in the paper several ways of communicating the code structure to
the Transformer, which is fundamentally built for processing sequence data. We
provide a comprehensive experimental evaluation of our proposal, along with
alternative design choices, on a standard Python dataset, as well as on a
Facebook internal Python corpus. Our code and data preparation pipeline will be
made available as open source.
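One simple way to communicate syntactic structure to a sequence model is to linearize the AST with explicit bracket tokens. The sketch below shows such a linearization for a Python AST; it is an illustrative scheme, not necessarily the exact encoding used in the paper.

```python
import ast

def linearize(tree):
    """Pre-order traversal of a Python AST into a flat token sequence.

    Explicit "(" / ")" tokens preserve the tree shape so that a plain
    sequence model can still recover the syntactic structure. This is
    one simple linearization, not necessarily the paper's scheme.
    """
    tokens = []

    def visit(node):
        tokens.append(type(node).__name__)
        children = list(ast.iter_child_nodes(node))
        if children:
            tokens.append("(")
            for child in children:
                visit(child)
            tokens.append(")")

    visit(tree)
    return tokens

tokens = linearize(ast.parse("x = y + 1"))
# e.g. ['Module', '(', 'Assign', '(', 'Name', '(', 'Store', ')', ...]
```

The bracket tokens roughly double the sequence length, which is the usual trade-off of structure-aware linearizations.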
A Mocktail of Source Code Representations
Efficient representation of source code is essential for various software
engineering tasks such as code search and code clone detection. One such
technique for representing source code involves extracting paths from the AST
and using a learning model to capture program properties. Code2vec is a
commonly used path-based approach that uses an attention-based neural network
to learn code embeddings which can then be used for various software
engineering tasks. However, this approach uses only ASTs and does not leverage
other graph structures such as Control Flow Graphs (CFG) and Program Dependency
Graphs (PDG). Similarly, most recent approaches for representing source code
still use AST and do not leverage semantic graph structures. Even though there
exists an integrated graph approach (Code Property Graph) for representing
source code, it has only been explored in the domain of software security.
Moreover, it does not leverage the paths from the individual graphs. In our
work, we extend the path-based approach code2vec to include semantic graphs,
CFG, and PDG, along with AST, which is still largely unexplored in the domain
of software engineering. We evaluate our approach on the task of MethodNaming
using a custom C dataset of 730K methods collected from 16 C projects from
GitHub. In comparison to code2vec, our approach improves the F1 Score by 11% on
the full dataset and up to 100% with individual projects. We show that semantic
features from the CFG and PDG paths are indeed helpful. We envision that
looking at a mocktail of source code representations for various software
engineering tasks can lay the foundation for a new line of research and an
overhaul of existing approaches.
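The AST ingredient of such a mocktail can be sketched as code2vec-style path extraction: for each pair of leaves, record the chain of node types through their lowest common ancestor. This is a simplified illustration (real code2vec also attaches the terminal identifiers and hashes the paths); the CFG and PDG paths the paper adds are not shown.

```python
import ast
from itertools import combinations

def leaf_to_leaf_paths(code):
    """Extract code2vec-style AST path contexts from Python source.

    Each context is the list of node-type names on the path from one
    leaf up to the lowest common ancestor and down to another leaf.
    """
    tree = ast.parse(code)
    leaves = []  # root-to-leaf chains of AST nodes

    def walk(node, chain):
        chain = chain + [node]
        children = list(ast.iter_child_nodes(node))
        if children:
            for child in children:
                walk(child, chain)
        else:
            leaves.append(chain)

    walk(tree, [])
    contexts = []
    for a, b in combinations(leaves, 2):
        i = 0  # length of the shared root prefix
        while i < min(len(a), len(b)) and a[i] is b[i]:
            i += 1
        up = [type(n).__name__ for n in reversed(a[i:])]
        down = [type(n).__name__ for n in b[i:]]
        # a[i-1] is the lowest common ancestor of the two leaves
        contexts.append(up + [type(a[i - 1]).__name__] + down)
    return contexts

paths = leaf_to_leaf_paths("x = y + 1")
```

Analogous extractors over CFG and PDG edges would supply the semantic paths that the AST alone cannot capture.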
Code Completion by Modeling Flattened Abstract Syntax Trees as Graphs
Code completion has become an essential component of integrated development
environments. Contemporary code completion methods rely on the abstract syntax
tree (AST) to generate syntactically correct code. However, they cannot fully
capture the sequential and repetitive patterns of writing code and the
structural information of the AST. To alleviate these problems, we propose a
new code completion approach named CCAG, which models the flattened sequence of
a partial AST as an AST graph. CCAG uses our proposed AST Graph Attention Block
to capture different dependencies in the AST graph for representation learning
in code completion. The sub-tasks of code completion are optimized via
multi-task learning in CCAG, and the task balance is automatically achieved
using uncertainty without the need to tune task weights. The experimental
results show that CCAG outperforms state-of-the-art approaches and is able to
provide intelligent code completion.
Comment: Accepted at AAAI 2021. This version contains the appendix for the
derivation of Eq. 1.
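The core data structure can be sketched as follows: flatten a partial AST in pre-order and connect the flattened nodes with more than one edge type, so that an attention block can distinguish tree structure from writing order. The edge types below are illustrative, not CCAG's exact relations.

```python
import ast

def ast_graph(code):
    """Flatten a Python AST in pre-order and build two edge sets.

    child_edges follow the tree structure; next_edges follow the
    flattened writing order. A graph attention block can then attend
    over each relation separately (edge types are illustrative).
    """
    tree = ast.parse(code)
    nodes, child_edges = [], []

    def visit(node, parent_idx):
        idx = len(nodes)
        nodes.append(type(node).__name__)
        if parent_idx is not None:
            child_edges.append((parent_idx, idx))
        for child in ast.iter_child_nodes(node):
            visit(child, idx)

    visit(tree, None)
    # sequential edges over the flattened order
    next_edges = [(i, i + 1) for i in range(len(nodes) - 1)]
    return nodes, child_edges, next_edges

nodes, child_edges, next_edges = ast_graph("x = y + 1")
```

Keeping both edge sets is what lets one model capture the sequential patterns of typing and the hierarchical structure of the AST at once.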
LExecutor: Learning-Guided Execution
Executing code is essential for various program analysis tasks, e.g., to
detect bugs that manifest through exceptions or to obtain execution traces for
further dynamic analysis. However, executing an arbitrary piece of code is
often difficult in practice, e.g., because of missing variable definitions,
missing user inputs, and missing third-party dependencies. This paper presents
LExecutor, a learning-guided approach for executing arbitrary code snippets in
an underconstrained way. The key idea is to let a neural model predict missing
values that otherwise would cause the program to get stuck, and to inject these
values into the execution. For example, LExecutor injects likely values for
otherwise undefined variables and likely return values of calls to otherwise
missing functions. We evaluate the approach on Python code from popular
open-source projects and on code snippets extracted from Stack Overflow. The
neural model predicts realistic values with an accuracy between 79.5% and
98.2%, allowing LExecutor to closely mimic real executions. As a result, the
approach successfully executes significantly more code than any available
technique, such as simply executing the code as-is. For example, executing the
open-source code snippets as-is covers only 4.1% of all lines, because the code
crashes early on, whereas LExecutor achieves a coverage of 51.6%.
Comment: Accepted in the research track of the ACM Joint European Software
Engineering Conference and Symposium on the Foundations of Software
Engineering (ESEC/FSE) 202
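The value-injection idea can be sketched without any learning component: execute the snippet in a namespace that synthesizes a value whenever a name is undefined. The fixed dummy value below stands in for the neural model's prediction, and `PredictingNamespace` / `lexecute` are hypothetical names, not LExecutor's actual API.

```python
import builtins

class PredictingNamespace(dict):
    """Supplies a synthesized value for any undefined name.

    A real system would query a trained model here; this sketch
    injects a fixed dummy value instead.
    """

    def __missing__(self, name):
        if hasattr(builtins, name):      # let real built-ins through
            return getattr(builtins, name)
        value = 0                        # "predicted" value for the name
        self[name] = value
        return value

def lexecute(snippet):
    """Run a snippet, injecting values for otherwise-undefined names."""
    ns = PredictingNamespace()
    # Top-level code compiles name reads to LOAD_NAME, which consults
    # the locals mapping (triggering __missing__) before globals.
    exec(snippet, {}, ns)
    return dict(ns)

result = lexecute("total = undefined_a + undefined_b + 1")
print(result["total"])  # 1: both undefined names were injected as 0
```

A snippet that would crash as-is with a NameError now runs to completion, which is exactly the underconstrained execution the paper needs for coverage.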
Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks
Deep learning (DL) techniques are gaining more and more attention in the
software engineering community. They have been used to support several
code-related tasks, such as automatic bug fixing and code comments generation.
Recent studies in the Natural Language Processing (NLP) field have shown that
the Text-To-Text Transfer Transformer (T5) architecture can achieve
state-of-the-art performance for a variety of NLP tasks. The basic idea behind
T5 is to first pre-train a model on a large and generic dataset using a
self-supervised task (e.g., filling masked words in sentences). Once the model
is pre-trained, it is fine-tuned on smaller and specialized datasets, each one
related to a specific task (e.g., language translation, sentence
classification). In this paper, we empirically investigate how the T5 model
performs when pre-trained and fine-tuned to support code-related tasks. We
pre-train a T5 model on a dataset composed of natural language English text and
source code. Then, we fine-tune such a model by reusing datasets used in four
previous works that used DL techniques to: (i) fix bugs, (ii) inject code
mutants, (iii) generate assert statements, and (iv) generate code comments. We
compare the performance of this single model with the results reported in the
four original papers proposing DL-based solutions for those four tasks. We show
that our T5 model, exploiting additional data for the self-supervised
pre-training phase, can achieve performance improvements over the four
baselines.
Comment: Accepted to the 43rd International Conference on Software Engineering
(ICSE 2021).
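The self-supervised pre-training objective can be sketched as span corruption with sentinel tokens. The version below is simplified to single-token spans, whereas T5 proper masks contiguous multi-token spans and merges each span under one sentinel.

```python
import random

def t5_mask(tokens, mask_prob=0.3, seed=0):
    """Sketch of T5-style span corruption (single-token spans only).

    Masked positions become sentinel tokens in the input; the target
    pairs each sentinel with the token it hides, so the model learns
    to reconstruct the masked content.
    """
    rng = random.Random(seed)
    inp, target, sentinel = [], [], 0
    for tok in tokens:
        if rng.random() < mask_prob:
            inp.append(f"<extra_id_{sentinel}>")
            target += [f"<extra_id_{sentinel}>", tok]
            sentinel += 1
        else:
            inp.append(tok)
    return inp, target

inp, target = t5_mask("public static void main ( ) { }".split(),
                      mask_prob=0.5, seed=1)
```

The same objective applies unchanged to source-code tokens, which is why a single pre-trained model can be fine-tuned on all four downstream tasks.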