A Literature Study of Embeddings on Source Code
Natural language processing has improved tremendously after the success of
word embedding techniques such as word2vec. Recently, the same idea has been
applied on source code with encouraging results. In this survey, we aim to
collect and discuss the usage of word embedding techniques on programs and
source code. The articles in this survey have been collected by asking authors
of related work and with an extensive search on Google Scholar. Each article is
categorized into one of five categories: (1) embedding of tokens, (2) embedding of
functions or methods, (3) embedding of sequences or sets of method calls, (4)
embedding of binary code, and (5) other embeddings. We also provide links to
experimental data and show some remarkable visualizations of code embeddings. In
summary, word embedding has been successfully applied at different
granularities of source code. With access to countless open-source
repositories, we see great potential for applying other data-driven natural
language processing techniques to source code in the future.
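To make the surveyed idea concrete, here is a minimal sketch of training token-level code embeddings with gensim's word2vec; the crude lexer, the toy snippet, and the hyperparameters are illustrative choices, not taken from any surveyed paper.

```python
# Illustrative sketch: word2vec over source-code tokens, in the spirit of the
# token-embedding papers the survey covers. All inputs here are toy examples.
import re
from gensim.models import Word2Vec

def tokenize(line: str) -> list[str]:
    # Crude lexer: identifiers, integer literals, and single punctuation marks.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\w\s]", line)

snippet = """
def add(a, b):
    return a + b

def mul(a, b):
    return a * b
"""

# Each source line becomes one "sentence" of code tokens.
sentences = [tokenize(l) for l in snippet.splitlines() if l.strip()]

model = Word2Vec(sentences, vector_size=32, window=5, min_count=1, sg=1)
print(model.wv.most_similar("def"))  # tokens appearing in similar contexts
```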
IntelliCode Compose: Code Generation Using Transformer
In software development through integrated development environments (IDEs),
code completion is one of the most widely used features. Nevertheless, the
majority of integrated development environments only support completion of
methods and APIs, or arguments.
In this paper, we introduce IntelliCode Compose, a general-purpose
multilingual code completion tool which is capable of predicting sequences of
code tokens of arbitrary types, generating up to entire lines of syntactically
correct code. It leverages a state-of-the-art generative transformer model
trained on 1.2 billion lines of source code in the Python, C#, JavaScript and
TypeScript programming languages. IntelliCode Compose is deployed as a
cloud-based web service. It makes use of client-side tree-based caching,
efficient parallel implementation of the beam search decoder, and compute graph
optimizations to meet edit-time completion suggestion requirements in the
Visual Studio Code IDE and Azure Notebooks.
Our best model yields an average edit similarity of and a perplexity
of 1.82 for the Python programming language. Comment: Accepted for publication at the ESEC/FSE conference.
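For readers unfamiliar with the decoding step mentioned above, the following sketch shows a plain beam search over a next-token distribution. The `next_token_logprobs` stub is a hypothetical stand-in for a trained transformer; none of this reflects the actual IntelliCode Compose implementation.

```python
# Minimal beam-search sketch for token-level code completion.
import math

def next_token_logprobs(prefix: tuple[str, ...]) -> dict[str, float]:
    # Toy stand-in for a language model: uniform over a tiny vocabulary.
    vocab = ["print", "(", ")", '"hi"', "\n"]
    return {t: math.log(1.0 / len(vocab)) for t in vocab}

def beam_search(prefix: tuple[str, ...], width: int = 3, steps: int = 4):
    beams = [(0.0, prefix)]                      # (cumulative logprob, tokens)
    for _ in range(steps):
        candidates = []
        for score, toks in beams:
            for tok, lp in next_token_logprobs(toks).items():
                candidates.append((score + lp, toks + (tok,)))
        # Keep only the `width` highest-scoring partial completions.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:width]
    return beams

print(beam_search(("print",)))
```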
Learning Scalable and Precise Representation of Program Semantics
Neural program embedding has shown potential in aiding the analysis of
large-scale, complicated software. Newly proposed deep neural architectures
pride themselves on learning program semantics rather than superficial
syntactic features. However, by considering the source code only, the vast
majority of neural networks do not capture a deep, precise representation of
program semantics. In this paper, we present DyPro, a novel deep neural
network that learns from program execution traces. Compared to prior
dynamic models, not only is DyPro capable of generalizing across multiple
executions to learn a program's dynamic semantics in its entirety, but
DyPro is also more efficient when dealing with programs yielding long
execution traces. For evaluation, we task DyPro with semantic classification
(i.e., categorizing programs based on their semantics) and compare it against
two prominent static models: the Gated Graph Neural Network and TreeLSTM. We find
that DyPro achieves the highest prediction accuracy among all models. To
further reveal the capacity of all aforementioned deep neural architectures, we
examine whether the models can learn to detect deeper semantic properties of a
program. In particular, given the task of recognizing loop invariants, we show
that DyPro beats all static models by a wide margin. Comment: 9 pages.
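The sketch below illustrates the kind of dynamic input such a model consumes: per-line execution traces (line number plus the values of local variables) captured with Python's sys.settrace. The harness is a hypothetical example of ours, not the paper's instrumentation.

```python
# Illustrative trace collection: record (line number, local-variable state)
# for every executed line of a target function.
import sys

def trace_program(fn, *args):
    trace = []
    def tracer(frame, event, arg):
        # Only record line events inside the target function's code object.
        if event == "line" and frame.f_code is fn.__code__:
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer
    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(None)
    return trace

def sum_to(n):
    total = 0
    for i in range(n):
        total += i
    return total

for lineno, state in trace_program(sum_to, 3):
    print(lineno, state)
```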
Program Classification Using Gated Graph Attention Neural Network for Online Programming Service
Online programming services, such as GitHub, TopCoder, and EduCoder, have
promoted a lot of social interaction among their users. However, the
existing social interactions are rather limited and inefficient due to the rapid
growth in the number of source-code repositories, which are difficult to explore
manually. The emergence of source-code mining provides a promising way to analyze
those repositories, so that their source code can be more easily understood
and shared among service users. Among all the source-code mining
attempts, program classification lays a foundation for various tasks related to
source-code understanding, because it is impossible for a machine to understand
a computer program if it cannot classify the program correctly. Although
numerous machine learning models, such as Natural Language Processing (NLP)
based models and Abstract Syntax Tree (AST) based models, have been
proposed to classify computer programs based on their source code,
existing works cannot fully characterize source code from the
perspective of both syntactic and semantic information. To address this
problem, we propose a Graph Neural Network (GNN) based model, which integrates
data-flow and function-call information into the AST, and applies an improved GNN
model to the integrated graph, so as to achieve state-of-the-art program
classification accuracy. Experimental results show that the proposed
model can classify programs with an accuracy of over 97%. Comment: 12 pages, 27 figures.
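A hedged sketch of the graph-construction step: starting from a Python AST, it adds call edges from call sites to same-file function definitions on top of the syntactic child edges. The edge schema and example program are illustrative assumptions; the paper's integrated graph also carries data-flow edges and feeds a GNN.

```python
# Illustrative program graph: AST child edges plus same-file call edges.
import ast

src = """
def helper(x):
    return x + 1

def main(y):
    return helper(y) * 2
"""

tree = ast.parse(src)
nodes, edges = [], []          # edges: (src_idx, dst_idx, edge_type)
index = {}

# Node set: one entry per AST node, labeled with its node type.
for node in ast.walk(tree):
    index[id(node)] = len(nodes)
    nodes.append(type(node).__name__)

# Syntactic edges: parent -> child.
for node in ast.walk(tree):
    for child in ast.iter_child_nodes(node):
        edges.append((index[id(node)], index[id(child)], "ast_child"))

# Call edges: call site -> function definition (same file only).
defs = {n.name: n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
for node in ast.walk(tree):
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        target = defs.get(node.func.id)
        if target is not None:
            edges.append((index[id(node)], index[id(target)], "call"))

print(len(nodes), "nodes;", sum(e[2] == "call" for e in edges), "call edge(s)")
```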
Import2vec - Learning Embeddings for Software Libraries
We consider the problem of developing suitable learning representations
(embeddings) for library packages that capture semantic similarity among
libraries. Such representations are known to improve the performance of
downstream learning tasks (e.g. classification) or applications such as
contextual search and analogical reasoning.
We apply word embedding techniques from natural language processing (NLP) to
train embeddings for library packages ("library vectors"). Library vectors
represent libraries by similar context of use as determined by import
statements present in source code. Experimental results obtained from training
such embeddings on three large open source software corpora reveal that
library vectors capture semantically meaningful relationships among software
libraries, such as the relationship between frameworks and their plug-ins and
libraries commonly used together within ecosystems such as big data
infrastructure projects (in Java), front-end and back-end web development
frameworks (in JavaScript) and data science toolkits (in Python). Comment: MSR '19 conference, 11 pages.
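The core idea maps directly to a few lines of code. Under the assumption that each file's import list is treated as one "sentence", the sketch below trains a skip-gram model over a fabricated toy corpus; the corpus and hyperparameters are illustrative, not the paper's.

```python
# Sketch of the Import2vec idea: libraries imported in similar contexts
# end up with nearby vectors. The corpus below is fabricated for illustration.
from gensim.models import Word2Vec

# One list of imported packages per file (hypothetical corpus).
import_corpus = [
    ["numpy", "pandas", "matplotlib"],
    ["numpy", "scipy", "matplotlib"],
    ["flask", "sqlalchemy", "jinja2"],
    ["django", "sqlalchemy"],
    ["numpy", "pandas", "sklearn"],
]

model = Word2Vec(import_corpus, vector_size=16, window=10,
                 min_count=1, sg=1, epochs=50)
print(model.wv.most_similar("numpy", topn=3))  # expect other data-science libs
```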
Learning Programmatic Idioms for Scalable Semantic Parsing
Programmers typically organize executable source code using high-level coding
patterns or idiomatic structures such as nested loops, exception handlers and
recursive blocks, rather than as individual code tokens. In contrast, state of
the art (SOTA) semantic parsers still map natural language instructions to
source code by building the code syntax tree one node at a time. In this paper,
we introduce an iterative method to extract code idioms from large source code
corpora by repeatedly collapsing most-frequent depth-2 subtrees of their syntax
trees, and train semantic parsers to apply these idioms during decoding.
Applying idiom-based decoding on a recent context-dependent semantic parsing
task improves the SOTA by 2.2% BLEU score while reducing training time by more
than 50%. This improved speed enables us to scale up the model by training on
an extended training set that is 5x larger, to further move up the SOTA
by an additional 2.3% BLEU and 0.9% exact match. Finally, idioms also
significantly improve the accuracy of semantic parsing to SQL on the ATIS-SQL
dataset when training data is limited. Comment: Accepted at EMNLP 2019.
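A minimal sketch of the extraction step described above: count depth-2 subtrees (a node type plus the types of its immediate children) over parsed programs and report the most frequent one, which would then be collapsed into a single idiom node. The tiny corpus and the use of Python ASTs are illustrative assumptions.

```python
# Count depth-2 subtree patterns across a (toy) corpus of syntax trees.
import ast
from collections import Counter

corpus = [
    "for i in range(10):\n    print(i)",
    "for x in range(5):\n    total += x",
]

counts = Counter()
for src in corpus:
    for node in ast.walk(ast.parse(src)):
        # A depth-2 subtree: the node's type plus its children's types.
        children = tuple(type(c).__name__ for c in ast.iter_child_nodes(node))
        if children:
            counts[(type(node).__name__, children)] += 1

idiom, freq = counts.most_common(1)[0]
print("most frequent depth-2 subtree:", idiom, "seen", freq, "times")
```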
A Cross-Architecture Instruction Embedding Model for Natural Language Processing-Inspired Binary Code Analysis
Given a closed-source program, such as most of proprietary software and
viruses, binary code analysis is indispensable for many tasks, such as code
plagiarism detection and malware analysis. Today, source code is very often
compiled for various architectures, making cross-architecture binary code
analysis increasingly important. A binary, after being disassembled, is
expressed in an assembly language. Thus, recent work has started exploring Natural
Language Processing (NLP) inspired binary code analysis. In NLP, words are
usually represented in high-dimensional vectors (i.e., embeddings) to
facilitate further processing, which is one of the most common and critical
steps in many NLP tasks. We regard instructions as words in NLP-inspired binary
code analysis, and aim to represent instructions as embeddings as well.
To facilitate cross-architecture binary code analysis, our goal is that
similar instructions, regardless of their architectures, have embeddings close
to each other. To this end, we propose a joint learning approach to generating
instruction embeddings that capture not only the semantics of instructions
within an architecture, but also their semantic relationships across
architectures. To the best of our knowledge, this is the first work on building
a cross-architecture instruction embedding model. As a showcase, we apply the
model to one of the most fundamental problems for binary code
similarity comparison: semantics-based basic block comparison. The
solution outperforms the code-statistics-based approach, demonstrating that
it is promising to apply the model to other cross-architecture binary code
analysis tasks. Comment: 8 pages, 5 figures.
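The within-architecture half of this pipeline can be sketched briefly: normalize disassembled instructions so immediates and addresses do not blow up the vocabulary, then train a skip-gram model over basic blocks. The joint cross-architecture alignment that is the paper's contribution is not reproduced here, and the example blocks are made up.

```python
# Illustrative instruction normalization + skip-gram embedding (one ISA only).
import re
from gensim.models import Word2Vec

def normalize(ins: str) -> str:
    ins = re.sub(r"0x[0-9a-fA-F]+", "<addr>", ins)  # abstract addresses
    ins = re.sub(r"\b\d+\b", "<imm>", ins)          # abstract immediates
    return ins.replace(" ", "_")                    # one token per instruction

# Hypothetical disassembled basic blocks, each treated as one "sentence".
x86_blocks = [
    ["mov eax, 1", "add eax, 2", "ret"],
    ["mov ebx, 0x400100", "add ebx, 4", "ret"],
]
sentences = [[normalize(i) for i in block] for block in x86_blocks]

model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, sg=1)
print(model.wv.most_similar(normalize("add eax, 2")))
```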
Predicting Variable Types in Dynamically Typed Programming Languages
Dynamic Programming Languages are quite popular because they increase the
programmer's productivity. However, the absence of types in the source code
makes programs written in these languages difficult to understand, and the
virtual machines that execute these programs cannot produce optimized code. To
overcome this challenge, we develop a technique to predict the types of all
identifiers, including variables, as well as function return types.
We propose the first implementation of Inside-Outside
Recursive Neural Networks with two variants, (i) Child-Sum Tree-LSTMs and (ii)
N-ary RNNs, that can handle a large number of tree branches. We predict the types
of all identifiers given the Abstract Syntax Tree by performing just two
passes over the tree, bottom-up and top-down, keeping both a content and a
context representation for all the nodes of the tree. This allows these
representations to interact by combining different paths from the parent,
siblings, and children, which is crucial for predicting types. Our best model
achieves 44.33% accuracy across 21 classes and a top-3 accuracy of 71.5% on our
Python data set gathered from popular Python benchmarks.
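The two-pass scheme can be illustrated with a toy analogue: a bottom-up pass gives each AST node a "content" summary of its subtree, and a top-down pass gives it a "context" summary from its ancestors. The hash-based summaries below merely stand in for learned vectors; the real model composes Tree-LSTM states.

```python
# Toy two-pass annotation of a Python AST: content (bottom-up) and
# context (top-down) summaries per node.
import ast

def annotate(tree: ast.AST):
    content, context = {}, {}

    def up(node):                      # bottom-up: children before parent
        kids = [up(c) for c in ast.iter_child_nodes(node)]
        content[node] = hash((type(node).__name__, tuple(kids)))
        return content[node]

    def down(node, parent_ctx):        # top-down: parent before children
        context[node] = hash((parent_ctx, type(node).__name__))
        for c in ast.iter_child_nodes(node):
            down(c, context[node])

    up(tree)
    down(tree, 0)
    return content, context

tree = ast.parse("x = len('abc')")
content, context = annotate(tree)
node = tree.body[0].targets[0]         # the identifier `x`
print(type(node).__name__, content[node], context[node])
```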
Fast and Memory-Efficient Neural Code Completion
Code completion is one of the most widely used features of modern integrated
development environments (IDEs). While deep learning has made significant
progress in the statistical prediction of source code, state-of-the-art neural
network models consume hundreds of megabytes of memory, bloating the
development environment. We address this in two steps: first we present a
modular neural framework for code completion. This allows us to explore the
design space and evaluate different techniques. Second, within this framework
we design a novel reranking neural completion model that combines static
analysis with granular token encodings. The best neural reranking model
consumes just 6 MB of RAM (19x less than previous models), computes a single
completion in 8 ms, and achieves 90% accuracy in its top five suggestions.
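A hedged sketch of the reranking setup: static analysis proposes the valid completions, so the learned component only has to order a short list. The frequency-based scorer below is a hypothetical stand-in for the paper's neural reranker.

```python
# Illustrative rerank-over-static-candidates completion pipeline.
from collections import Counter

def static_candidates(receiver_type: str) -> list[str]:
    # Stand-in for IDE static analysis: members valid for the receiver's type.
    members = {"str": ["split", "strip", "startswith", "encode"]}
    return members.get(receiver_type, [])

# Hypothetical corpus statistics standing in for a learned scoring model.
usage_counts = Counter({"split": 120, "strip": 80, "startswith": 30, "encode": 10})

def rerank(candidates: list[str]) -> list[str]:
    return sorted(candidates, key=lambda c: usage_counts[c], reverse=True)

print(rerank(static_candidates("str")))  # most likely completion first
```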
MontiCore: a Framework for Compositional Development of Domain Specific Languages
Domain specific languages (DSLs) are increasingly used today. Coping with
complex language definitions, evolving them in a structured way, and ensuring
that they are free of errors are the main challenges of DSL design and
implementation. The use of modular language definitions and composition
operators is therefore inevitable in the independent development of language
components. In this
article, we discuss these arising issues by describing a framework for the
compositional development of textual DSLs and their supporting tools. We use a
redundancy-free definition of a readable concrete syntax and a comprehensible
abstract syntax as both representations significantly overlap in their
structure. For enhancing the usability of the abstract syntax, we added
concepts like associations and inheritance to a grammar-based definition in
order to build up arbitrary graphs (as known from metamodeling). Two modularity
concepts, grammar inheritance and embedding, are discussed. They permit
compositional language definition and thus simplify the extension of languages
based on already existing ones. We demonstrate that compositional engineering
of new languages is a useful concept when project-individual DSLs with
appropriate tool support are defined. Comment: 20 pages, 6 figures.