Structured Generative Models of Natural Source Code
We study the problem of building generative models of natural source code
(NSC); that is, source code written and understood by humans. Our primary
contribution is to describe a family of generative models for NSC that have
three key properties: First, they incorporate both sequential and hierarchical
structure. Second, they learn a distributed representation of source code
elements. Finally, they integrate closely with a compiler, which allows
leveraging compiler logic and abstractions when building structure into the
model. We also develop an extension that includes more complex structure,
refining how the model generates identifier tokens based on what variables are
currently in scope. Our models can be learned efficiently, and we show
empirically that including appropriate structure greatly improves the models,
as measured by the probability of generating test programs.
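The scope-aware extension admits a compact illustration. Below is a minimal Python sketch of the idea, not the authors' implementation: when the model emits an identifier token, the distribution is restricted to variables currently in scope and renormalized. The function name, the toy logits, and the dictionary representation are all assumptions made for illustration.

    import math
    import random

    def sample_identifier(logits, in_scope):
        # Keep only identifiers currently in scope, renormalize, and sample:
        # a toy version of the scope-based refinement. Assumes at least one
        # in-scope identifier has a score.
        weights = {v: math.exp(s) for v, s in logits.items() if v in in_scope}
        total = sum(weights.values())
        r, acc = random.random() * total, 0.0
        for v, w in weights.items():
            acc += w
            if acc >= r:
                return v
        return next(iter(weights))  # guard against floating-point round-off

    # Hypothetical usage: the model scores three identifiers, but only two
    # are in scope at the generation site.
    print(sample_identifier({"x": 1.2, "y": 0.3, "total": 2.5}, {"x", "y"}))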
Learning Scalable and Precise Representation of Program Semantics
Neural program embedding has shown potential in aiding the analysis of
large-scale, complicated software. Newly proposed deep neural architectures
pride themselves on learning program semantics rather than superficial
syntactic features. However, by considering the source code only, the vast
majority of neural networks do not capture a deep, precise representation of
program semantics. In this paper, we present DYPRO, a novel deep neural
network that learns from program execution traces. Compared to prior
dynamic models, not only is DYPRO capable of generalizing across multiple
executions to learn a program's dynamic semantics in its entirety, but
DYPRO is also more efficient when dealing with programs that yield long
execution traces. For evaluation, we task DYPRO with semantic classification
(i.e., categorizing programs based on their semantics) and compare it against
two prominent static models: Gated Graph Neural Network and TreeLSTM. We find
that DYPRO achieves the highest prediction accuracy among all models. To
further probe the capacity of these deep neural architectures, we
examine whether the models can learn to detect deeper semantic properties of a
program. In particular, given the task of recognizing loop invariants, we show
that DYPRO beats all static models by a wide margin.
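As a rough illustration of learning from traces, here is a minimal PyTorch sketch, not the DYPRO architecture itself: each execution trace is a sequence of program-state vectors encoded by a GRU, and the per-execution encodings are pooled into a single program embedding for classification. The dimensions and the mean-pooling across executions are assumptions for illustration.

    import torch
    import torch.nn as nn

    class TraceEncoder(nn.Module):
        def __init__(self, state_dim, hidden_dim, num_classes):
            super().__init__()
            self.rnn = nn.GRU(state_dim, hidden_dim, batch_first=True)
            self.cls = nn.Linear(hidden_dim, num_classes)

        def forward(self, traces):
            # traces: (num_executions, steps, state_dim) -- each row is one
            # execution, each step a vectorized program state.
            _, h = self.rnn(traces)        # h: (1, num_executions, hidden_dim)
            program = h[0].mean(dim=0)     # pool across executions
            return self.cls(program)       # logits over semantic classes

    model = TraceEncoder(state_dim=8, hidden_dim=32, num_classes=5)
    print(model(torch.randn(3, 20, 8)).shape)  # torch.Size([5])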
Review Networks for Caption Generation
We propose a novel extension of the encoder-decoder framework, called a
review network. The review network is generic and can enhance any existing
encoder-decoder model: in this paper, we consider RNN decoders with both CNN
and RNN encoders. The review network performs a number of review steps with an
attention mechanism over the encoder hidden states, and outputs a thought vector
after each review step; the thought vectors are used as the input to the
attention mechanism in the decoder. We show that conventional encoder-decoders
are a special case of our framework. Empirically, we show that our framework
improves over state-of-the-art encoder-decoder systems on the tasks of image
captioning and source code captioning.
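The review step is simple enough to sketch. The following PyTorch fragment is a hypothetical single-example rendering of the mechanism described above, not the paper's code: each of T review steps attends over the encoder hidden states and emits a thought vector, and the stack of thought vectors is what the decoder later attends over. Initializing the reviewer state with the mean encoder state is an assumption.

    import torch
    import torch.nn as nn

    class Reviewer(nn.Module):
        def __init__(self, dim, steps=8):
            super().__init__()
            self.steps = steps
            self.query = nn.Linear(dim, dim)
            self.cell = nn.GRUCell(dim, dim)

        def forward(self, enc):                 # enc: (seq_len, dim)
            h = enc.mean(dim=0)                 # assumed initial review state
            thoughts = []
            for _ in range(self.steps):
                att = torch.softmax(enc @ self.query(h), dim=0)   # (seq_len,)
                ctx = att @ enc                                   # (dim,)
                h = self.cell(ctx.unsqueeze(0), h.unsqueeze(0)).squeeze(0)
                thoughts.append(h)              # one thought vector per step
            return torch.stack(thoughts)        # (steps, dim): decoder input

    print(Reviewer(dim=16)(torch.randn(10, 16)).shape)  # torch.Size([8, 16])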
COSET: A Benchmark for Evaluating Neural Program Embeddings
Neural program embedding can be helpful in analyzing large software, a task
that is challenging for traditional logic-based program analyses due to their
limited scalability. A key focus of recent machine-learning advances in this
area is on modeling program semantics instead of just syntax. Unfortunately,
evaluating such advances is not straightforward, as program semantics does
not lend itself to direct metrics. In this paper, we introduce a benchmarking
framework called COSET for standardizing the evaluation of neural program
embeddings. COSET consists of a diverse dataset of programs in source-code
format, labeled by human experts according to a number of program properties of
interest. A point of novelty is the suite of program transformations included in
COSET. These transformations, when applied to the base dataset, can simulate
natural changes to program code due to optimization and refactoring, and can
serve as a "debugging" tool for classification mistakes. We conducted a pilot
study on four prominent models: TreeLSTM, gated graph neural network (GGNN),
AST-Path neural network (APNN), and DYPRO. We found that COSET is useful in
identifying the strengths and limitations of each model and in pinpointing
specific syntactic and semantic characteristics of programs that pose
challenges.
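To make the transformation suite concrete, here is a toy semantics-preserving transformation of the kind COSET's suite is described as containing; this specific rewrite is our illustration, not necessarily one of COSET's. Consistent variable renaming should leave a semantics-focused model's prediction unchanged.

    import ast

    class RenameVars(ast.NodeTransformer):
        # Toy transformation: consistently rename every parameter and
        # variable. Only suitable for closed snippets that make no calls
        # to external names.
        def __init__(self):
            super().__init__()
            self.mapping = {}

        def visit_arg(self, node):
            node.arg = self.mapping.setdefault(node.arg, f"v{len(self.mapping)}")
            return node

        def visit_Name(self, node):
            new = self.mapping.setdefault(node.id, f"v{len(self.mapping)}")
            return ast.copy_location(ast.Name(id=new, ctx=node.ctx), node)

    src = "def f(x, y):\n    total = x + y\n    return total"
    print(ast.unparse(RenameVars().visit(ast.parse(src))))
    # def f(v0, v1):
    #     v2 = v0 + v1
    #     return v2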
Neural Attribute Grammars for Semantics-Guided Program Generation
Existing deep models for code tend to be trained on syntactic program
representations. We present an alternative, called Neural Attribute Grammars,
that exposes the semantics of the target language to the training procedure
using an attribute grammar. During training, our model learns to replicate the
relationship between the syntactic rules used to construct a program, and the
semantic attributes (for example, symbol tables) constructed from the context
in which the rules are fired. We implement the approach as a system for
conditional generation of Java programs modulo eleven natural requirements. Our
experiments show that the system generates constraint-abiding programs
significantly more frequently than a baseline model trained on syntactic
program representations, and also outperforms the baseline in terms of
generation accuracy.
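The core mechanism, semantic attributes constraining which rules may fire, can be sketched in a few lines. The grammar and rule names below are illustrative assumptions, not the paper's Java grammar: an inherited symbol table masks out expansions that would read an undeclared variable.

    def legal_expansions(nonterminal, rules, symbol_table):
        # Keep only productions consistent with the inherited attribute
        # (here, a symbol table of declared variables).
        legal = []
        for rhs in rules[nonterminal]:
            if "USE_VAR" in rhs and not symbol_table:
                continue  # nothing declared yet, so a variable read is illegal
            legal.append(rhs)
        return legal

    rules = {"Stmt": [("DECL_VAR",), ("USE_VAR",)]}
    print(legal_expansions("Stmt", rules, symbol_table=set()))   # [('DECL_VAR',)]
    print(legal_expansions("Stmt", rules, symbol_table={"x"}))   # both rules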
Learning Probabilistic Programs
We develop a technique for generalising from data in which models are
samplers represented as program text. We establish encouraging empirical
results that suggest that Markov chain Monte Carlo probabilistic programming
inference techniques coupled with higher-order probabilistic programming
languages are now sufficiently powerful to enable successful inference of this
kind in nontrivial domains. We also introduce a new notion of probabilistic
program compilation and show how the same machinery might be used in the future
to compile probabilistic programs for efficient reusable predictive inference.
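For a feel of the inference setting, here is a deliberately tiny Metropolis-Hastings sketch. The paper infers program text, i.e. sampler structure; this toy fixes the sampler to random.gauss(mu, 1.0) and infers only its parameter, so the Gaussian form and step size are assumptions for illustration.

    import math
    import random

    def log_lik(mu, data):
        # Log-density of the data under the sampler "random.gauss(mu, 1.0)",
        # up to an additive constant.
        return sum(-0.5 * (x - mu) ** 2 for x in data)

    def mh_infer(data, iters=5000, step=0.5):
        mu = 0.0
        for _ in range(iters):
            prop = mu + random.gauss(0.0, step)        # random-walk proposal
            delta = log_lik(prop, data) - log_lik(mu, data)
            if random.random() < math.exp(min(0.0, delta)):
                mu = prop                              # accept the proposal
        return mu

    data = [random.gauss(3.0, 1.0) for _ in range(100)]
    print(mh_infer(data))  # should land near 3.0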
Learning to Represent Programs with Graphs
Learning tasks on source code (i.e., formal languages) have been considered
recently, but most work has tried to transfer natural language methods and does
not capitalize on the unique opportunities offered by code's known syntax. For
example, long-range dependencies induced by using the same variable or function
in distant locations are often not considered. We propose to use graphs to
represent both the syntactic and semantic structure of code and use graph-based
deep learning methods to learn to reason over program structures.
In this work, we show how to construct graphs from source code and how to
scale Gated Graph Neural Network training to such large graphs. We evaluate
our method on two tasks: VarNaming, in which a network attempts to predict the
name of a variable given its usage, and VarMisuse, in which the network learns
to reason about selecting the correct variable that should be used at a given
program location. Our comparison to methods that use less structured program
representations shows the advantages of modeling known structure, and suggests
that our models learn to infer meaningful names and to solve the VarMisuse task
in many cases. Additionally, our testing showed that VarMisuse identifies a
number of bugs in mature open-source projects.
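Graph construction of the kind described here can be sketched directly on Python ASTs. The fragment below is an illustrative approximation, not the paper's pipeline: syntactic child edges from the AST, plus a NextUse edge linking successive occurrences of the same variable as a crude stand-in for the paper's dataflow edges. Note that ast.walk is breadth-first, so occurrence order only approximates source order.

    import ast

    def build_program_graph(src):
        tree = ast.parse(src)
        nodes = list(ast.walk(tree))
        index = {id(n): i for i, n in enumerate(nodes)}
        edges, last_use = [], {}
        for node in nodes:
            # Syntactic structure: parent-to-child AST edges.
            for child in ast.iter_child_nodes(node):
                edges.append(("Child", index[id(node)], index[id(child)]))
            # Crude semantic structure: chain uses of the same variable.
            if isinstance(node, ast.Name):
                if node.id in last_use:
                    edges.append(("NextUse", last_use[node.id], index[id(node)]))
                last_use[node.id] = index[id(node)]
        return nodes, edges

    _, edges = build_program_graph("a = 1\nb = a + 2\nprint(b)")
    print([e for e in edges if e[0] == "NextUse"])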
CodeGRU: Context-aware Deep Learning with Gated Recurrent Unit for Source Code Modeling
Recently, deep learning based Natural Language Processing (NLP) models have
shown great potential in the modeling of source code. However, a major
limitation of these approaches is that they take source code as simple tokens
of text and ignore its contextual, syntactic, and structural dependencies. In
this work, we present CodeGRU, a gated recurrent unit based source code
language model that is capable of capturing source code's contextual,
syntactic, and structural dependencies. We introduce a novel approach that
captures source code context by leveraging source code token types.
Further, we adopt an approach that learns variable-size context by
taking into account the source code's syntax and structural information. We
evaluate CodeGRU on a real-world dataset, showing that CodeGRU outperforms
state-of-the-art language models and helps reduce the vocabulary size by up to
24.93%. Unlike previous work, we test CodeGRU on an independent test set,
which suggests that our methodology does not require the source code to come
from the same domain as the training data when providing suggestions. We further
evaluate CodeGRU on two software engineering applications: source code
suggestion and source code completion. Our experiments confirm that source
code's contextual information can be vital and can help improve software
language models. The extensive evaluation of CodeGRU shows that it outperforms
state-of-the-art models. The results further suggest that the proposed
approach can help reduce the vocabulary size and is of practical use for
software developers.
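The token-type idea can be illustrated with Python's own tokenizer. The abstraction rules below, which collapse numeric and string literals to NUM and STR and drop layout tokens, are our assumptions about one way such a type-aware encoding shrinks the vocabulary; they are not CodeGRU's exact scheme.

    import io
    import tokenize

    def type_abstracted_tokens(src):
        out = []
        for tok in tokenize.generate_tokens(io.StringIO(src).readline):
            if tok.type == tokenize.NUMBER:
                out.append("NUM")                 # abstract numeric literals
            elif tok.type == tokenize.STRING:
                out.append("STR")                 # abstract string literals
            elif tok.type in (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
                              tokenize.DEDENT, tokenize.ENDMARKER):
                continue                          # drop pure-layout tokens
            else:
                out.append(tok.string)
        return out

    print(type_abstracted_tokens('if count > 10:\n    print("done", 3.14)\n'))
    # ['if', 'count', '>', 'NUM', ':', 'print', '(', 'STR', ',', 'NUM', ')']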
Generative Code Modeling with Graphs
Generative models for source code are an interesting structured prediction
problem, which requires reasoning about both hard syntactic and semantic constraints
as well as about natural, likely programs. We present a novel model for this
problem that uses a graph to represent the intermediate state of the generated
output. The generative procedure interleaves grammar-driven expansion steps
with graph augmentation and neural message passing steps. An experimental
evaluation shows that our new model can generate semantically meaningful
expressions, outperforming a range of strong baselines.
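The interleaving itself can be shown structurally. In the hypothetical sketch below, score_rule stands in for the neural component (graph augmentation plus message passing), which is the part this toy omits; the grammar, node encoding, and budget cap are illustrative assumptions.

    def generate(grammar, score_rule, max_steps=100):
        graph = [("Root", None)]           # nodes as (symbol, parent_index)
        frontier = [0]                     # unexpanded nonterminal nodes
        steps = 0
        while frontier and steps < max_steps:
            i = frontier.pop(0)
            sym = graph[i][0]
            # Grammar-driven expansion: pick the highest-scoring rule given
            # the current partial graph (neural scoring abstracted away).
            rhs = max(grammar[sym], key=lambda r: score_rule(graph, i, r))
            for child in rhs:
                graph.append((child, i))   # graph augmentation step
                if child in grammar:
                    frontier.append(len(graph) - 1)
            steps += 1                     # budget cap; a toy simplification
        return graph

    import random
    toy_grammar = {"Root": [("Expr",)], "Expr": [("Expr", "+", "Expr"), ("lit",)]}
    print(generate(toy_grammar, lambda g, i, r: random.random()))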
Program Classification Using Gated Graph Attention Neural Network for Online Programming Service
Online programming services, such as GitHub, TopCoder, and EduCoder, have
promoted a great deal of social interaction among their users. However, the
existing social interaction is rather limited and inefficient due to the rapid
growth of source-code repositories, which are difficult to explore manually.
The emergence of source-code mining provides a promising way to analyze these
source codes, so that they can be more easily understood and shared among
service users. Among all source-code mining attempts, program classification
lays the foundation for various tasks related to source-code understanding,
because it is impossible for a machine to understand a computer program if it
cannot classify the program correctly. Although numerous machine learning
models, such as Natural Language Processing (NLP) based models and Abstract
Syntax Tree (AST) based models, have been proposed to classify computer
programs based on their source code, existing works cannot fully characterize
source code from the perspective of both syntactic and semantic information.
To address this problem, we propose a Graph Neural Network (GNN) based model,
which integrates data-flow and function-call information into the AST, and
applies an improved GNN model to the integrated graph, so as to achieve
state-of-the-art program classification accuracy. The experimental results
show that the proposed approach can classify programs with an accuracy of
over 97%.
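A minimal PyTorch rendering of the kind of typed message passing involved, assuming hypothetical edge types Child, DataFlow, and Call, and a single gated update round; the paper's improved GNN and its attention mechanism are not reproduced here.

    import torch
    import torch.nn as nn

    class TypedMessagePassing(nn.Module):
        def __init__(self, dim, num_edge_types=3):   # Child, DataFlow, Call
            super().__init__()
            self.msg = nn.ModuleList(nn.Linear(dim, dim)
                                     for _ in range(num_edge_types))
            self.gru = nn.GRUCell(dim, dim)

        def forward(self, h, edges):
            # h: (num_nodes, dim); edges: list of (edge_type, src, dst).
            agg = torch.zeros_like(h)
            for t, s, d in edges:
                agg[d] = agg[d] + self.msg[t](h[s])  # type-specific message
            return self.gru(agg, h)                  # gated node update

    h = torch.randn(4, 16)
    edges = [(0, 0, 1), (1, 1, 2), (2, 2, 3)]
    print(TypedMessagePassing(16)(h, edges).shape)   # torch.Size([4, 16])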