Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code
Statistical language modeling techniques have successfully been applied to
large source code corpora, yielding a variety of new software development
tools, such as tools for code suggestion, improving readability, and API
migration. A major issue with these techniques is that code introduces new
vocabulary at a far higher rate than natural language, as new identifier names
proliferate. Both large vocabularies and out-of-vocabulary issues severely
affect Neural Language Models (NLMs) of source code, degrading their
performance and rendering them unable to scale.
In this paper, we address this issue by: 1) studying how various modelling
choices impact the resulting vocabulary on a large-scale corpus of 13,362
projects; 2) presenting an open vocabulary source code NLM that can scale to
such a corpus, 100 times larger than in previous work; and 3) showing that such
models outperform the state of the art on three distinct code corpora (Java, C,
Python). To our knowledge, these are the largest NLMs for code that have been
reported.
All datasets, code, and trained models used in this work are publicly
available.
Comment: 13 pages; to appear in Proceedings of ICSE 202
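
To make the out-of-vocabulary problem described above concrete, here is a minimal sketch. The abstract does not say which subword scheme the authors use, so the camelCase and snake_case splitting below is an assumption chosen for illustration, and the names `split_identifier`, `vocabulary`, `train`, and `test` are hypothetical. It shows how unseen identifiers can still be covered once they are broken into subtokens.

```python
import re

def split_identifier(name):
    """Split snake_case and camelCase identifiers into lowercase subtokens.

    Assumption: this simple heuristic stands in for whatever subword
    scheme an open-vocabulary model would actually use.
    """
    parts = []
    for chunk in name.split("_"):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [p.lower() for p in parts if p]

def vocabulary(tokens, split=False):
    """Collect the set of whole tokens, or of subtokens when split=True."""
    vocab = set()
    for tok in tokens:
        vocab.update(split_identifier(tok) if split else [tok])
    return vocab

# Hypothetical identifiers seen during training and at test time.
train = ["getUserName", "setUserName", "user_id", "parseHTTPRequest"]
test = ["getUserId", "parse_request_name"]

closed_vocab = vocabulary(train)            # whole identifiers only
open_vocab = vocabulary(train, split=True)  # subtokens

# Every test identifier is new as a whole token...
oov_closed = [t for t in test if t not in closed_vocab]
# ...but all of its subtokens were already seen in training.
oov_open = [s for t in test for s in split_identifier(t) if s not in open_vocab]

print(oov_closed)  # ['getUserId', 'parse_request_name']
print(oov_open)    # []
```

In this toy sample both test identifiers are out of vocabulary as whole tokens, while none of their subtokens are.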
Convolutional Neural Networks over Tree Structures for Programming Language Processing
Programming language processing (similar to natural language processing) is a
hot research topic in the field of software engineering; it has also aroused
growing interest in the artificial intelligence community. However, different
from a natural language sentence, a program contains rich, explicit, and
complicated structural information. Hence, traditional NLP models may be
inappropriate for programs. In this paper, we propose a novel tree-based
convolutional neural network (TBCNN) for programming language processing, in
which a convolution kernel is designed over programs' abstract syntax trees to
capture structural information. TBCNN is a generic architecture for programming
language processing; our experiments show its effectiveness in two different
program analysis tasks: classifying programs according to functionality, and
detecting code snippets of certain patterns. TBCNN outperforms baseline
methods, including several neural models for NLP.
Comment: Accepted at AAAI-1
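
The idea of a convolution kernel over an abstract syntax tree can be sketched roughly as follows. The abstract gives no details of the kernel, so everything here is an assumption made for illustration: Python's `ast` module stands in for the parser, each window covers one node and its direct children, weights and node-type embeddings are random rather than trained, and max pooling reduces the tree to a fixed-size vector. The names `tree_conv_max_pool`, `DIM`, `w_parent`, and `w_child` are hypothetical.

```python
import ast
import math
import random

DIM = 4  # hypothetical size of node embeddings and of the output features

def make_matrix(rng):
    """Random DIM x DIM weight matrix (untrained, for illustration only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(DIM)]

def matvec(matrix, vector):
    """Plain matrix-vector product on Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def tree_conv_max_pool(source, seed=0):
    """Slide a parent-plus-children window over the AST, then max-pool."""
    rng = random.Random(seed)
    w_parent, w_child = make_matrix(rng), make_matrix(rng)
    embeddings = {}  # one random vector per AST node type

    def embed(node):
        name = type(node).__name__
        if name not in embeddings:
            embeddings[name] = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
        return embeddings[name]

    pooled = [float("-inf")] * DIM
    for node in ast.walk(ast.parse(source)):
        children = list(ast.iter_child_nodes(node))
        # Window = this node plus the average contribution of its children.
        total = matvec(w_parent, embed(node))
        for child in children:
            contrib = matvec(w_child, embed(child))
            total = [t + c / len(children) for t, c in zip(total, contrib)]
        feature = [math.tanh(t) for t in total]
        # Max pooling keeps the strongest response per feature dimension.
        pooled = [max(p, f) for p, f in zip(pooled, feature)]
    return pooled

small = tree_conv_max_pool("x = 1")
large = tree_conv_max_pool("def f(a, b):\n    return a + b if a else b\n")
print(len(small), len(large))
```

Because of the pooling step, programs of different sizes map to vectors of the same length, which is what lets a downstream classifier consume them.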