47,121 research outputs found

    A Literature Study of Embeddings on Source Code

    Full text link
    Natural language processing has improved tremendously after the success of word embedding techniques such as word2vec. Recently, the same idea has been applied on source code with encouraging results. In this survey, we aim to collect and discuss the usage of word embedding techniques on programs and source code. The articles in this survey have been collected by asking authors of related work and with an extensive search on Google Scholar. Each article is categorized into five categories: 1. embedding of tokens 2. embedding of functions or methods 3. embedding of sequences or sets of method calls 4. embedding of binary code 5. other embeddings. We also provide links to experimental data and show some remarkable visualization of code embeddings. In summary, word embedding has been successfully applied on different granularities of source code. With access to countless open-source repositories, we see a great potential of applying other data-driven natural language processing techniques on source code in the future

    IntelliCode Compose: Code Generation Using Transformer

    Full text link
    In software development through integrated development environments (IDEs), code completion is one of the most widely used features. Nevertheless, majority of integrated development environments only support completion of methods and APIs, or arguments. In this paper, we introduce IntelliCode Compose −- a general-purpose multilingual code completion tool which is capable of predicting sequences of code tokens of arbitrary types, generating up to entire lines of syntactically correct code. It leverages state-of-the-art generative transformer model trained on 1.2 billion lines of source code in Python, C#C\#, JavaScript and TypeScript programming languages. IntelliCode Compose is deployed as a cloud-based web service. It makes use of client-side tree-based caching, efficient parallel implementation of the beam search decoder, and compute graph optimizations to meet edit-time completion suggestion requirements in the Visual Studio Code IDE and Azure Notebook. Our best model yields an average edit similarity of 86.7%86.7\% and a perplexity of 1.82 for Python programming language.Comment: Accepted for publication at ESEC/FSE conferenc

    Learning Scalable and Precise Representation of Program Semantics

    Full text link
    Neural program embedding has shown potential in aiding the analysis of large-scale, complicated software. Newly proposed deep neural architectures pride themselves on learning program semantics rather than superficial syntactic features. However, by considering the source code only, the vast majority of neural networks do not capture a deep, precise representation of program semantics. In this paper, we present \dypro, a novel deep neural network that learns from program execution traces. Compared to the prior dynamic models, not only is \dypro capable of generalizing across multiple executions for learning a program's dynamic semantics in its entirety, but \dypro is also more efficient when dealing with programs yielding long execution traces. For evaluation, we task \dypro with semantic classification (i.e. categorizing programs based on their semantics) and compared it against two prominent static models: Gated Graph Neural Network and TreeLSTM. We find that \dypro achieves the highest prediction accuracy among all models. To further reveal the capacity of all aforementioned deep neural architectures, we examine if the models can learn to detect deeper semantic properties of a program. In particular given a task of recognizing loop invariants, we show \dypro beats all static models by a wide margin.Comment: 9 page

    Program Classification Using Gated Graph Attention Neural Network for Online Programming Service

    Full text link
    The online programing services, such as Github,TopCoder, and EduCoder, have promoted a lot of social interactions among the service users. However, the existing social interactions is rather limited and inefficient due to the rapid increasing of source-code repositories, which is difficult to explore manually. The emergence of source-code mining provides a promising way to analyze those source codes, so that those source codes can be relatively easy to understand and share among those service users. Among all the source-code mining attempts,program classification lays a foundation for various tasks related to source-code understanding, because it is impossible for a machine to understand a computer program if it cannot classify the program correctly. Although numerous machine learning models, such as the Natural Language Processing (NLP) based models and the Abstract Syntax Tree (AST) based models, have been proposed to classify computer programs based on their corresponding source codes, the existing works cannot fully characterize the source codes from the perspective of both the syntax and semantic information. To address this problem, we proposed a Graph Neural Network (GNN) based model, which integrates data flow and function call information to the AST,and applies an improved GNN model to the integrated graph, so as to achieve the state-of-art program classification accuracy. The experiment results have shown that the proposed work can classify programs with accuracy over 97%.Comment: 12 pages, 27 figure

    Import2vec - Learning Embeddings for Software Libraries

    Full text link
    We consider the problem of developing suitable learning representations (embeddings) for library packages that capture semantic similarity among libraries. Such representations are known to improve the performance of downstream learning tasks (e.g. classification) or applications such as contextual search and analogical reasoning. We apply word embedding techniques from natural language processing (NLP) to train embeddings for library packages ("library vectors"). Library vectors represent libraries by similar context of use as determined by import statements present in source code. Experimental results obtained from training such embeddings on three large open source software corpora reveals that library vectors capture semantically meaningful relationships among software libraries, such as the relationship between frameworks and their plug-ins and libraries commonly used together within ecosystems such as big data infrastructure projects (in Java), front-end and back-end web development frameworks (in JavaScript) and data science toolkits (in Python).Comment: MSR19 Conference 11 page

    Learning Programmatic Idioms for Scalable Semantic Parsing

    Full text link
    Programmers typically organize executable source code using high-level coding patterns or idiomatic structures such as nested loops, exception handlers and recursive blocks, rather than as individual code tokens. In contrast, state of the art (SOTA) semantic parsers still map natural language instructions to source code by building the code syntax tree one node at a time. In this paper, we introduce an iterative method to extract code idioms from large source code corpora by repeatedly collapsing most-frequent depth-2 subtrees of their syntax trees, and train semantic parsers to apply these idioms during decoding. Applying idiom-based decoding on a recent context-dependent semantic parsing task improves the SOTA by 2.2\% BLEU score while reducing training time by more than 50\%. This improved speed enables us to scale up the model by training on an extended training set that is 5×\times larger, to further move up the SOTA by an additional 2.3\% BLEU and 0.9\% exact match. Finally, idioms also significantly improve accuracy of semantic parsing to SQL on the ATIS-SQL dataset, when training data is limited.Comment: Accepted at EMNLP 201

    A Cross-Architecture Instruction Embedding Model for Natural Language Processing-Inspired Binary Code Analysis

    Full text link
    Given a closed-source program, such as most of proprietary software and viruses, binary code analysis is indispensable for many tasks, such as code plagiarism detection and malware analysis. Today, source code is very often compiled for various architectures, making cross-architecture binary code analysis increasingly important. A binary, after being disassembled, is expressed in an assembly languages. Thus, recent work starts exploring Natural Language Processing (NLP) inspired binary code analysis. In NLP, words are usually represented in high-dimensional vectors (i.e., embeddings) to facilitate further processing, which is one of the most common and critical steps in many NLP tasks. We regard instructions as words in NLP-inspired binary code analysis, and aim to represent instructions as embeddings as well. To facilitate cross-architecture binary code analysis, our goal is that similar instructions, regardless of their architectures, have embeddings close to each other. To this end, we propose a joint learning approach to generating instruction embeddings that capture not only the semantics of instructions within an architecture, but also their semantic relationships across architectures. To the best of our knowledge, this is the first work on building cross-architecture instruction embedding model. As a showcase, we apply the model to resolving one of the most fundamental problems for binary code similarity comparison---semantics-based basic block comparison, and the solution outperforms the code statistics based approach. It demonstrates that it is promising to apply the model to other cross-architecture binary code analysis tasks.Comment: 8 pages, 5 figure

    Predicting Variable Types in Dynamically Typed Programming Languages

    Full text link
    Dynamic Programming Languages are quite popular because they increase the programmer's productivity. However, the absence of types in the source code makes the program written in these languages difficult to understand and virtual machines that execute these programs cannot produced optimized code. To overcome this challenge, we develop a technique to predict types of all identifiers including variables, and function return types. We propose the first implementation of 2nd2^{nd} order Inside Outside Recursive Neural Networks with two variants (i) Child-Sum Tree-LSTMs and (ii) N-ary RNNs that can handle large number of tree branching. We predict the types of all the identifiers given the Abstract Syntax Tree by performing just two passes over the tree, bottom-up and top-down, keeping both the content and context representation for all the nodes of the tree. This allows these representations to interact by combining different paths from the parent, siblings and children which is crucial for predicting types. Our best model achieves 44.33\% across 21 classes and top-3 accuracy of 71.5\% on our gathered Python data set from popular Python benchmarks

    Fast and Memory-Efficient Neural Code Completion

    Full text link
    Code completion is one of the most widely used features of modern integrated development environments (IDEs). While deep learning has made significant progress in the statistical prediction of source code, state-of-the-art neural network models consume hundreds of megabytes of memory, bloating the development environment. We address this in two steps: first we present a modular neural framework for code completion. This allows us to explore the design space and evaluate different techniques. Second, within this framework we design a novel reranking neural completion model that combines static analysis with granular token encodings. The best neural reranking model consumes just 6 MB of RAM, - 19x less than previous models - computes a single completion in 8 ms, and achieves 90% accuracy in its top five suggestions

    MontiCore: a Framework for Compositional Development of Domain Specific Languages

    Full text link
    Domain specific languages (DSLs) are increasingly used today. Coping with complex language definitions, evolving them in a structured way, and ensuring their error freeness are the main challenges of DSL design and implementation. The use of modular language definitions and composition operators are therefore inevitable in the independent development of language components. In this article, we discuss these arising issues by describing a framework for the compositional development of textual DSLs and their supporting tools. We use a redundance-free definition of a readable concrete syntax and a comprehensible abstract syntax as both representations significantly overlap in their structure. For enhancing the usability of the abstract syntax, we added concepts like associations and inheritance to a grammar- based definition in order to build up arbitrary graphs (as known from metamodeling). Two modularity concepts, grammar inheritance and embedding, are discussed. They permit compositional language definition and thus simplify the extension of languages based on already existing ones. We demonstrate that compositional engineering of new languages is a useful concept when project-individual DSLs with appropriate tool support are defined.Comment: 20 pages, 6 figure
    • …
    corecore