24,704 research outputs found

    Code Prediction by Feeding Trees to Transformers

    Full text link
    We advance the state-of-the-art in the accuracy of code prediction (next token prediction) used in autocomplete systems. First, we report that using the recently proposed Transformer architecture even out-of-the-box outperforms previous neural and non-neural systems for code prediction. We then show that by making the Transformer architecture aware of the syntactic structure of code, we further increase the margin by which a Transformer-based system outperforms previous systems. With this, it outperforms the accuracy of an RNN-based system (similar to Hellendoorn et al. 2018) by 18.3\%, the Deep3 system (Raychev et al 2016) by 14.1\%, and an adaptation of Code2Seq (Alon et al., 2018) for code prediction by 14.4\%. We present in the paper several ways of communicating the code structure to the Transformer, which is fundamentally built for processing sequence data. We provide a comprehensive experimental evaluation of our proposal, along with alternative design choices, on a standard Python dataset, as well as on a Facebook internal Python corpus. Our code and data preparation pipeline will be available in open source

    Integrating Weakly Supervised Word Sense Disambiguation into Neural Machine Translation

    Full text link
    This paper demonstrates that word sense disambiguation (WSD) can improve neural machine translation (NMT) by widening the source context considered when modeling the senses of potentially ambiguous words. We first introduce three adaptive clustering algorithms for WSD, based on k-means, Chinese restaurant processes, and random walks, which are then applied to large word contexts represented in a low-rank space and evaluated on SemEval shared-task data. We then learn word vectors jointly with sense vectors defined by our best WSD method, within a state-of-the-art NMT system. We show that the concatenation of these vectors, and the use of a sense selection mechanism based on the weighted average of sense vectors, outperforms several baselines including sense-aware ones. This is demonstrated by translation on five language pairs. The improvements are above one BLEU point over strong NMT baselines, +4% accuracy over all ambiguous nouns and verbs, or +20% when scored manually over several challenging words.Comment: To appear in TAC

    On the naturalness of software

    Get PDF
    Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension. We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations---and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether (a) code can be usefully modeled by statistical language models and (b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very regular, and, in fact, even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse's completion capability. We conclude the paper by laying out a vision for future research in this area
    • …
    corecore