
    A parallel corpus of Python functions and documentation strings for automated code documentation and code generation

    Automated documentation of programming source code and automated code generation from natural language are challenging tasks of both practical and scientific interest. Progress in these areas has been limited by the low availability of parallel corpora of code and natural language descriptions, which tend to be small and constrained to specific domains. In this work we introduce a large and diverse parallel corpus of a hundred thousand Python functions with their documentation strings ("docstrings"), generated by scraping open source repositories on GitHub. We describe baseline results for the code documentation and code generation tasks obtained by neural machine translation. We also experiment with data augmentation techniques to further increase the amount of training data. We release our datasets and processing scripts in order to stimulate research in these areas.
    Comment: 5 pages, 1 figure, 3 tables
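
    As a rough illustration of how such a corpus can be mined, the sketch below walks a Python source file with the standard `ast` module and pairs each documented function with its docstring. The sample input and the decision to keep only documented functions are illustrative assumptions, not the authors' actual pipeline.

```python
import ast

def extract_pairs(source: str):
    """Yield (name, code, docstring) triples for documented functions."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # keep only functions that carry a docstring
                code = ast.unparse(node)  # requires Python 3.9+
                yield node.name, code, doc

if __name__ == "__main__":
    sample = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b
'''
    for name, code, doc in extract_pairs(sample):
        print(name, "->", doc)
```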

    Learning Semantic Correspondences in Technical Documentation

    We consider the problem of translating high-level textual descriptions to formal representations in technical documentation as part of an effort to model the meaning of such documentation. We focus specifically on the problem of learning translational correspondences between text descriptions and grounded representations in the target documentation, such as formal representations of functions or code templates. Our approach exploits the parallel nature of such documentation, or the tight coupling between high-level text and the low-level representations we aim to learn. Data is collected by mining technical documents for such parallel text-representation pairs, which we use to train a simple semantic parsing model. We report new baseline results on sixteen novel datasets, including the standard library documentation for nine popular programming languages across seven natural languages, and a small collection of Unix utility manuals.
    Comment: accepted to ACL-2017
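
    The kind of parallel data being described can be approximated from a live interpreter. The hedged sketch below uses Python's `inspect` module to pair each standard-library function's signature (the grounded representation) with the first line of its docstring (the high-level description); it mirrors only the data format, not the paper's mining pipeline.

```python
import inspect
import json

def mine_pairs(module):
    """Pair each function's one-line description with its signature."""
    pairs = []
    for name, obj in inspect.getmembers(module, inspect.isfunction):
        doc = inspect.getdoc(obj)
        if not doc:
            continue
        description = doc.splitlines()[0]            # high-level text
        try:
            sig = f"{name}{inspect.signature(obj)}"  # grounded representation
        except ValueError:                           # some callables lack signatures
            continue
        pairs.append({"text": description, "repr": sig})
    return pairs

if __name__ == "__main__":
    import shutil
    print(json.dumps(mine_pairs(shutil)[:3], indent=2))
```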

    Polyglot Semantic Parsing in APIs

    Traditional approaches to semantic parsing (SP) work by training individual models for each available parallel dataset of text-meaning pairs. In this paper, we explore the idea of polyglot semantic translation, or learning semantic parsing models that are trained on multiple datasets and natural languages. In particular, we focus on translating text to code signature representations using the software component datasets of Richardson and Kuhn (2017a,b). The advantage of such models is that they can be used for parsing a wide variety of input natural languages and output programming languages, or mixed input languages, using a single unified model. To facilitate modeling of this type, we develop a novel graph-based decoding framework that achieves state-of-the-art performance on the above datasets, and apply this method to two other benchmark SP tasks.
    Comment: accepted for NAACL-2018 (camera-ready version)
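
    The graph-based decoding idea can be illustrated in miniature: represent candidate output tokens as nodes in a directed lattice whose edges carry negative log-probability costs, and recover the best-scoring signature as a lowest-cost path. The lattice and scores below are invented toy values; the paper's actual framework is considerably more involved.

```python
import heapq
import math

def decode(lattice, start, goal):
    """Best path (lowest total -log p) through a token lattice via Dijkstra."""
    frontier = [(0.0, start, [start])]
    seen = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        if node in seen:
            continue
        seen.add(node)
        for nxt, edge_cost in lattice.get(node, []):
            heapq.heappush(frontier, (cost + edge_cost, nxt, path + [nxt]))
    return None, math.inf

# Toy lattice over candidate signature tokens; edge costs are invented -log p values.
lattice = {
    "<s>":     [("max", 0.2), ("min", 1.6)],
    "max":     [("(arg0", 0.1)],
    "min":     [("(arg0", 0.1)],
    "(arg0":   [(", arg1)", 0.3), (")", 1.2)],
    ", arg1)": [("</s>", 0.0)],
    ")":       [("</s>", 0.0)],
}
path, cost = decode(lattice, "<s>", "</s>")
print(" ".join(path), f"(score={cost:.2f})")
```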

    CodeExp: Explanatory Code Document Generation

    Developing models that can automatically generate detailed code explanations can greatly benefit software maintenance and programming education. However, existing code-to-text generation models often produce only high-level summaries of code that do not capture implementation-level choices essential for these scenarios. To fill this gap, we propose the code explanation generation task. We first conducted a human study to identify the criteria for high-quality explanatory docstrings for code. Based on that, we collected and refined a large-scale code docstring corpus and formulated automatic evaluation metrics that best match human assessments. Finally, we present a multi-stage fine-tuning strategy and baseline models for the task. Our experiments show that (1) our refined training dataset lets models achieve better performance in the explanation generation tasks compared to larger unrefined data (15x larger), and (2) fine-tuned models can generate well-structured long docstrings comparable to human-written ones. We envision that our training dataset, human-evaluation protocol, recommended metrics, and fine-tuning strategy will boost future code explanation research. The code and annotated data are available at https://github.com/subercui/CodeExp.
    Comment: Accepted in Findings of EMNLP 2022
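
    For a sense of what a code-to-docstring baseline looks like in practice, the sketch below generates a docstring for a small function with an off-the-shelf sequence-to-sequence checkpoint via Hugging Face `transformers`. The checkpoint name and generation parameters are assumptions for illustration, not the models or fine-tuning recipe from the paper.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed checkpoint: a public CodeT5 model fine-tuned for code summarization.
MODEL = "Salesforce/codet5-base-multi-sum"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```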