A Neural Model for Generating Natural Language Summaries of Program Subroutines
Source code summarization -- creating natural language descriptions of source
code behavior -- is a rapidly-growing research topic with applications to
automatic documentation generation, program comprehension, and software
maintenance. Traditional techniques relied on heuristics and templates built
manually by human experts. Recently, data-driven approaches based on neural
machine translation have largely overtaken template-based systems. But nearly
all of these techniques rely almost entirely on programs having good internal
documentation; without clear identifier names, the models fail to create good
summaries. In this paper, we present a neural model that combines words from
code with code structure from an AST. Unlike previous approaches, our model
processes each data source as a separate input, which allows the model to learn
code structure independent of the text in code. This process helps our approach
provide coherent summaries in many cases even when zero internal documentation
is provided. We evaluate our technique with a dataset we created from 2.1m Java
methods. We find improvement over two baseline techniques from SE literature
and one from NLP literature
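The separate-input idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the mean-pooling encoder stands in for the actual neural encoders, and all names and dimensions are hypothetical.

```python
import random

def make_embeddings(vocab_size, dim, seed):
    """Random embedding table -- a stand-in for learned embeddings."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

def encode(token_ids, table):
    """Mean-pool embeddings -- a stand-in for an RNN/attention encoder."""
    dim = len(table[0])
    return [sum(table[t][d] for t in token_ids) / len(token_ids) for d in range(dim)]

def dual_encode(code_tokens, ast_nodes, code_table, ast_table):
    """Encode code text and AST structure as separate inputs, then
    concatenate, so structure can be learned independently of the text."""
    return encode(code_tokens, code_table) + encode(ast_nodes, ast_table)
```

In the paper's setting, one input would carry words from the code text and the other a representation of the AST; keeping the two inputs separate is what lets the model learn code structure independent of identifier names.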
Automatically Extracting Subroutine Summary Descriptions from Unstructured Comments
Summary descriptions of subroutines are short (usually one-sentence) natural
language explanations of a subroutine's behavior and purpose in a program.
These summaries are ubiquitous in documentation, and many tools such as
JavaDocs and Doxygen generate documentation built around them. And yet,
extracting summaries from unstructured source code repositories remains a
difficult research problem -- it is very difficult to generate clean structured
documentation unless the summaries are annotated by programmers. This becomes a
problem in large repositories of legacy code, since it is cost prohibitive to
retroactively annotate summaries in dozens or hundreds of old programs.
Likewise, it is a problem for creators of automatic documentation generation
algorithms, since these algorithms usually must learn from large annotated
datasets, which do not exist for many programming languages. In this paper, we
present a semi-automated approach via crowdsourcing and a fully-automated
approach for annotating summaries from unstructured code comments. We present
experiments validating the approaches, and provide recommendations and cost
estimates for automatically annotating large repositories.
Comment: 10 pages, plus references. Accepted for publication in the 27th IEEE
International Conference on Software Analysis, Evolution and Reengineering,
London, Ontario, Canada, February 18-21, 202
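A common baseline consistent with the goal above is taking the first sentence of a subroutine's header comment as its summary. The sketch below is an illustrative heuristic only, not the paper's crowdsourced or automated annotation approach:

```python
import re

def extract_summary(comment):
    """Heuristic sketch: strip comment markers, take the first sentence."""
    # remove common comment markers at line starts (not a full parser)
    text = re.sub(r'^\s*(/\*+|\*+/|\*|//|#)\s?', '', comment, flags=re.MULTILINE)
    # drop a trailing block-comment close, collapse whitespace
    text = re.sub(r'\*+/\s*$', '', text)
    text = ' '.join(text.split())
    # the first sentence serves as the summary description
    m = re.match(r'(.+?[.!?])(\s|$)', text)
    return m.group(1) if m else text
```

Heuristics like this break down exactly where the paper says the problem is hard: unstructured comments in legacy repositories often lack a clean one-sentence summary at all.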
Revisiting File Context for Source Code Summarization
Source code summarization is the task of writing natural language
descriptions of source code. A typical use case is generating short summaries
of subroutines for use in API documentation. The heart of almost all current
research into code summarization is the encoder-decoder neural architecture,
and the encoder input is almost always a single subroutine or other short code
snippet. The problem with this setup is that the information needed to describe
the code is often not present in the code itself -- that information often
resides in other nearby code. In this paper, we revisit the idea of "file
context" for code summarization. File context is the idea of encoding select
information from other subroutines in the same file. We propose a novel
modification of the Transformer architecture that is purpose-built to encode
file context and demonstrate its improvement over several baselines. We find
that file context helps on a subset of challenging examples where traditional
approaches struggle.
Comment: 27 pages + references. Under peer review
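At the data level, the idea above amounts to attaching select information from sibling subroutines to the encoder input. The sketch below is a plain data-preparation illustration; the token layout, the `<ctx>` separator, and the first-N selection policy are all assumptions, and the paper's actual contribution is a modified Transformer architecture, not this preprocessing:

```python
def build_encoder_input(target, file_functions, max_context=3):
    """Sketch: append signatures of other subroutines in the same file
    to the target subroutine's token sequence.
    The first-N selection policy is an illustrative assumption."""
    siblings = [f["signature"] for f in file_functions
                if f["name"] != target["name"]]
    context_tokens = []
    for sig in siblings[:max_context]:
        context_tokens += sig.split()
    return target["tokens"] + ["<ctx>"] + context_tokens
```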
Label Smoothing Improves Neural Source Code Summarization
Label smoothing is a regularization technique for neural networks. Normally
neural models are trained to an output distribution that is a vector with a
single 1 for the correct prediction, and 0 for all other elements. Label
smoothing converts the correct prediction location to something slightly less
than 1, then distributes the remainder to the other elements such that they are
slightly greater than 0. A conceptual explanation behind label smoothing is
that it helps prevent a neural model from becoming "overconfident" by forcing
it to consider alternatives, even if only slightly. Label smoothing has been
shown to help several areas of language generation, yet typically requires
considerable tuning and testing to achieve optimal results. This tuning and
testing has not been reported for neural source code summarization -- a growing
research area in software engineering that seeks to generate natural language
descriptions of source code behavior. In this paper, we demonstrate the effect
of label smoothing on several baselines in neural code summarization, and
conduct an experiment to find good parameters for label smoothing and make
recommendations for its use.
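The smoothing rule described above can be written directly. The epsilon value below is a commonly used default, not the tuned parameter the paper searches for:

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Label smoothing: scale the target distribution by (1 - epsilon),
    then spread epsilon uniformly over all K classes, so the correct
    class drops slightly below 1 and every other class rises slightly
    above 0 while the mass still sums to 1."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]
```

With epsilon = 0.1 and four classes, the target [0, 1, 0, 0] becomes [0.025, 0.925, 0.025, 0.025].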
Semantic Similarity Loss for Neural Source Code Summarization
This paper presents an improved loss function for neural source code
summarization. Code summarization is the task of writing natural language
descriptions of source code. Neural code summarization refers to automated
techniques for generating these descriptions using neural networks. Almost all
current approaches involve neural networks, either as standalone models or as
part of pretrained large language models, e.g., GPT, Codex, LLaMA. Yet almost
all also use a categorical cross-entropy (CCE) loss function for network
optimization. Two problems with CCE are that 1) it computes loss over each word
prediction one at a time, rather than evaluating the whole sentence, and 2) it
requires a perfect prediction, leaving no room for partial credit for synonyms.
We propose and evaluate a loss function to alleviate these problems. In essence,
we propose to use a semantic similarity metric to calculate loss over the whole
output sentence prediction per training batch, rather than just loss for each
word. We also propose to combine our loss with traditional CCE for each word,
which streamlines the training process compared to baselines. We evaluate our
approach over several baselines and report an improvement in the vast majority
of conditions.
Comment: 20 pages + 8 figures + 5 references. Preprint, in review, Aug. 202
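The combined loss described above can be sketched as a weighted sum of per-word CCE and a sentence-level term. The `similarity` argument stands in for whichever semantic similarity metric is chosen (an assumption; the paper evaluates specific metrics), and `alpha` is an illustrative balance weight:

```python
import math

def combined_loss(word_probs, target_ids, similarity, alpha=0.5):
    """Sketch: mix per-word categorical cross-entropy with a
    sentence-level semantic term. `similarity` is in [0, 1] between the
    predicted and reference sentences; low similarity means high loss."""
    # standard CCE: negative log-probability of each correct word, averaged
    cce = -sum(math.log(p[t])
               for p, t in zip(word_probs, target_ids)) / len(target_ids)
    sentence_loss = 1.0 - similarity
    return alpha * cce + (1.0 - alpha) * sentence_loss
```

The sentence-level term is what gives partial credit: a prediction that uses a synonym scores poorly under CCE but can still earn a high similarity, lowering the combined loss.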