Towards Automatic Generation of Short Summaries of Commits
Committing to a version control system means submitting a software change to
the system. Each commit can have a message to describe the submission. Several
approaches have been proposed to automatically generate the content of such
messages. However, the quality of the automatically generated messages falls
far short of what humans write. In studying the differences between
auto-generated and human-written messages, we found that 82% of the
human-written messages have only one sentence, while the automatically
generated messages often have multiple lines. Furthermore, we found that the
commit messages often begin with a verb followed by a direct object. This
finding inspired us to use a "verb+object" format in this paper to generate
short commit summaries. We split the approach into two parts: verb generation
and object generation. As a first step, we trained a classifier that maps a
diff to a verb. We are seeking feedback from the community before we continue
to work on generating direct objects for the commits.
Comment: 4 pages, accepted in ICPC 2017 ERA Track
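As a rough illustration of the verb-generation step, the sketch below treats the verb as a class label and the diff text as bag-of-words input. The toy diffs, verb vocabulary, and scikit-learn pipeline are assumptions for illustration, not the paper's actual classifier.

```python
# A minimal sketch, assuming verb prediction is text classification over
# diff contents. Toy data and model choice are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# (diff text, verb that begins the human-written commit message)
train_diffs = [
    "+ if (user == null) return;",   # guards against a crash
    "- int unused = compute();",     # deletes dead code
    "+ import java.util.List;",      # introduces a dependency
]
train_verbs = ["fix", "remove", "add"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_diffs, train_verbs)
print(clf.predict(["+ if (items == null) return;"]))  # likely ['fix']
```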
A Neural Model for Generating Natural Language Summaries of Program Subroutines
Source code summarization -- creating natural language descriptions of source
code behavior -- is a rapidly-growing research topic with applications to
automatic documentation generation, program comprehension, and software
maintenance. Traditional techniques relied on heuristics and templates built
manually by human experts. Recently, data-driven approaches based on neural
machine translation have largely overtaken template-based systems. But nearly
all of these techniques rely almost entirely on programs having good internal
documentation; without clear identifier names, the models fail to create good
summaries. In this paper, we present a neural model that combines words from
code with code structure from an AST. Unlike previous approaches, our model
processes each data source as a separate input, which allows the model to learn
code structure independent of the text in code. This process helps our approach
provide coherent summaries in many cases even when zero internal documentation
is provided. We evaluate our technique with a dataset we created from 2.1m Java
methods. We find improvement over two baseline techniques from SE literature
and one from the NLP literature.
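The central design decision, processing code text and AST structure as separate inputs, can be sketched as a small dual-encoder. The layer types, sizes, and fusion step below are illustrative assumptions, not the paper's exact architecture.

```python
# A hedged sketch: one encoder for code tokens, a separate encoder for a
# flattened AST sequence, fused before decoding, so structure can be
# learned independently of the text in code.
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, code_vocab, ast_vocab, dim=256):
        super().__init__()
        self.code_emb = nn.Embedding(code_vocab, dim)
        self.ast_emb = nn.Embedding(ast_vocab, dim)
        self.code_rnn = nn.GRU(dim, dim, batch_first=True)
        self.ast_rnn = nn.GRU(dim, dim, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, code_ids, ast_ids):
        # Encode each data source with its own network, then combine.
        _, code_h = self.code_rnn(self.code_emb(code_ids))
        _, ast_h = self.ast_rnn(self.ast_emb(ast_ids))
        return torch.tanh(self.fuse(torch.cat([code_h[-1], ast_h[-1]], dim=-1)))
```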
Searching, Selecting, and Synthesizing Source Code Components
As programmers develop software, they instinctively sense that source code exists that could be reused if found: many programming tasks are common to many software projects across different domains. Oftentimes, a programmer will attempt to create new software from this existing source code, such as third-party libraries or code from online repositories. Unfortunately, several major challenges make it difficult to locate the relevant source code and to reuse it. First, there is a fundamental mismatch between the high-level intent reflected in the descriptions of source code and the low-level implementation details. This mismatch is known as the concept assignment problem, and refers to the frequent case when the keywords from comments or identifiers in code do not match the features implemented in the code. Second, even if relevant source code is found, programmers must invest significant intellectual effort into understanding how to reuse the different functions, classes, or other components present in the source code. These components may be specific to a particular application and difficult to reuse.

One key source of information that programmers use to understand source code is the set of relationships among the source code components. These relationships are typically structural data, such as function calls or class instantiations. This structural data has been repeatedly suggested as an alternative to textual analysis for search and reuse; however, as yet no comprehensive strategy exists for locating relevant and reusable source code. In my research program, I harness this structural data in a unified approach to creating and evolving software from existing components. For locating relevant source code, I present a search engine for finding applications based on the underlying Application Programming Interface (API) calls, and a technique for finding chains of relevant function invocations from repositories of millions of lines of code. Next, for reusing source code, I introduce a system to facilitate building software prototypes from existing packages, and an approach to detecting similar software applications.
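As a hedged sketch of the structural-search idea, the snippet below indexes projects by the API calls they make and ranks candidates by overlap with the query's calls; the toy corpus and overlap scoring are assumptions for illustration, not the search engine from this work.

```python
# A minimal sketch of search over structural data: an inverted index from
# API calls to projects, ranked by how many query calls each project shares.
from collections import defaultdict

corpus = {
    "image-tool": {"File.open", "Image.resize", "Image.save"},
    "web-scraper": {"requests.get", "BeautifulSoup", "File.open"},
}

index = defaultdict(set)
for project, calls in corpus.items():
    for call in calls:
        index[call].add(project)

def search(query_calls):
    # Score each candidate by the number of shared API calls.
    scores = defaultdict(int)
    for call in query_calls:
        for project in index.get(call, ()):
            scores[project] += 1
    return sorted(scores, key=scores.get, reverse=True)

print(search({"Image.resize", "File.open"}))  # ['image-tool', 'web-scraper']
```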
Detecting Important Terms in Source Code for Program Comprehension
Software Engineering research has become extremely dependent on terms (words in textual data) extracted from source code. Different techniques have been proposed to extract the "most important" terms from code. These terms are typically used as input to research prototypes: the quality of the output of these prototypes will depend on the quality of the term extraction technique. At present, no consensus exists about which technique predicts the best terms for code comprehension. We perform a literature review, and propose a unified prediction model based on a Naive Bayes algorithm. We evaluate our model in a field study with professional programmers, as well as a standard 10-fold synthetic study. We found our model predicts the top quartile of the most-important terms with approximately 50% precision and recall, outperforming other popular techniques. We found the predictions from our model to help programmers to the same degree as the gold set.
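A minimal sketch of the prediction-model idea, assuming Naive Bayes over simple per-term features such as where a term appears and how often; the feature set and toy labels below are illustrative, not the model from the study.

```python
# An illustrative sketch: score each term's importance with Naive Bayes.
from sklearn.naive_bayes import GaussianNB

# Features per term: [appears_in_method_name, appears_in_parameter, tf_in_body]
X = [[1, 0, 4], [0, 1, 1], [0, 0, 7], [1, 1, 2]]
y = [1, 0, 0, 1]  # 1 = programmers rated the term important

model = GaussianNB().fit(X, y)
print(model.predict_proba([[1, 0, 3]])[:, 1])  # probability a term is important
```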
Distilled GPT for Source Code Summarization
A code summary is a brief natural language description of source code.
Summaries are usually only a single sentence long, and yet form the backbone of
developer documentation. A short description such as "changes all visible
polygons to the color blue" can give a programmer a high-level idea of what
code does without the effort of reading the code itself. Recently, products
based on Large Language Models such as ChatGPT have demonstrated a strong
ability to write these descriptions automatically. However, to use these tools,
programmers must send their code to untrusted third parties for processing
(e.g., via an API call). This loss of custody is not acceptable to many
organizations. In this paper, we present an alternative: we train an open
source model using sample output generated by GPT-3.5 in a process related to
knowledge distillation. Our model is small enough (350M parameters) to be run
on a single 16GB GPU, yet we show in our evaluation that it is large enough to
mimic GPT-3.5 on this task.
Comment: 19 pages + 6 figures. Accepted to the Automated Software Engineering Journal
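A hedged sketch of this distillation-style setup: fine-tune a small open model on summaries produced by the larger teacher. The stand-in model name (gpt2), prompt format, and single training pair below are assumptions for illustration, not the paper's exact configuration.

```python
# A minimal sketch, assuming (code, teacher summary) pairs and a standard
# causal language-modeling loss on the concatenated prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # stand-in small model
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)

code = "public int max(int a, int b) { return a > b ? a : b; }"
teacher_summary = "returns the larger of two integers"  # sample teacher output

batch = tok(f"CODE: {code}\nSUMMARY: {teacher_summary}", return_tensors="pt")
opt.zero_grad()
loss = model(**batch, labels=batch["input_ids"]).loss   # standard LM loss
loss.backward()
opt.step()
```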
Semantic Similarity Loss for Neural Source Code Summarization
This paper presents an improved loss function for neural source code
summarization. Code summarization is the task of writing natural language
descriptions of source code. Neural code summarization refers to automated
techniques for generating these descriptions using neural networks. Almost all
current approaches involve neural networks as either standalone models or as
part of pretrained large language models, e.g., GPT, Codex, LLaMA. Yet almost
all also use a categorical cross-entropy (CCE) loss function for network
optimization. Two problems with CCE are that 1) it computes loss over each word
prediction one-at-a-time, rather than evaluating a whole sentence, and 2) it
requires a perfect prediction, leaving no room for partial credit for synonyms.
We propose and evaluate a loss function to alleviate these problems. In essence,
we propose to use a semantic similarity metric to calculate loss over the whole
output sentence prediction per training batch, rather than just loss for each
word. We also propose to combine our loss with traditional CCE for each word,
which streamlines the training process compared to baselines. We evaluate our
approach over several baselines and report an improvement in the vast majority
of conditions.
Comment: 20 pages + 8 figures + 5 references. Preprint, in review, Aug. 2023
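The combined loss can be sketched as per-word cross-entropy plus a sentence-level semantic term, here one minus the cosine similarity between mean word embeddings of the predicted and reference sentences. The embedding source and the weight lam are assumptions, not the paper's exact similarity metric.

```python
# A hedged sketch of combining CCE with a whole-sentence semantic term.
import torch
import torch.nn.functional as F

def combined_loss(logits, target_ids, emb, lam=0.5):
    # logits: (batch, seq, vocab); target_ids: (batch, seq)
    # emb: an nn.Embedding used to embed words for the semantic term.
    cce = F.cross_entropy(logits.transpose(1, 2), target_ids)

    # Expected embedding under the output distribution keeps the
    # sentence-level term differentiable.
    pred_emb = torch.softmax(logits, dim=-1) @ emb.weight   # (b, seq, dim)
    ref_emb = emb(target_ids)                               # (b, seq, dim)
    sim = F.cosine_similarity(pred_emb.mean(1), ref_emb.mean(1), dim=-1)
    return cce + lam * (1.0 - sim).mean()
```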
Automatically Extracting Subroutine Summary Descriptions from Unstructured Comments
Summary descriptions of subroutines are short (usually one-sentence) natural
language explanations of a subroutine's behavior and purpose in a program.
These summaries are ubiquitous in documentation, and many tools such as
JavaDocs and Doxygen generate documentation built around them. And yet,
extracting summaries from unstructured source code repositories remains a
difficult research problem -- it is very difficult to generate clean structured
documentation unless the summaries are annotated by programmers. This becomes a
problem in large repositories of legacy code, since it is cost prohibitive to
retroactively annotate summaries in dozens or hundreds of old programs.
Likewise, it is a problem for creators of automatic documentation generation
algorithms, since these algorithms usually must learn from large annotated
datasets, which do not exist for many programming languages. In this paper, we
present a semi-automated approach via crowdsourcing and a fully-automated
approach for annotating summaries from unstructured code comments. We present
experiments validating the approaches, and provide recommendations and cost
estimates for automatically annotating large repositories.
Comment: 10 pages, plus references. Accepted for publication in the 27th IEEE
International Conference on Software Analysis, Evolution and Reengineering,
London, Ontario, Canada, February 18-21, 2020
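As one illustration of the fully-automated direction, the sketch below pulls the leading sentence out of an unstructured comment as a candidate summary; these regex heuristics are illustrative assumptions, not the annotation approach from the paper.

```python
# A minimal sketch: strip comment markers, then keep the first sentence.
import re

def extract_summary(comment: str) -> str:
    # Remove /* */, leading *, and // markers, then collapse whitespace.
    text = re.sub(r"/\*+|\*+/|^\s*\*|//", " ", comment, flags=re.MULTILINE)
    text = " ".join(text.split())
    # Take everything up to the first sentence-ending period.
    match = re.match(r"(.+?\.)(\s|$)", text)
    return match.group(1) if match else text

comment = """/* Returns the user's display name. Falls back to the
              login id when no display name is set. */"""
print(extract_summary(comment))  # "Returns the user's display name."
```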