Summary descriptions of subroutines are short (usually one-sentence) natural
language explanations of a subroutine's behavior and purpose in a program.
These summaries are ubiquitous in documentation, and many tools such as
Javadoc and Doxygen generate documentation built around them. And yet,
extracting summaries from unstructured source code repositories remains a
difficult research problem -- generating clean, structured documentation is
hard unless the summaries have been annotated by programmers. This becomes a
problem in large repositories of legacy code, since it is cost prohibitive to
retroactively annotate summaries in dozens or hundreds of old programs.
Likewise, it is a problem for creators of automatic documentation generation
algorithms, since these algorithms usually must learn from large annotated
datasets, which do not exist for many programming languages. In this paper, we
present a semi-automated approach via crowdsourcing and a fully-automated
approach for annotating summaries from unstructured code comments. We present
experiments validating the approaches, and provide recommendations and cost
estimates for automatically annotating large repositories.

Comment: 10 pages, plus references. Accepted for publication in the 27th IEEE
International Conference on Software Analysis, Evolution and Reengineering,
London, Ontario, Canada, February 18-21, 2020