Source-code Summarization of Java Methods Using Control-Flow Graphs

Abstract

Source-code summarization aims to generate natural-language summaries for software artifacts (e.g., methods and classes). Researchers have been exploring source-code summarization as a research area in software engineering. Prior work has applied text-retrieval-based, heuristic-based, and data-driven techniques to source-code summarization. In data-driven techniques, researchers have used a sequence of source-code tokens and other representations of source code (e.g., application programming interface (API) sequences and abstract syntax trees (ASTs)) as input to source-code summarization models. According to the currently published literature on source-code summarization, researchers have not explored feeding summarization models a sequence extracted from a control-flow graph, which captures the contextual relationships between program instructions based on control flow. In this work, we employ control-flow graph representations to increase the prediction accuracy of a bi-directional long short-term memory (LSTM) source-code summarization model in describing the functionality of Java methods. We use an attention-based bi-directional LSTM sequence-to-sequence model that takes linearized control-flow graph sequences alongside a sequence of source-code tokens. We compared our model, with and without a linearized control-flow graph, against the current state of the art. We created a source-code summarization dataset to train and evaluate our approach and conducted expert and automatic evaluations. In the expert evaluation, the participants rated the summaries generated by each model on how correctly they describe the functionality of a Java method. Our models outperformed the state of the art in terms of the mean average rating. The expert evaluation also showed that the model benefits from the structural information.
In the automatic evaluation, we found that using control-flow graphs does not increase the prediction accuracy of a bi-directional LSTM model in terms of BLEU score compared to a bi-directional LSTM model that does not use control-flow graphs. However, we found that our source-code summarization approach, which uses a control-flow graph as an additional representation, performs better than encoding the AST with graph neural networks. Overall, we improved the state of the art for method summarization with our models, which take a sequence of method tokens with and without a control-flow graph.
