Are your comments outdated? Towards automatically detecting code-comment consistency
In software development and maintenance, code comments help developers understand source code and improve communication among developers. However, developers sometimes neglect to update a comment when changing the corresponding code, resulting in outdated comments (i.e., comments that are inconsistent with the code). Outdated comments are harmful because they may mislead subsequent developers and can even lead to fatal flaws later on. To automatically identify outdated comments in source code, we propose a learning-based method, called CoCC, that detects the consistency between code and comments. To identify outdated comments efficiently, we extract multiple features from both the code and the comment before and after a change, and we also model the relation between code and comment. Experimental results show that CoCC can effectively detect outdated comments with precision above 90%. In addition, we identify the 15 most important factors that cause outdated comments and verify the applicability of CoCC across different programming languages. We also use CoCC to find outdated comments in the latest commits of open-source projects, which further demonstrates the effectiveness of the proposed method.
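To make the idea of a learning-based consistency detector concrete, the following is a minimal sketch, not the authors' CoCC implementation: the feature set, the random-forest classifier, and the toy examples are illustrative assumptions. It only shows the general pattern of turning a (old code, new code, comment) triple into change/relation features and feeding them to a classifier.

```python
# Hypothetical sketch of a feature-based outdated-comment detector.
# NOT the authors' CoCC tool; features, classifier, and data are assumptions.
import re
from sklearn.ensemble import RandomForestClassifier

def tokens(text):
    """Lower-cased word tokens, splitting camelCase identifiers."""
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    return set(re.findall(r"[a-zA-Z]+", text.lower()))

def features(old_code, new_code, comment):
    """Simple change/relation features between a code edit and its comment."""
    old_t, new_t, com_t = tokens(old_code), tokens(new_code), tokens(comment)
    changed = old_t ^ new_t                       # tokens added or removed by the edit
    return [
        len(changed),                             # size of the code change
        len(changed & com_t),                     # changed tokens mentioned in the comment
        len(new_t & com_t) / (len(com_t) or 1),   # comment / new-code overlap
        len(old_t & com_t) / (len(com_t) or 1),   # comment / old-code overlap
    ]

# Toy training data: label 1 = the comment became outdated after the change.
X = [
    features("def area(r): return 3.14*r*r", "def area(r): return math.pi*r*r",
             "compute circle area"),              # still consistent
    features("def area(r): return 3.14*r*r", "def area(w, h): return w*h",
             "compute circle area"),              # now outdated
]
y = [0, 1]

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(clf.predict([features("x = a + b", "x = a - b", "add a and b")]))
```

In practice a real detector would use far richer features (AST changes, edit history, comment semantics) and a large labeled corpus; the sketch only illustrates the overall pipeline.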
Are We Building on the Rock? On the Importance of Data Preprocessing for Code Summarization
Code summarization, the task of generating useful comments for a given piece of code, has long been of interest. Most existing code summarization models are trained and validated on widely used code-comment benchmark datasets. However, little is known about the quality of the benchmark datasets built from real-world projects. Are the benchmark datasets as good as expected? To bridge this gap, we conduct a systematic study to assess and improve the quality of four benchmark datasets widely used for code summarization. First, we propose an automated code-comment cleaning tool that accurately detects noisy data caused by inappropriate preprocessing operations in existing benchmark datasets. Then, we apply the tool to assess the data quality of the four benchmark datasets based on the detected noise. Finally, we conduct comparative experiments to investigate the impact of noisy data on the performance of code summarization models. The results show that such preprocessing noise is widespread in all four benchmark datasets, and that removing the noisy data leads to a significant improvement in code summarization performance. We believe these findings and insights will enable a better understanding of data quality in code summarization and pave the way for related research and practice.
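As a rough illustration of what a code-comment cleaning step can look like, here is a small rule-based filter. It is a hypothetical sketch, not the paper's actual tool: the rule names and patterns below are assumptions chosen only to show how noisy pairs (auto-generated stubs, commented-out code, near-empty comments) might be flagged before training.

```python
# Hypothetical, simplified noise filter for code-comment pairs.
# The rules are illustrative assumptions, not the authors' cleaning tool.
import re

NOISE_RULES = {
    "too_short": lambda code, com: len(com.split()) < 3,
    "auto_generated": lambda code, com: bool(
        re.search(r"auto-?generated|created by", com, re.I)),
    "commented_out_code": lambda code, com: bool(
        re.search(r"[;{}=]|\breturn\b|\bif\s*\(", com)),
    "verbatim_duplicate": lambda code, com: com.strip() in code,
}

def detect_noise(code, comment):
    """Return the names of all rules the pair violates (empty list = keep it)."""
    return [name for name, rule in NOISE_RULES.items() if rule(code, comment)]

pairs = [
    ("int add(int a,int b){return a+b;}", "Adds two integers and returns the sum."),
    ("int add(int a,int b){return a+b;}", "TODO Auto-generated method stub"),
]
for code, comment in pairs:
    print(comment, "->", detect_noise(code, comment) or "clean")
```

Filtering a benchmark with rules of this kind, then retraining and re-evaluating a summarization model on the cleaned split, is the general pattern behind the comparative experiments the abstract describes.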