Correlating Automated and Human Evaluation of Code Documentation Generation Quality
Automatic code documentation generation is a crucial task in software engineering. It not only relieves developers of writing documentation but also helps them understand programs better. In particular, deep-learning-based techniques that leverage large-scale source code corpora have been widely used for code documentation generation. These works typically use automatic metrics (such as BLEU, METEOR, ROUGE, CIDEr, and SPICE) to evaluate and compare models: the metrics score generated documentation against reference texts by measuring overlapping words. Unfortunately, there is no evidence demonstrating that these metrics correlate with human judgment. We conduct experiments on two popular code documentation generation tasks, code comment generation and commit message generation, to investigate the presence or absence of correlations between these metrics and human judgment. For each task, we replicate three state-of-the-art approaches, and the generated documentation is evaluated automatically in terms of BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. We also ask 24 participants to rate the generated documentation on three aspects (i.e., language, content, and effectiveness). Each participant is given Java methods or commit diffs along with the documentation to be rated. The results show that the ranking of generated documentation produced by the automatic metrics differs from the ranking produced by human annotators. Thus, these automatic metrics are not reliable enough to replace human evaluation for code documentation generation tasks. Although METEOR shows the strongest correlation with the human ratings, that correlation is still much lower than the agreement observed between different annotators and the correlations reported in the literature for other tasks (e.g., neural machine translation). Our study points to the need for specialized automatic evaluation metrics that correlate more closely with human judgment for code documentation generation tasks.
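To make the evaluation setup concrete, below is a minimal Python sketch (not the authors' code) of the kind of analysis the abstract describes: overlap-based metrics score generated documentation against references, and a rank correlation then measures how well the metric agrees with human ratings. The (reference, generated, rating) triples here are invented placeholders, and only BLEU and METEOR are shown; the paper additionally uses ROUGE-L, CIDEr, and SPICE.

```python
# Sketch of metric-vs-human correlation analysis. Assumes nltk and scipy
# are installed; METEOR also requires the nltk wordnet data package.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from scipy.stats import spearmanr

# Hypothetical samples: (reference doc, generated doc, mean human rating).
# Ratings stand in for the 1-5 language/content/effectiveness judgments.
samples = [
    ("returns the maximum of two integers", "return max of two ints", 4.0),
    ("opens the file and reads all lines", "close the file handle", 1.5),
    ("computes the md5 hash of the input", "compute md5 hash of input string", 4.5),
]

smooth = SmoothingFunction().method1
bleu_scores, meteor_scores, human_scores = [], [], []
for ref, gen, human in samples:
    ref_tok, gen_tok = ref.split(), gen.split()
    # Overlap-based scores of the generated text against the reference.
    bleu_scores.append(sentence_bleu([ref_tok], gen_tok, smoothing_function=smooth))
    meteor_scores.append(meteor_score([ref_tok], gen_tok))
    human_scores.append(human)

# Rank correlation between each automatic metric and the human ratings:
# a low rho means the metric ranks outputs differently from annotators.
for name, scores in [("BLEU", bleu_scores), ("METEOR", meteor_scores)]:
    rho, p = spearmanr(scores, human_scores)
    print(f"{name}: Spearman rho={rho:.2f} (p={p:.2f})")
```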
Wed 17 May (displayed time zone: Hobart)
15:45 - 17:15 | Documentation (Technical Track / Journal-First Papers) at Level G - Plenary Room 1 | Chair(s): Denys Poshyvanyk (College of William and Mary)
15:45 | 15m Talk | Developer-Intent Driven Code Comment Generation (Technical Track) | Fangwen Mu (Institute of Software, Chinese Academy of Sciences), Xiao Chen (Institute of Software, Chinese Academy of Sciences), Lin Shi (ISCAS), Song Wang (York University), Qing Wang (Institute of Software at Chinese Academy of Sciences; University of Chinese Academy of Sciences) | Pre-print
16:00 | 15m Talk | Data Quality Matters: A Case Study of Obsolete Comment Detection (Technical Track) | Shengbin Xu (Nanjing University), Yuan Yao (Nanjing University), Feng Xu (Nanjing University), Tianxiao Gu (TikTok Inc.), Jingwei Xu, Xiaoxing Ma (Nanjing University) | Pre-print
16:15 | 15m Talk | Revisiting Learning-based Commit Message Generation (Technical Track) | Jinhao Dong (Peking University), Yiling Lou (Fudan University), Dan Hao (Peking University), Lin Tan (Purdue University) | Pre-print
16:30 | 15m Talk | Commit Message Matters: Investigating Impact and Evolution of Commit Message Quality (Technical Track)
16:45 | 7m Talk | On the Significance of Category Prediction for Code-Comment Synchronization (Journal-First Papers) | Zhen Yang (City University of Hong Kong, China), Jacky Keung (City University of Hong Kong), Xiao Yu (Wuhan University of Technology), Yan Xiao (National University of Singapore), Zhi Jin (Peking University), Jingyu Zhang (City University of Hong Kong)
16:52 | 7m Talk | Correlating Automated and Human Evaluation of Code Documentation Generation Quality (Journal-First Papers) | Xing Hu (Zhejiang University), Qiuyuan Chen (Zhejiang University), Haoye Wang (Hangzhou City University), Xin Xia (Huawei), David Lo (Singapore Management University), Thomas Zimmermann (Microsoft Research)
17:00 | 7m Talk | Predictive Comment Updating with Heuristics and AST-Path-Based Neural Learning: A Two-Phase Approach (Journal-First Papers) | Bo Lin (National University of Defense Technology), Shangwen Wang (National University of Defense Technology), Zhongxin Liu (Zhejiang University), Xin Xia (Huawei), Xiaoguang Mao (National University of Defense Technology) | Link to publication / DOI / Pre-print