Correlating Automated and Human Evaluation of Code Documentation Generation Quality
Automatic code documentation generation is a crucial task in software engineering. It not only relieves developers of writing documentation but also helps them understand programs better. In particular, deep-learning-based techniques that leverage large-scale source code corpora have been widely used for code documentation generation. These works typically use automatic metrics (such as BLEU, METEOR, ROUGE, CIDEr, and SPICE) to evaluate and compare models: the metrics score generated documentation against reference texts by measuring overlapping words. Unfortunately, there is no evidence demonstrating that these metrics correlate with human judgment. We conduct experiments on two popular code documentation generation tasks, code comment generation and commit message generation, to investigate the presence or absence of correlations between these metrics and human judgment. For each task, we replicate three state-of-the-art approaches, and the generated documentation is evaluated automatically in terms of BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. We also ask 24 participants to rate the generated documentation on three aspects (i.e., language, content, and effectiveness). Each participant is given Java methods or commit diffs along with the documentation to be rated. The results show that the ranking of generated documentation produced by the automatic metrics differs from the ranking produced by human annotators. Thus, these automatic metrics are not reliable enough to replace human evaluation for code documentation generation tasks. Although METEOR shows the strongest correlation with the human ratings, that correlation is still much lower than the agreement observed between different annotators and the correlations reported in the literature for other tasks (e.g., neural machine translation). Our study points to the need for specialized automatic evaluation metrics that correlate more closely with human judgment for code documentation generation tasks.
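To make the evaluation setup concrete, below is a minimal Python sketch (not the authors' code) of the kind of analysis the abstract describes: overlap-based metrics score generated documentation against references, and a rank correlation then measures how well the metric agrees with human ratings. The (reference, generated, rating) triples here are invented placeholders, and only BLEU and METEOR are shown; the paper additionally uses ROUGE-L, CIDEr, and SPICE.

```python
# Sketch of metric-vs-human correlation analysis. Assumes nltk and scipy
# are installed; METEOR also requires the nltk wordnet data package.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from scipy.stats import spearmanr

# Hypothetical samples: (reference doc, generated doc, mean human rating).
# Ratings stand in for the 1-5 language/content/effectiveness judgments.
samples = [
    ("returns the maximum of two integers", "return max of two ints", 4.0),
    ("opens the file and reads all lines", "close the file handle", 1.5),
    ("computes the md5 hash of the input", "compute md5 hash of input string", 4.5),
]

smooth = SmoothingFunction().method1
bleu_scores, meteor_scores, human_scores = [], [], []
for ref, gen, human in samples:
    ref_tok, gen_tok = ref.split(), gen.split()
    # Overlap-based scores of the generated text against the reference.
    bleu_scores.append(sentence_bleu([ref_tok], gen_tok, smoothing_function=smooth))
    meteor_scores.append(meteor_score([ref_tok], gen_tok))
    human_scores.append(human)

# Rank correlation between each automatic metric and the human ratings:
# a low rho means the metric ranks outputs differently from annotators.
for name, scores in [("BLEU", bleu_scores), ("METEOR", meteor_scores)]:
    rho, p = spearmanr(scores, human_scores)
    print(f"{name}: Spearman rho={rho:.2f} (p={p:.2f})")
```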
Wed 17 May (displayed time zone: Hobart)
15:45 - 17:15 | Documentation (Technical Track / Journal-First Papers) at Level G - Plenary Room 1 | Chair(s): Denys Poshyvanyk (College of William and Mary)
15:45 | 15m Talk | Developer-Intent Driven Code Comment Generation (Technical Track) | Fangwen Mu (Institute of Software, Chinese Academy of Sciences), Xiao Chen (Institute of Software, Chinese Academy of Sciences), Lin Shi (ISCAS), Song Wang (York University), Qing Wang (Institute of Software at Chinese Academy of Sciences; University of Chinese Academy of Sciences) | Pre-print
16:00 | 15m Talk | Data Quality Matters: A Case Study of Obsolete Comment Detection (Technical Track) | Shengbin Xu (Nanjing University), Yuan Yao (Nanjing University), Feng Xu (Nanjing University), Tianxiao Gu (TikTok Inc.), Jingwei Xu, Xiaoxing Ma (Nanjing University) | Pre-print
16:15 | 15m Talk | Revisiting Learning-based Commit Message Generation (Technical Track) | Jinhao Dong (Peking University), Yiling Lou (Fudan University), Dan Hao (Peking University), Lin Tan (Purdue University) | Pre-print
16:30 | 15m Talk | Commit Message Matters: Investigating Impact and Evolution of Commit Message Quality (Technical Track)
16:45 | 7m Talk | On the Significance of Category Prediction for Code-Comment Synchronization (Journal-First Papers) | Zhen Yang (City University of Hong Kong, China), Jacky Keung (City University of Hong Kong), Xiao Yu (Wuhan University of Technology), Yan Xiao (National University of Singapore), Zhi Jin (Peking University), Jingyu Zhang (City University of Hong Kong)
16:52 | 7m Talk | Correlating Automated and Human Evaluation of Code Documentation Generation Quality (Journal-First Papers) | Xing Hu (Zhejiang University), Qiuyuan Chen (Zhejiang University), Haoye Wang (Hangzhou City University), Xin Xia (Huawei), David Lo (Singapore Management University), Thomas Zimmermann (Microsoft Research)
17:00 | 7m Talk | Predictive Comment Updating with Heuristics and AST-Path-Based Neural Learning: A Two-Phase Approach (Journal-First Papers) | Bo Lin (National University of Defense Technology), Shangwen Wang (National University of Defense Technology), Zhongxin Liu (Zhejiang University), Xin Xia (Huawei), Xiaoguang Mao (National University of Defense Technology) | Link to publication / DOI / Pre-print