Out of the BLEU: How should we assess quality of the Code Generation models?
In recent years, researchers have introduced a significant number of code generation models. As human evaluation of every new model version is infeasible, the community has adopted automatic evaluation metrics such as BLEU to approximate human judgement. These metrics originate from the machine translation domain, and it is unclear whether they are applicable to code generation tasks and how well they agree with human evaluation on them. Other metrics, CodeBLEU and RUBY, were developed to estimate the similarity of code and take the properties of source code into account; however, there are hardly any studies on their agreement with human evaluation either. Despite all that, recent papers have used minimal differences in metric scores to claim that one code generation model is superior to another. In this paper, we present a study on the applicability of six metrics – BLEU, ROUGE-L, METEOR, ChrF, CodeBLEU, and RUBY – for the evaluation of code generation models. We conduct the study on two different code generation datasets and use human annotators to assess the quality of all models run on these datasets. The results indicate that for the CoNaLa dataset of Python one-liners, none of the metrics can correctly emulate human judgement on which model is better with more than 95% certainty if the difference in model scores is less than 5 points. For the HearthStone dataset, which consists of classes of a particular structure, a difference in model scores of at least 2 points is enough to claim the superiority of one model over the other. Our findings suggest that the ChrF metric is a better fit for the evaluation of code generation models than the commonly used BLEU and CodeBLEU. Yet, finding a metric for code generation that closely agrees with humans requires additional work.
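For illustration, the sketch below shows how corpus-level BLEU and ChrF scores of two models could be computed and compared with the sacrebleu library. The model outputs, references, and the comparison against a 5-point gap are illustrative assumptions and not the paper's actual evaluation pipeline or data.

```python
# Minimal sketch (assumes sacrebleu is installed): score two hypothetical
# models' generated Python snippets against reference snippets with BLEU
# and ChrF, then check whether the score gap clears an illustrative threshold.
from sacrebleu.metrics import BLEU, CHRF

# Reference solutions (one per task) and outputs of two hypothetical models.
references = [
    "df = df.dropna(subset=['price'])",
    "result = sorted(items, key=lambda x: x[1], reverse=True)",
]
model_a = [
    "df = df.dropna(subset=['price'])",
    "result = sorted(items, key=lambda i: i[1])",
]
model_b = [
    "df.dropna(subset=['price'], inplace=True)",
    "result = sorted(items, key=lambda x: x[1], reverse=True)",
]

bleu, chrf = BLEU(), CHRF()
scores = {}
for name, outputs in [("model A", model_a), ("model B", model_b)]:
    # sacrebleu expects a list of reference streams, hence [references].
    b = bleu.corpus_score(outputs, [references]).score
    c = chrf.corpus_score(outputs, [references]).score
    scores[name] = (b, c)
    print(f"{name}: BLEU={b:.1f}, ChrF={c:.1f}")

# Per the paper's finding for CoNaLa-style one-liners, a gap below ~5 metric
# points is too small to claim one model is better than the other.
gap = abs(scores["model A"][1] - scores["model B"][1])
print("ChrF gap large enough to claim superiority:", gap >= 5)
```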
Presentation: Out of the BLEU -- ASE-1.pdf (887 KiB)
Thu 14 Sep (displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna)
15:30 - 17:00 | Code Generation 3 -- Research Papers / Journal-first Papers, Room C. Chair(s): David Lo (Singapore Management University)

15:30 (12m talk) | Improving code extraction from coding screencasts using a code-aware encoder-decoder model -- Research Papers. Abdulkarim Malkadi (Florida State University, USA / Jazan University, KSA), Ahmad Tayeb (Florida State University, USA), Sonia Haiduc (Florida State University)

15:42 (12m talk) | InfeRE: Step-by-Step Regex Generation via Chain of Inference -- Research Papers. Shuai Zhang (School of Software, Shanghai Jiao Tong University), Xiaodong Gu (Shanghai Jiao Tong University), Beijun Shen (Shanghai Jiao Tong University), Yuting Chen (Shanghai Jiao Tong University)

15:54 (12m talk) | MELT: Mining Effective Lightweight Transformations from Pull Requests -- Research Papers. Daniel Ramos (Carnegie Mellon University and INESC-ID), Hailie Mitchell (Carnegie Mellon University), Ines Lynce (INESC-ID / IST, Universidade de Lisboa), Vasco Manquinho (INESC-ID, Universidade de Lisboa), Ruben Martins (Carnegie Mellon University), Claire Le Goues (Carnegie Mellon University)

16:06 (12m talk) | On the Evaluation of Neural Code Translation: Taxonomy and Benchmark -- Research Papers. Mingsheng Jiao, Tingrui Yu, Xuan Li, Guan Jie Qiu, Xiaodong Gu, Beijun Shen (all Shanghai Jiao Tong University)

16:18 (12m talk) | Out of the BLEU: How should we assess quality of the Code Generation models? -- Journal-first Papers. Mikhail Evtikhiev (JetBrains Research), Egor Bogomolov (JetBrains Research), Yaroslav Sokolov (JetBrains), Timofey Bryksin (JetBrains Research)

16:30 (12m talk) | Pluggable Type Inference for Free -- Research Papers. Martin Kellogg (New Jersey Institute of Technology), Daniel Daskiewicz (New Jersey Institute of Technology), Loi Ngo Duc Nguyen (New Jersey Institute of Technology), Muyeed Ahmed (New Jersey Institute of Technology), Michael D. Ernst (University of Washington)