ASE 2023
Mon 11 - Fri 15 September 2023 Kirchberg, Luxembourg
Thu 14 Sep 2023 16:18 - 16:30 at Room C - Code Generation 3 Chair(s): David Lo

In recent years, researchers have introduced a significant number of code generation models. As human evaluation of every new model version is infeasible, the community has adopted automatic evaluation metrics such as BLEU to approximate the results of human judgement. These metrics originate from the machine translation domain, and it is unclear whether they are applicable to code generation tasks and how well they agree with human evaluation on such tasks. Other metrics, such as CodeBLEU and RUBY, were developed specifically to estimate code similarity and take the properties of source code into account, yet there are hardly any studies of their agreement with human evaluation. Despite this, minimal differences in metric scores have been used in recent papers to claim the superiority of some code generation models over others. In this paper, we present a study on the applicability of six metrics (BLEU, ROUGE-L, METEOR, ChrF, CodeBLEU, and RUBY) to the evaluation of code generation models. We conduct the study on two different code generation datasets and use human annotators to assess the quality of all models run on these datasets. The results indicate that for the CoNaLa dataset of Python one-liners, none of the metrics can correctly emulate human judgement on which model is better with >95% certainty if the difference in model scores is less than 5 points. For the HearthStone dataset, which consists of classes with a particular structure, a difference in model scores of at least 2 points is enough to claim the superiority of one model over the other. Our findings suggest that the ChrF metric is a better fit for the evaluation of code generation models than the commonly used BLEU and CodeBLEU. Nevertheless, finding a metric for code generation that closely agrees with humans requires additional work.
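
As a rough illustration of how such surface-level metrics behave on code, the sketch below scores a generated Python one-liner against a reference with sentence-level BLEU and ChrF. It assumes the sacrebleu library; the example snippets are hypothetical and this is not the paper's own evaluation pipeline.

```python
# Minimal sketch: compare a generated Python one-liner to a reference
# snippet with two surface-level metrics (sentence-level BLEU and ChrF).
# Assumes the sacrebleu package is installed (pip install sacrebleu);
# the snippets below are illustrative, not taken from the paper's datasets.
import sacrebleu

reference = "sorted(d.items(), key=lambda kv: kv[1], reverse=True)"
candidate = "sorted(d.items(), key=lambda x: x[1])[::-1]"

bleu = sacrebleu.sentence_bleu(candidate, [reference])   # token n-gram overlap
chrf = sacrebleu.sentence_chrf(candidate, [reference])   # character n-gram F-score

# Both scores are on a 0-100 scale; small gaps between models on such
# scores are exactly what the study examines for statistical meaning.
print(f"BLEU: {bleu.score:.2f}")
print(f"ChrF: {chrf.score:.2f}")
```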

Out of the BLEU -- presentation (Out of the BLEU -- ASE-1.pdf) 887 KiB

Thu 14 Sep

Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

15:30 - 17:00
Code Generation 3 (Research Papers / Journal-first Papers) at Room C
Chair(s): David Lo Singapore Management University
15:30
12m
Talk
Improving code extraction from coding screencasts using a code-aware encoder-decoder model
Research Papers
Abdulkarim Malkadi (Florida State University, USA / Jazan University, KSA), Ahmad Tayeb (Florida State University, USA), Sonia Haiduc (Florida State University)
File Attached
15:42
12m
Talk
InfeRE: Step-by-Step Regex Generation via Chain of Inference
Research Papers
Shuai Zhang (School of Software, Shanghai Jiao Tong University), Xiaodong Gu (Shanghai Jiao Tong University), Beijun Shen (Shanghai Jiao Tong University), Yuting Chen (Shanghai Jiao Tong University)
Pre-print File Attached
15:54
12m
Talk
MELT: Mining Effective Lightweight Transformations from Pull Requests
Research Papers
Daniel Ramos (Carnegie Mellon University and INESC-ID), Hailie Mitchell (Carnegie Mellon University), Ines Lynce (INESC-ID/IST, Universidade de Lisboa), Vasco Manquinho (INESC-ID, Universidade de Lisboa), Ruben Martins (Carnegie Mellon University), Claire Le Goues (Carnegie Mellon University)
Pre-print File Attached
16:06
12m
Talk
On the Evaluation of Neural Code Translation: Taxonomy and Benchmark
Research Papers
Mingsheng Jiao (Shanghai Jiao Tong University), Tingrui Yu (Shanghai Jiao Tong University), Xuan Li (Shanghai Jiao Tong University), Guan Jie Qiu (Shanghai Jiao Tong University), Xiaodong Gu (Shanghai Jiao Tong University), Beijun Shen (Shanghai Jiao Tong University)
Pre-print File Attached
16:18
12m
Talk
Out of the BLEU: How should we assess quality of the Code Generation models?
Journal-first Papers
Mikhail Evtikhiev (JetBrains Research), Egor Bogomolov (JetBrains Research), Yaroslav Sokolov (JetBrains), Timofey Bryksin (JetBrains Research)
Link to publication DOI Pre-print File Attached
16:30
12m
Talk
Pluggable Type Inference for Free
Research Papers
Martin Kellogg (New Jersey Institute of Technology), Daniel Daskiewicz (New Jersey Institute of Technology), Loi Ngo Duc Nguyen (New Jersey Institute of Technology), Muyeed Ahmed (New Jersey Institute of Technology), Michael D. Ernst (University of Washington)
Link to publication Pre-print File Attached