Out of the BLEU: How should we assess quality of the Code Generation models?
In recent years, researchers have introduced a significant number of code generation models. As human evaluation of every new model version is infeasible, the community has adopted automatic evaluation metrics such as BLEU to approximate human judgement. These metrics originate from the machine translation domain, and it is unclear whether they are applicable to code generation tasks and how well they agree with human evaluation on them. Other metrics, CodeBLEU and RUBY, were developed to estimate the similarity of code and take the properties of source code into account; however, there are hardly any studies on their agreement with human evaluation either. Despite all that, recent papers have used minimal differences in metric scores to claim that one code generation model is superior to another. In this paper, we present a study on the applicability of six metrics – BLEU, ROUGE-L, METEOR, ChrF, CodeBLEU, and RUBY – for the evaluation of code generation models. We conduct the study on two different code generation datasets and use human annotators to assess the quality of all models run on these datasets. The results indicate that for the CoNaLa dataset of Python one-liners, none of the metrics can correctly emulate human judgement on which model is better with more than 95% certainty if the difference in model scores is less than 5 points. For the HearthStone dataset, which consists of classes of a particular structure, a difference in model scores of at least 2 points is enough to claim the superiority of one model over the other. Our findings suggest that the ChrF metric is a better fit for the evaluation of code generation models than the commonly used BLEU and CodeBLEU. Yet, finding a metric for code generation that closely agrees with humans requires additional work.
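For illustration, the sketch below shows how corpus-level BLEU and ChrF scores of two models could be computed and compared with the sacrebleu library. The model outputs, references, and the comparison against a 5-point gap are illustrative assumptions and not the paper's actual evaluation pipeline or data.

```python
# Minimal sketch (assumes sacrebleu is installed): score two hypothetical
# models' generated Python snippets against reference snippets with BLEU
# and ChrF, then check whether the score gap clears an illustrative threshold.
from sacrebleu.metrics import BLEU, CHRF

# Reference solutions (one per task) and outputs of two hypothetical models.
references = [
    "df = df.dropna(subset=['price'])",
    "result = sorted(items, key=lambda x: x[1], reverse=True)",
]
model_a = [
    "df = df.dropna(subset=['price'])",
    "result = sorted(items, key=lambda i: i[1])",
]
model_b = [
    "df.dropna(subset=['price'], inplace=True)",
    "result = sorted(items, key=lambda x: x[1], reverse=True)",
]

bleu, chrf = BLEU(), CHRF()
scores = {}
for name, outputs in [("model A", model_a), ("model B", model_b)]:
    # sacrebleu expects a list of reference streams, hence [references].
    b = bleu.corpus_score(outputs, [references]).score
    c = chrf.corpus_score(outputs, [references]).score
    scores[name] = (b, c)
    print(f"{name}: BLEU={b:.1f}, ChrF={c:.1f}")

# Per the paper's finding for CoNaLa-style one-liners, a gap below ~5 metric
# points is too small to claim one model is better than the other.
gap = abs(scores["model A"][1] - scores["model B"][1])
print("ChrF gap large enough to claim superiority:", gap >= 5)
```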
Presentation: Out of the BLEU -- ASE-1.pdf (887 KiB)
Thu 14 Sep (displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna)
15:30 - 17:00 | Code Generation 3 -- Research Papers / Journal-first Papers, Room C. Chair(s): David Lo (Singapore Management University)

15:30 (12m talk) | Improving code extraction from coding screencasts using a code-aware encoder-decoder model -- Research Papers. Abdulkarim Malkadi (Florida State University, USA / Jazan University, KSA), Ahmad Tayeb (Florida State University, USA), Sonia Haiduc (Florida State University)

15:42 (12m talk) | InfeRE: Step-by-Step Regex Generation via Chain of Inference -- Research Papers. Shuai Zhang (School of Software, Shanghai Jiao Tong University), Xiaodong Gu (Shanghai Jiao Tong University), Beijun Shen (Shanghai Jiao Tong University), Yuting Chen (Shanghai Jiao Tong University)

15:54 (12m talk) | MELT: Mining Effective Lightweight Transformations from Pull Requests -- Research Papers. Daniel Ramos (Carnegie Mellon University and INESC-ID), Hailie Mitchell (Carnegie Mellon University), Ines Lynce (INESC-ID / IST, Universidade de Lisboa), Vasco Manquinho (INESC-ID, Universidade de Lisboa), Ruben Martins (Carnegie Mellon University), Claire Le Goues (Carnegie Mellon University)

16:06 (12m talk) | On the Evaluation of Neural Code Translation: Taxonomy and Benchmark -- Research Papers. Mingsheng Jiao, Tingrui Yu, Xuan Li, Guan Jie Qiu, Xiaodong Gu, Beijun Shen (all Shanghai Jiao Tong University)

16:18 (12m talk) | Out of the BLEU: How should we assess quality of the Code Generation models? -- Journal-first Papers. Mikhail Evtikhiev (JetBrains Research), Egor Bogomolov (JetBrains Research), Yaroslav Sokolov (JetBrains), Timofey Bryksin (JetBrains Research)

16:30 (12m talk) | Pluggable Type Inference for Free -- Research Papers. Martin Kellogg (New Jersey Institute of Technology), Daniel Daskiewicz (New Jersey Institute of Technology), Loi Ngo Duc Nguyen (New Jersey Institute of Technology), Muyeed Ahmed (New Jersey Institute of Technology), Michael D. Ernst (University of Washington)