Exploring the Impact of the Output Format on the Evaluation of Large Language Models for Code Translation (FORGE 2024 - Research Track)

Who

Marcos Macedo, Yuan Tian, Filipe Cogo, Bram Adams

Track

FORGE 2024 Research Track

Time Zone

The program is currently displayed in (GMT+01:00) Lisbon.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Sun 14 Apr 2024, 14:40 - 14:54, at Luis de Freitas Branco. Session: Keynote 2 & Properties of Foundation Models. Chair(s): David Lo, Feifei Niu

Abstract

Code translation between programming languages is a long-standing and critical task in software engineering, facilitating the modernization of legacy systems, ensuring cross-platform compatibility, and enhancing software performance. With the recent advances in large language models (LLMs) and their applications to code translation, there is an increasing need for comprehensive evaluation of these models. In this study, we empirically analyze the generated outputs of eleven popular instruction-tuned LLMs, with parameter counts ranging from 1B to 46.7B, on 3,820 translation pairs across five languages: C, C++, Go, Java, and Python. We found that between 26.4% and 73.7% of the code translations produced by the evaluated LLMs require post-processing, because these translations often contain a mix of code, quotes, and text rather than pure source code. Overlooking the output format of these models can inadvertently lead to an underestimation of their true performance. This is particularly evident when evaluating them with execution-based metrics such as Computational Accuracy (CA). Our research demonstrates that a strategic combination of prompt engineering and regular expressions can effectively extract the source code from the model's generated output. Results show that our method helps the eleven selected models achieve an average Code Extraction Success Rate (CSR) of 92.73%. We believe our findings shed light on, and motivate future research into, conducting more reliable benchmarks of LLMs for code translation.
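To make the post-processing problem concrete, the sketch below shows one way a regular expression could pull source code out of a model response that wraps the translation in explanatory text. It is an illustration only, not the authors' method: it assumes the model was prompted to put its code inside a Markdown-style fence, and the names FENCE, FENCE_RE, extract_code, and extraction_rate are hypothetical. The rate it computes is a rough analogue of an extraction success rate and may not match how the paper defines CSR.

```python
import re
from typing import List, Optional

# Assumption: the prompt asked the model to wrap its code in a Markdown fence.
FENCE = "`" * 3  # three backticks, built this way to keep this listing readable

# Hypothetical pattern: opening fence, optional language tag, newline,
# then the shortest possible body up to the next fence.
FENCE_RE = re.compile(
    FENCE + r"[ \t]*([\w+#-]*)[ \t]*\r?\n(.*?)" + FENCE,
    re.DOTALL,
)


def extract_code(generation: str) -> Optional[str]:
    """Return the body of the first fenced block, or None if no fence is found."""
    match = FENCE_RE.search(generation)
    if match is None:
        return None
    # Drop the newlines that separate the body from the fences.
    return match.group(2).strip("\n")


def extraction_rate(generations: List[str]) -> float:
    """Fraction of generations from which a fenced block could be extracted."""
    if not generations:
        return 0.0
    hits = sum(1 for g in generations if extract_code(g) is not None)
    return hits / len(generations)
```

A response such as "Sure, here is the translation:" followed by a fenced block and a closing remark would yield only the code, while a response with no fence at all returns None and counts as a failed extraction. Under this simplified scheme, a bare code response without a fence would also count as a failure, which is one reason a real pipeline would likely need additional rules.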

Link to Preprint

https://arxiv.org/abs/2403.17214

Marcos Macedo: Queen's University, Kingston, Ontario, Canada

Yuan Tian: Queen's University, Kingston, Ontario, Canada

Filipe Cogo: Centre for Software Excellence, Huawei Canada, Canada

Bram Adams: Queen's University, Canada

Session Program

Sun 14 Apr
Displayed time zone: Lisbon

14:00 - 15:30	Keynote 2 & Properties of Foundation Models (Research Track / Keynotes) at Luis de Freitas Branco. Chair(s): David Lo (Singapore Management University), Feifei Niu (University of Ottawa)

14:00 40m Keynote		Keynote 2: Towards an Interpretable Science of Deep Learning for Software Engineering: A Causal Inference View (Keynotes). Denys Poshyvanyk (William & Mary)
14:40 14m Full-paper		Exploring the Impact of the Output Format on the Evaluation of Large Language Models for Code Translation (Full Paper, Research Track). Marcos Macedo (Queen's University, Kingston, Ontario), Yuan Tian (Queen's University, Kingston, Ontario), Filipe Cogo (Centre for Software Excellence, Huawei Canada), Bram Adams (Queen's University). Pre-print available.
14:54 7m Short-paper		Is Attention All You Need? Toward a Conceptual Model for Social Awareness in Large Language Models (New Idea Paper, Research Track). Gianmario Voria (University of Salerno), Gemma Catolino (University of Salerno), Fabio Palomba (University of Salerno). Pre-print available.
15:01 14m Full-paper		An Exploratory Investigation into Code License Infringements in Large Language Model Training Datasets (Full Paper, Research Track). Jonathan Katzy (Delft University of Technology), Răzvan Mihai Popescu (Delft University of Technology), Arie van Deursen (Delft University of Technology), Maliheh Izadi (Delft University of Technology)
15:15 15m Other		Discussion (Research Track)