A Qualitative Investigation into LLM-Generated Multilingual Code Comments and Automatic Evaluation Metrics
Large Language Models have become essential coding assistants, yet their training data is predominantly English-centric. In this study, we evaluate the performance of code language models in non-English contexts, identifying challenges in their adoption and integration into multilingual workflows. We conduct an open-coding study to analyze errors in code comments generated by five state-of-the-art code models (CodeGemma, CodeLlama, CodeQwen1.5, GraniteCode, and StarCoder2) across five natural languages: Chinese, Dutch, English, Greek, and Polish. Our study yields a dataset of 12,500 labeled generations, which we publicly release. We then assess the reliability of standard metrics in capturing comment correctness across languages and evaluate their trustworthiness as judgment criteria. Through our open-coding investigation, we identify a taxonomy of 26 distinct error categories in model-generated code comments. These categories highlight variations in language cohesion, informativeness, and syntax adherence across natural languages. Our analysis shows that while these models frequently produce partially correct comments, modern neural metrics fail to reliably differentiate meaningful completions from random noise. Notably, the substantial score overlap between expert-rated correct and incorrect comments calls into question the effectiveness of these metrics in assessing generated comments.
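To make the metric-evaluation setting concrete, below is a minimal sketch of how a generated comment can be scored against a reference with one lexical metric (chrF, via sacrebleu) and one neural metric (BERTScore, via bert-score). The example strings, metric choices, and library calls are assumptions chosen for illustration; they are not the paper's exact experimental pipeline.

    # Illustrative sketch: score a "correct" and an "incorrect" generated comment
    # against the same reference. Metric choices and strings are hypothetical.
    from sacrebleu.metrics import CHRF
    from bert_score import score as bert_score

    reference = "Returns the index of the first element greater than the threshold."
    hypotheses = {
        "correct": "Return index of the first element above the given threshold.",
        "incorrect": "Sorts the list in place and returns nothing.",
    }

    chrf = CHRF()
    for label, hyp in hypotheses.items():
        lexical = chrf.sentence_score(hyp, [reference]).score
        # bert_score expects lists of candidates and references; returns P, R, F1 tensors
        _, _, f1 = bert_score([hyp], [reference], lang="en", verbose=False)
        print(f"{label:>9}: chrF={lexical:.1f}  BERTScore-F1={f1.item():.3f}")

If the two hypotheses receive similar scores despite one being semantically wrong, that overlap is the kind of unreliability the study reports for expert-rated correct versus incorrect comments.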
Session schedule: Thu 26 Jun (times shown in the Amsterdam/Berlin/Bern/Rome/Stockholm/Vienna time zone)

14:00 - 15:30

14:00 (60m, Keynote): Keynote 2 (Dr. Haipeng Cai), PROMISE 2025. Haipeng Cai, University at Buffalo, SUNY.

15:01 (14m, Talk): A Qualitative Investigation into LLM-Generated Multilingual Code Comments and Automatic Evaluation Metrics, PROMISE 2025. Jonathan Katzy, Yongcheng Huang, Gopal-Raj Panchu, Maksym Ziemlewski, Paris Loizides, Sander Vermeulen (Delft University of Technology), Arie van Deursen (TU Delft), Maliheh Izadi (Delft University of Technology). Pre-print available.

15:16 (9m, Talk): Near-Duplicate Build Failure Detection from Continuous Integration Logs, PROMISE 2025. Mingchen Li, Jesse Nyyssölä, Matti Luukkainen (University of Helsinki), Mika Mäntylä (University of Helsinki and University of Oulu).