FSE 2025
Mon 23 - Fri 27 June 2025 Trondheim, Norway
co-located with ISSTA 2025
Wed 25 Jun 2025 14:20 - 14:40 at Cosmos Hall - LLM for SE 4 Chair(s): Ting Su

A good summary can often be very useful during program comprehension. While a brief, fluent, and relevant summary can be helpful, producing one requires significant human effort. Good summaries are often unavailable in software projects, making maintenance more difficult. There has been a considerable body of research into automated AI-based methods, using Large Language Models (LLMs), to generate summaries of code; there has also been quite a bit of work on ways to measure the performance of such summarization methods, with special attention paid to how closely these AI-generated summaries resemble a summary a human might have produced. Measures such as BERTScore and BLEU have been suggested and evaluated in human-subject studies.
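Metrics like BLEU rest on n-gram overlap between a candidate summary and a reference. As a rough illustration of the idea (a simplified sketch, not the paper's evaluation code), a unigram-plus-bigram precision with a brevity penalty can be computed like this:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n), scaled by a brevity penalty for short candidates."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped overlap: each candidate n-gram counts at most as
        # often as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "returns the sum of two integers"
print(simple_bleu(ref, ref))                  # 1.0 for an identical summary
print(simple_bleu("adds two integers", ref))  # lower for a partial match
```

Real evaluations use the full BLEU formulation (up to 4-grams, with smoothing), and BERTScore instead compares contextual embeddings rather than surface n-grams.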

However, prior work has noted that LLM-produced summaries can be too long, disfluent, irrelevant, etc.: in general, too dissimilar to what a human might say. Given an LLM-produced code summary, how can we judge whether it is good enough? Given some input source code and an LLM-generated summary, existing approaches can help judge brevity, fluency, and relevance; however, it is difficult to gauge whether an LLM-produced summary sufficiently resembles what a human might produce without a “golden” human-produced summary to compare against. Prior research indicates that human-produced summaries are generally preferred by human raters, so we explore this issue in this paper. We study this resemblance question as a calibration problem: given just the summary from an LLM, can we compute a confidence measure that provides a reliable indication of whether the summary sufficiently resembles what a human would have produced in this situation? We examine this question using several LLMs, for several languages, and in several different settings. Our investigation suggests approaches that provide reliable predictions of the likelihood that an LLM-generated summary would sufficiently resemble a summary a human might write for the same code.
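A confidence measure is well calibrated if, among summaries assigned confidence c, roughly a fraction c actually meet the bar. The standard way to quantify miscalibration is Expected Calibration Error (ECE); a minimal sketch of the general metric (not the paper's specific method), assuming we already have per-summary confidences and binary "resembles a human summary" labels:

```python
def expected_calibration_error(confidences, labels, n_bins=10):
    """ECE: bin predictions by confidence, then average the
    |accuracy - mean confidence| gap per bin, weighted by bin size."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are half-open (lo, hi]; put exact zeros in the first bin.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(labels[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(accuracy - avg_conf)
    return ece

# Toy data: high-confidence summaries are acceptable, low-confidence ones are not.
confs = [0.95, 0.95, 0.95, 0.95, 0.15, 0.15]
labs = [1, 1, 1, 1, 0, 0]
print(expected_calibration_error(confs, labs))  # ≈ 0.083, close to well calibrated
```

A lower ECE means the reported confidence can be trusted as a probability, which is exactly what a downstream tool needs to decide whether to show an LLM-generated summary to a developer.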

Wed 25 Jun

Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

14:00 - 15:20
LLM for SE 4 (Research Papers / Journal First) at Cosmos Hall
Chair(s): Ting Su East China Normal University
14:00
20m
Talk
Large Language Models for Software Engineering: A Systematic Literature Review
Journal First
Xinyi Hou Huazhong University of Science and Technology, Yanjie Zhao Huazhong University of Science and Technology, Yue Liu Monash University, Zhou Yang Singapore Management University; University of Alberta, Kailong Wang Huazhong University of Science and Technology, Li Li Beihang University, Xiapu Luo Hong Kong Polytechnic University, David Lo Singapore Management University, John Grundy Monash University, Haoyu Wang Huazhong University of Science and Technology
14:20
20m
Talk
Calibration of Large Language Models on Code Summarization
Research Papers
Yuvraj Virk UC Davis, Prem Devanbu University of California at Davis, Toufique Ahmed IBM Research
DOI
14:40
20m
Talk
Code Red! On the Harmfulness of Applying Off-the-shelf Large Language Models to Programming Tasks
Research Papers
Ali Al-Kaswan Delft University of Technology, Netherlands, Sebastian Deatc Delft University of Technology, Begüm Koç Delft University of Technology, Arie van Deursen TU Delft, Maliheh Izadi Delft University of Technology
DOI Pre-print
15:00
20m
Talk
PATCH: Empowering Large Language Model with Programmer-Intent Guidance and Collaborative-Behavior Simulation for Automatic Bug Fixing
Journal First
Yuwei Zhang Institute of Software Chinese Academy of Sciences, Zhi Jin Peking University, Ying Xing Beijing University of Posts and Telecommunications, Ge Li Peking University, Fang Liu Beihang University, Jiaxin Zhu Institute of Software at Chinese Academy of Sciences, Wensheng Dou Institute of Software Chinese Academy of Sciences, Jun Wei Institute of Software at Chinese Academy of Sciences; University of Chinese Academy of Sciences

Information for Participants
Wed 25 Jun 2025 14:00 - 15:20 at Cosmos Hall - LLM for SE 4 Chair(s): Ting Su
Info for room Cosmos Hall:

This is the main event hall of the Clarion Hotel, which will be used to host keynote talks and other plenary sessions. The FSE and ISSTA banquets will also take place in this room.

The room is just in front of the registration desk, on the other side of the main conference area. The large doors with numbers “1” and “2” provide access to the Cosmos Hall.
