Automated Inline Comment Smell Detection and Repair with Large Language Models
This program is tentative and subject to change.
Context: Code comments play a critical role in improving code readability, maintainability, and collaborative development. However, comments may deviate from best practices, either through software evolution, where code changes are not reflected in the accompanying comments, or through practitioner-related issues such as vague descriptions, redundancy, and misaligned intent. These issues lead to various comment smells that degrade software quality. While prior studies have explored comment inconsistencies, most are limited in scope, either addressing a narrow subset of smells or focusing solely on detection without considering repair.
Objective: This study evaluates the effectiveness of large language models (LLMs) in both detecting and repairing inline code comment smells, using a comprehensive taxonomy of code comment smell types.
Method: We extended a prior dataset by incorporating repaired versions of smelly comments, resulting in 2,211 unique instances. Four LLMs (GPT-4o-mini, o3-mini, DeepSeek-V3, and Codestral-2501) are evaluated under zero-shot and few-shot prompting strategies. To account for the non-deterministic behavior of LLM outputs and to ensure robustness, each configuration is executed five times. Detection performance is measured using accuracy, macro F1 score, and Matthews correlation coefficient (MCC); repair quality is evaluated using SBERT similarity, METEOR, and ROUGE-L. Our multi-stage pipeline feeds detection outputs into the repair phase, where the detection run with the highest macro F1 score is used to simulate the best possible repair scenario. Median scores across the five runs are reported for model comparison.
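As a rough illustration of this evaluation protocol, the sketch below computes per-run detection metrics and aggregates them by the median across five runs. The smell labels, predictions, and data are hypothetical stand-ins, not the study's dataset or evaluation code.

```python
# Minimal sketch of the detection evaluation described above: each prompting
# configuration is run five times, per-run scores are computed, and the
# median across runs is reported. All labels below are illustrative.
from statistics import median

from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Hypothetical gold labels and five runs of predictions over the same
# instances ("none" = no smell detected).
gold = ["misleading", "none", "obvious", "none", "task", "none"]
runs = [
    ["misleading", "none", "obvious", "task", "task", "none"],
    ["misleading", "none", "none", "none", "task", "none"],
    ["none", "none", "obvious", "none", "task", "none"],
    ["misleading", "obvious", "obvious", "none", "task", "none"],
    ["misleading", "none", "obvious", "none", "none", "none"],
]

def detection_scores(y_true, y_pred):
    """Accuracy, macro F1, and MCC for one run of one configuration."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }

per_run = [detection_scores(gold, pred) for pred in runs]
# Median across the five runs, as reported for model comparison.
summary = {m: median(r[m] for r in per_run) for m in per_run[0]}
print(summary)
```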
Results: o3-mini with few-shot prompting achieves the highest median detection performance: macro F1 of 0.41, MCC of 0.50, and accuracy of 0.72, exceeding the GPT-4 baseline. For repair, Codestral-2501 in the zero-shot setting yields the best results with a median SBERT score of 0.61, followed by DeepSeek-V3 and GPT-4o-mini at 0.53 and o3-mini at 0.46. Few-shot prompts improve detection, while zero-shot prompts are more effective for repair.
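To make the zero-shot/few-shot distinction concrete, the sketch below shows illustrative prompt templates for the two tasks: few-shot detection includes an in-context example, zero-shot repair does not. The wording, placeholders, and example are assumptions for illustration; the study's actual prompts are not reproduced here.

```python
# Hypothetical prompt templates; {code} and {comment} are filled per instance.
ZERO_SHOT_REPAIR = (
    "Rewrite the following inline comment so it accurately and concisely "
    "describes the code.\n\n"
    "Code:\n{code}\n\nComment:\n{comment}\n\nRepaired comment:"
)

FEW_SHOT_DETECTION = (
    "Classify the inline comment's smell type.\n\n"
    "Example:\n"
    "Code: i += 1\n"
    "Comment: // increment i by one\n"
    "Smell: obvious\n\n"
    "Code:\n{code}\nComment:\n{comment}\nSmell:"
)

# Example instantiation for a single (code, comment) pair.
print(FEW_SHOT_DETECTION.format(code="total = a + b", comment="// subtract b"))
print(ZERO_SHOT_REPAIR.format(code="total = a + b", comment="// subtract b"))
```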
Conclusion: Lightweight LLMs such as o3-mini can achieve strong detection performance when guided by effective few-shot prompts, as the detection results above show. In contrast, repair benefits more from zero-shot prompting, though it introduces challenges such as overfitting and the risk of generating new smells. Our findings support the development of practical tools, including a GitHub-integrated comment repair assistant, and motivate future work on dynamic prompt selection and multilingual benchmark construction.
Wed 19 Nov (displayed time zone: Seoul)
11:00 - 12:30

Time | Length | Type | Title | Track | Authors | Pre-print
11:00 | 10m | Talk | Automated Inline Comment Smell Detection and Repair with Large Language Models | Research Papers | Hatice Kübra Çağlar, Semih Çağlar, Eray Tüzün (all Bilkent University) | Yes
11:10 | 10m | Talk | What’s DAT Smell? Untangling and Weaving the Disjoint Assertion Tangle Test Smell | Research Papers | Monil Narang, Hang Du, James Jones (all University of California, Irvine) | Yes
11:20 | 10m | Talk | Your Build Scripts Stink: The State of Code Smells in Build Scripts | Research Papers | Mahzabin Tamanna, Yash Chandrani, Matthew Burrows, Brandon Wroblewski, Dominik Wermke, Laurie Williams (all North Carolina State University) | –
11:30 | 10m | Talk | Do Experts Agree About Smelly Infrastructure? | Journal-First Track | Sogol Masoumzadeh (McGill University), Nuno Saavedra (INESC-ID and IST, University of Lisbon), Rungroj Maipradit (University of Waterloo), Lili Wei (McGill University), João F. Ferreira (INESC-ID and IST, University of Lisbon), Daniel Varro (Linköping University / McGill University), Shane McIntosh (University of Waterloo) | –
11:40 | 10m | Talk | Wired for Reuse: Automating Context-Aware Code Adaptation in IDEs via LLM-Based Agent | Research Papers | Taiming Wang (Beijing Institute of Technology), Yanjie Jiang (Peking University), Chunhao Dong (Beijing Institute of Technology), Yuxia Zhang (Beijing Institute of Technology), Hui Liu (Beijing Institute of Technology) | –
11:50 | 10m | Talk | BinStruct: Binary Structure Recovery Combining Static Analysis and Semantics | Research Papers | Yiran Zhang, Zhengzi Xu (Imperial Global Singapore), Zhe Lang (Institute of Information Engineering, CAS), Chengyue Liu, Yuqiang Sun (Nanyang Technological University), Wenbo Guo (School of Cyber Science and Engineering, Sichuan University), Chengwei Liu (Nanyang Technological University), Weisong Sun (Nanyang Technological University), Yang Liu (Nanyang Technological University) | –
12:00 | 10m | Talk | SateLight: A Satellite Application Update Framework for Satellite Computing | Research Papers | Jinfeng Wen, Jianshu Zhao, Zixi Zhu, Xiaomin Zhang, Qi Liang, Ao Zhou, Shangguang Wang (all Beijing University of Posts and Telecommunications) | –
12:10 | 10m | Talk | ComCat: Expertise-Guided Context Generation to Enhance Code Comprehension | Journal-First Track | Skyler Grandel (Vanderbilt University), Scott Andersen (National Autonomous University of Mexico), Yu Huang (Vanderbilt University), Kevin Leach (Vanderbilt University) | –
12:20 | 10m | Talk | AdaptEval: A Benchmark for Evaluating Large Language Models on Code Snippet Adaptation | Research Papers | Tanghaoran Zhang (National University of Defense Technology), Xinjun Mao (National University of Defense Technology), Shangwen Wang (National University of Defense Technology), Yuxin Zhao (Key Laboratory of Software Engineering for Complex Systems, National University of Defense Technology), Yao Lu (National University of Defense Technology), Jin Zhang (Hunan Normal University), Zhang Zhang (Key Laboratory of Software Engineering for Complex Systems, National University of Defense Technology), Kang Yang (National University of Defense Technology), Yue Yu (PengCheng Lab) | –