ESEIW 2025
Mon 29 September - Fri 3 October 2025

This program is tentative and subject to change.

Thu 2 Oct 2025 14:05 - 14:20 at Kaiulani II - Program Comprehension and Review 1

Efficient code review is critical in industrial software development, yet assessing Pull Request (PR) quality at scale remains a persistent challenge. Large Language Models (LLMs) offer promise for automating quality evaluation and enhancement, but their practical deployment requires scalability, cross-model consistency, and alignment with code review competencies. This paper presents empirical insights with direct relevance to industrial practice, derived from three complementary studies.

First, we conducted large-scale automated quality scoring on 212,687 PRs from 82 diverse open-source repositories using zero-shot classification based on six software engineering competencies. This analysis uncovered distinct quality profiles and project archetypes, demonstrating the feasibility of scalable, automated PR evaluation.
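As a rough illustration of what such zero-shot competency scoring can look like in practice, the sketch below scores a single PR description against a set of competency labels with an off-the-shelf NLI model. The model choice and the six labels are illustrative placeholders, not the study's actual configuration.

from transformers import pipeline

# Off-the-shelf zero-shot classifier; the model choice is an assumption,
# not necessarily the one used in the study.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Placeholder competency labels; the paper's six competencies may differ.
COMPETENCIES = [
    "clear communication",
    "testing",
    "documentation",
    "design quality",
    "maintainability",
    "domain knowledge",
]

def score_pr_description(description: str) -> dict:
    """Return a per-competency relevance score in [0, 1] for one PR description."""
    result = classifier(description, candidate_labels=COMPETENCIES, multi_label=True)
    return dict(zip(result["labels"], result["scores"]))

print(score_pr_description("Adds unit tests for the parser and documents the new CLI flag."))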

Second, we evaluated the model agnosticism of six state-of-the-art LLMs (ChatGPT-4o, Claude Sonnet, Deepseek Deepthink, Gemini Flash, Grok, Qwen) by generating controlled PR description variants (Degraded, Improved) and comparing outputs. Results showed partial semantic consistency (SBERT cosine=0.74, BERTScore F1=0.85), but significant variability in lexical and stylistic features (e.g., BLEU=0.03), highlighting limitations in consistency and user experience when switching between models. These findings emphasize the importance of model-aware tool design to manage output variability, cost, and user trust.
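The metrics above can be reproduced in spirit with standard libraries; the sketch below compares two model outputs on SBERT cosine similarity, BERTScore F1, and BLEU. The embedding model and BLEU smoothing choices are assumptions, not the paper's exact setup.

from sentence_transformers import SentenceTransformer, util
from bert_score import score as bert_score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def compare_outputs(text_a: str, text_b: str) -> dict:
    """Compare two generated PR descriptions on semantic and lexical metrics."""
    # Semantic similarity from sentence embeddings (model choice is illustrative).
    sbert = SentenceTransformer("all-MiniLM-L6-v2")
    emb = sbert.encode([text_a, text_b], convert_to_tensor=True)
    cosine = util.cos_sim(emb[0], emb[1]).item()

    # Token-level semantic overlap (BERTScore F1).
    _, _, f1 = bert_score([text_a], [text_b], lang="en")

    # Surface n-gram overlap (BLEU), which is sensitive to wording and style.
    bleu = sentence_bleu([text_b.split()], text_a.split(),
                         smoothing_function=SmoothingFunction().method1)

    return {"sbert_cosine": cosine, "bertscore_f1": f1.item(), "bleu": bleu}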

Third, based on these results, we conducted a controlled experiment with 38 software professionals assessing LLM-generated PR descriptions. Participants compared Original (O) human-written descriptions with Degraded (D), Improved from Original (IO), and Improved from Degraded (ID) versions produced through a multi-stage LLM pipeline. Surprisingly, the ID variants were rated significantly higher than both D (p-value=0.029) and O (p-value=0.026), largely due to the added structure and clarity introduced by the models. However, IO variants were not preferred at a statistically significant level (p-value>0.05) and often received negative feedback for verbosity and a generic “AI tone.” These results suggest that LLMs can enhance clarity and structure but must be guided carefully to avoid output homogenization and loss of human-authored nuance.
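The abstract does not name the statistical test behind these p-values; one plausible way to run such a pairwise comparison of per-participant ratings is a Wilcoxon signed-rank test, sketched below as an assumption rather than the study's actual procedure.

from scipy.stats import wilcoxon

def compare_variants(ratings_variant_a: list[float], ratings_variant_b: list[float]):
    """Paired comparison of per-participant ratings for two PR-description variants.

    The Wilcoxon signed-rank test here is an assumed choice; the study may have
    used a different procedure. Inputs are one rating per participant per variant.
    """
    statistic, p_value = wilcoxon(ratings_variant_a, ratings_variant_b)
    return statistic, p_value

# Example call (rating lists are placeholders, not study data):
# stat, p = compare_variants(ratings_ID, ratings_O)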

We conclude by presenting a proof-of-concept system, developed in collaboration with Mercedes-Benz Tech Innovation GmbH, that integrates these insights into a practical toolchain. Our findings advocate for human-centered, model-aware LLM integration strategies to support scalable, consistent, and competency-aligned code review processes in industry.


Thu 2 Oct

Displayed time zone: Hawaii

13:50 - 14:50: Program Comprehension and Review 1 (Kaiulani II)
13:50 (15m) Talk
When Retriever Meets Generator: A Joint Model for Code Comment Generation
ESEM - Emerging Results and Vision Track
Tien L. T. Pham (Hanoi University of Science and Technology), Anh M. T. Bui (Hanoi University of Science and Technology), Huy N. D. Pham (AI Young Talent Academy (AI4Life), Hanoi University of Science and Technology), Alessio Bucaioni (Malardalen University), Phuong T. Nguyen (University of L’Aquila)
Pre-print
14:05 (15m) Talk
From Assessment to Enhancement of Pull Requests at Scale: Aligning Code Reviews with Developer Competencies Using Large Language Models
ESEM - Industry, Government, and Community Track
Luca Mariotto (Hasso-Plattner Institute), Christian Medeiros Adriano (Hasso Plattner Institute, University of Potsdam), René Eichhorn (Mercedes-Benz Tech Innovation), Daniel Burgstahler (Mercedes-Benz Tech Innovation), Holger Giese (Hasso Plattner Institute, University of Potsdam)
14:20 (15m) Talk
Rethinking Code Review Workflows with LLM Assistance: An Empirical Study
ESEM - Industry, Government, and Community Track
Fannar Steinn Aðalsteinsson (WirelessCar Sweden AB & Chalmers University of Technology), Björn Borgar Magnússon (WirelessCar Sweden AB), Mislav Milicevic (WirelessCar Sweden AB), Adam Nirving Davidsson (WirelessCar Sweden AB), Chih-Hong Cheng (Carl von Ossietzky Universität Oldenburg & Chalmers University of Technology)
14:35 (15m) Talk
Interrogative Comments Posed by Review Comment Generators: An Empirical Study of Gerrit
ESEM - Technical Track
Farshad Kazemi (University of Waterloo), Maxime Lamothe (Polytechnique Montreal), Shane McIntosh (University of Waterloo)
Pre-print