ESEIW 2025
Mon 29 September - Fri 3 October 2025

This program is tentative and subject to change.

Thu 2 Oct 2025 14:05 - 14:20 at Kaiulani II - Program Comprehension and Review 1

Efficient code review is critical in industrial software development, yet assessing Pull Request (PR) quality at scale remains a persistent challenge. Large Language Models (LLMs) offer promise for automating quality evaluation and enhancement, but their practical deployment requires scalability, cross-model consistency, and alignment with code review competencies. This paper presents empirical insights with direct relevance to industrial practice, derived from three complementary studies.

First, we conducted large-scale automated quality scoring on 212,687 PRs from 82 diverse open-source repositories using zero-shot classification based on six software engineering competencies. This analysis uncovered distinct quality profiles and project archetypes, demonstrating the feasibility of scalable, automated PR evaluation.
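As a rough illustration of what such zero-shot competency scoring can look like in practice, the sketch below scores a single PR description against a set of competency labels with an off-the-shelf NLI model. The model choice and the six labels are illustrative placeholders, not the study's actual configuration.

from transformers import pipeline

# Off-the-shelf zero-shot classifier; the model choice is an assumption,
# not necessarily the one used in the study.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Placeholder competency labels; the paper's six competencies may differ.
COMPETENCIES = [
    "clear communication",
    "testing",
    "documentation",
    "design quality",
    "maintainability",
    "domain knowledge",
]

def score_pr_description(description: str) -> dict:
    """Return a per-competency relevance score in [0, 1] for one PR description."""
    result = classifier(description, candidate_labels=COMPETENCIES, multi_label=True)
    return dict(zip(result["labels"], result["scores"]))

print(score_pr_description("Adds unit tests for the parser and documents the new CLI flag."))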

Second, we evaluated the model agnosticism of six state-of-the-art LLMs (ChatGPT-4o, Claude Sonnet, Deepseek Deepthink, Gemini Flash, Grok, Qwen) by generating controlled PR description variants (Degraded, Improved) and comparing outputs. Results showed partial semantic consistency (SBERT cosine=0.74, BERTScore F1=0.85), but significant variability in lexical and stylistic features (e.g., BLEU=0.03), highlighting limitations in consistency and user experience when switching between models. These findings emphasize the importance of model-aware tool design to manage output variability, cost, and user trust.
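The metrics above can be reproduced in spirit with standard libraries; the sketch below compares two model outputs on SBERT cosine similarity, BERTScore F1, and BLEU. The embedding model and BLEU smoothing choices are assumptions, not the paper's exact setup.

from sentence_transformers import SentenceTransformer, util
from bert_score import score as bert_score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def compare_outputs(text_a: str, text_b: str) -> dict:
    """Compare two generated PR descriptions on semantic and lexical metrics."""
    # Semantic similarity from sentence embeddings (model choice is illustrative).
    sbert = SentenceTransformer("all-MiniLM-L6-v2")
    emb = sbert.encode([text_a, text_b], convert_to_tensor=True)
    cosine = util.cos_sim(emb[0], emb[1]).item()

    # Token-level semantic overlap (BERTScore F1).
    _, _, f1 = bert_score([text_a], [text_b], lang="en")

    # Surface n-gram overlap (BLEU), which is sensitive to wording and style.
    bleu = sentence_bleu([text_b.split()], text_a.split(),
                         smoothing_function=SmoothingFunction().method1)

    return {"sbert_cosine": cosine, "bertscore_f1": f1.item(), "bleu": bleu}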

Third, based on these results, we conducted a controlled experiment with 38 software professionals assessing LLM-generated PR descriptions. Participants compared Original (O) human-written descriptions with Degraded (D), Improved from Original (IO), and Improved from Degraded (ID) versions produced through a multi-stage LLM pipeline. Surprisingly, the ID variants were rated significantly higher than both D (p-value=0.029) and O (p-value=0.026), largely due to the added structure and clarity introduced by the models. However, IO variants were not preferred at a statistically significant level (p-value>0.05) and often received negative feedback for verbosity and a generic “AI tone.” These results suggest that LLMs can enhance clarity and structure but must be guided carefully to avoid output homogenization and loss of human-authored nuance.
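The abstract does not name the statistical test behind these p-values; one plausible way to run such a pairwise comparison of per-participant ratings is a Wilcoxon signed-rank test, sketched below as an assumption rather than the study's actual procedure.

from scipy.stats import wilcoxon

def compare_variants(ratings_variant_a: list[float], ratings_variant_b: list[float]):
    """Paired comparison of per-participant ratings for two PR-description variants.

    The Wilcoxon signed-rank test here is an assumed choice; the study may have
    used a different procedure. Inputs are one rating per participant per variant.
    """
    statistic, p_value = wilcoxon(ratings_variant_a, ratings_variant_b)
    return statistic, p_value

# Example call (rating lists are placeholders, not study data):
# stat, p = compare_variants(ratings_ID, ratings_O)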

We conclude by presenting a proof-of-concept system, developed in collaboration with Mercedes-Benz Tech Innovation GmbH, that integrates these insights into a practical toolchain. Our findings advocate for human-centered, model-aware LLM integration strategies to support scalable, consistent, and competency-aligned code review processes in industry.


Thu 2 Oct

Displayed time zone: Hawaii

13:50 - 14:50: Program Comprehension and Review 1 (Kaiulani II)
13:50 (15m) Talk
When Retriever Meets Generator: A Joint Model for Code Comment Generation
ESEM - Emerging Results and Vision Track
Tien L. T. Pham (Hanoi University of Science and Technology), Anh M. T. Bui (Hanoi University of Science and Technology), Huy N. D. Pham (AI Young Talent Academy (AI4Life), Hanoi University of Science and Technology), Alessio Bucaioni (Malardalen University), Phuong T. Nguyen (University of L’Aquila)
Pre-print
14:05 (15m) Talk
From Assessment to Enhancement of Pull Requests at Scale: Aligning Code Reviews with Developer Competencies Using Large Language Models
ESEM - Industry, Government, and Community Track
Luca Mariotto (Hasso-Plattner Institute), Christian Medeiros Adriano (Hasso Plattner Institute, University of Potsdam), René Eichhorn (Mercedes-Benz Tech Innovation), Daniel Burgstahler (Mercedes-Benz Tech Innovation), Holger Giese (Hasso Plattner Institute, University of Potsdam)
14:20 (15m) Talk
Rethinking Code Review Workflows with LLM Assistance: An Empirical Study
ESEM - Industry, Government, and Community Track
Fannar Steinn Aðalsteinsson (WirelessCar Sweden AB & Chalmers University of Technology), Björn Borgar Magnússon (WirelessCar Sweden AB), Mislav Milicevic (WirelessCar Sweden AB), Adam Nirving Davidsson (WirelessCar Sweden AB), Chih-Hong Cheng (Carl von Ossietzky Universität Oldenburg & Chalmers University of Technology)
14:35 (15m) Talk
Interrogative Comments Posed by Review Comment Generators: An Empirical Study of Gerrit
ESEM - Technical Track
Farshad Kazemi (University of Waterloo), Maxime Lamothe (Polytechnique Montreal), Shane McIntosh (University of Waterloo)
Pre-print