Enhancing Pull Request Reviews: Leveraging Large Language Models to Detect Inconsistencies Between Issues and Pull Requests
Context: An efficient Pull Request (PR) review process is critical in software development. Part of this process is checking the alignment between a PR and its corresponding issue. Traditional manual PR review often struggles to identify inconsistencies between the improvements or fixes outlined in an issue and the changes actually proposed in the PR, and this gap can let inconsistencies slip through the PR acceptance process.
Objective: We aim to enhance the PR review process by leveraging modern large language models (LLMs) to detect inconsistencies between issue descriptions and the code changes in submitted PRs.
Method: We manually labeled a statistically significant sample of PRs from the Transformers repository to assess their alignment with the corresponding issue descriptions. Each PR was categorized into one of four groups: exact, missing, tangling, or missing and tangling. This labeled dataset served as the benchmark for evaluating four widely used models: Llama-3.1-70B-Instruct, Llama-3.1-405B-Instruct, GPT-4o, and GPT-4o mini. The models were tested with three distinct prompts designed to capture different aspects of issues and PRs. Each model was tasked with identifying missing and tangled changes, and its outputs were compared against the manually labeled data to assess accuracy and reliability.
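As a concrete illustration, the sketch below shows how the prompt configuration combining issue text, PR text, and PR diff might be issued to a model through the OpenAI Python client. This is a minimal sketch under stated assumptions: the prompt wording, the classify_alignment helper, and the label strings are illustrative, not the study's exact prompts.

```python
# Minimal sketch of one prompt configuration (issue text + PR text + PR diff).
# The prompt wording and helper name are illustrative; the abstract does not
# publish the study's exact prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["exact", "missing", "tangling", "missing and tangling"]

def classify_alignment(issue_text: str, pr_text: str, pr_diff: str,
                       model: str = "gpt-4o-mini") -> str:
    """Ask the model to label a PR-issue pair with one alignment category."""
    prompt = (
        "You review pull requests. Given an issue and a pull request, decide "
        "whether the PR changes align with the issue.\n"
        f"Answer with exactly one of: {', '.join(LABELS)}.\n\n"
        f"Issue:\n{issue_text}\n\n"
        f"PR description:\n{pr_text}\n\n"
        f"PR diff:\n{pr_diff}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output, so the evaluation is repeatable
    )
    return response.choices[0].message.content.strip().lower()
```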
Results: Manual labeling of the stratified sample from the Transformers repository revealed the following distribution of PR-issue alignments: 68.04% exact, 16.50% missing, 13.40% tangling, and 2.06% both missing and tangling. A strong correlation was observed between PR merge status and exact alignment: 75.46% of merged PRs were classified as exact, compared to only 29.03% of unmerged PRs. These findings highlight opportunities for improving the current code review process. For automated classification, the most effective prompt configuration combined the issue text, PR text, and PR diff, enabling better detection of alignment inconsistencies. Among the models tested, GPT-4o and Llama-3.1-405B-Instruct delivered the highest performance, achieving the best weighted F1 scores of 0.5948 and 0.6190, respectively.
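The reported weighted F1 score can be reproduced by comparing model predictions against the manual labels, for example with scikit-learn. The gold and pred lists below are toy values for illustration, not the study's data.

```python
# Minimal sketch of the evaluation step: weighted F1 over the four
# alignment categories. Toy labels only; not the study's dataset.
from sklearn.metrics import f1_score

LABELS = ["exact", "missing", "tangling", "missing and tangling"]

gold = ["exact", "exact", "missing", "tangling", "exact", "missing and tangling"]
pred = ["exact", "missing", "missing", "tangling", "exact", "tangling"]

# "weighted" averages per-class F1 weighted by class support, which suits
# the skewed label distribution (68.04% of sampled pairs are "exact").
score = f1_score(gold, pred, labels=LABELS, average="weighted")
print(f"F1 (weighted): {score:.4f}")
```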
Conclusion: Despite the notable correlation between PR merge status and exact alignment, our analysis revealed that merged PRs can still contain inconsistencies such as missing or tangled changes. While the tested LLMs showed potential for automating PR-issue alignment checks, their current performance is limited, underscoring the need for further refinement to improve their accuracy and reliability. Improved LLM-based tools could streamline the PR review process, reducing manual effort and enhancing code quality.
Mon 28 Apr (displayed time zone: Eastern Time, US & Canada)

Session 09:00 - 10:30
- 09:00 (60m, Keynote): "Keynote: Trust No Bot? Forging Confidence in AI for Software Engineering". Keynotes. Thomas Zimmermann (University of California, Irvine).
- 10:00 (12m, Long paper): "AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology". Research Papers. Minh Nguyen Huynh, Thang Phan Chau, and Phong X. Nguyen (FPT Software AI Center); Nghi D. Q. Bui (Salesforce Research).
- 10:12 (12m, Long paper): "Enhancing Pull Request Reviews: Leveraging Large Language Models to Detect Inconsistencies Between Issues and Pull Requests". Research Papers. Ali Tunahan Işık, Hatice Kübra Çağlar, and Eray Tüzün (Bilkent University).