FORGE 2025
Sun 27 - Mon 28 April 2025 Ottawa, Ontario, Canada
co-located with ICSE 2025

This program is tentative and subject to change.

Context: An efficient Pull Request (PR) review process is critical in software development. Part of this process is checking the alignment between PRs and their corresponding issues. Traditional manual PR review often struggles to identify inconsistencies between the improvements or fixes described in issues and the changes actually proposed in PRs. Such mismatches can go unnoticed during the PR acceptance process.

Objective: We aim to enhance the PR review process by leveraging modern large language models (LLMs) to detect inconsistencies between issue descriptions and code changes in submitted PRs.

Method: We manually labeled a statistically significant sample of PRs from the Transformers repository to assess their alignment with corresponding issue descriptions. Each PR was categorized into one of four groups: exact, missing, tangling, or missing and tangling. This labeled dataset served as the benchmark for evaluating the performance of four widely used models: Llama-3.1-70B-Instruct, Llama-3.1-405B-Instruct, GPT-4o, and GPT-4o mini. The models were tested using three distinct prompts designed to capture different aspects of issues and PRs. Each model was tasked with identifying tangled and missing elements, and their outputs were compared against the manually labeled data to assess their accuracy and reliability.
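The prompt configuration and answer parsing described above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual code: the function names, prompt wording, and fallback behavior are assumptions; only the three inputs (issue text, PR text, PR diff) and the four alignment labels come from the abstract.

```python
# Hypothetical sketch of the classification setup: a prompt combining issue
# text, PR text, and PR diff, plus a parser mapping a model's free-text
# answer onto the four alignment labels. Names and wording are illustrative.

LABELS = ("exact", "missing", "tangling", "missing and tangling")

def build_prompt(issue_text: str, pr_text: str, pr_diff: str) -> str:
    """Combine issue text, PR text, and PR diff into one classification prompt."""
    return (
        "Classify the alignment between the issue and the pull request as one of: "
        + ", ".join(LABELS) + ".\n\n"
        f"Issue:\n{issue_text}\n\n"
        f"Pull request:\n{pr_text}\n\n"
        f"Diff:\n{pr_diff}\n"
    )

def parse_label(model_output: str) -> str:
    """Map the model's free-text answer onto one of the four labels."""
    text = model_output.lower()
    # Check the combined label first so "missing and tangling" is not
    # mistaken for plain "missing".
    if "missing and tangling" in text:
        return "missing and tangling"
    for label in ("missing", "tangling", "exact"):
        if label in text:
            return label
    return "exact"  # fall back to the majority class in the labeled data
```

A prompt built this way would be sent to each of the four models; the parsed labels are then compared against the manual annotations.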

Results: The manual labeling of the stratified sample from the Transformers repository revealed the following distribution of PR-issue pair alignments: 68.04% were exact, 16.50% were missing, 13.40% were tangling, and 2.06% exhibited both missing and tangling characteristics. A strong correlation was observed between PR merge status and exact alignment: 75.46% of merged PRs were classified as exact, compared to only 29.03% of unmerged PRs. These findings highlight opportunities for improving the current code review process. For automated classification, the most effective prompt configuration combined issue text, PR text, and PR diff, enabling better detection of alignment inconsistencies. Among the models tested, GPT-4o and Llama-3.1-405B-Instruct delivered the highest performance, achieving weighted F1 scores of 0.5948 and 0.6190, respectively.
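The weighted F1 scores reported above average per-class F1 weighted by each class's support in the labeled data, which matters here because the four alignment classes are heavily imbalanced. A minimal stdlib sketch of that metric (the paper's actual evaluation tooling is not specified):

```python
# Weighted F1: per-class F1 scores averaged with weights proportional to each
# class's support (count) in the ground-truth labels. Stdlib-only sketch.
from collections import Counter

def weighted_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Support-weighted average of per-class F1 scores."""
    supports = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, support in supports.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += (support / total) * f1  # weight by class support
    return score
```

With the class distribution above, a classifier that predicted only "exact" would already look reasonable on accuracy, so the support-weighted F1 gives a fairer picture of performance on the minority classes.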

Conclusion: Despite a notable correlation between PR merge status and exact alignment, our analysis revealed that merged PRs can still contain inconsistencies, such as missing or tangling changes. While the tested LLMs showed potential in automating PR-issue alignment, their current performance is limited. This underscores the need for further refinement to enhance their accuracy and reliability. Improved LLM-based tools could streamline the PR review process, reducing manual effort and enhancing code quality.


Mon 28 Apr

Displayed time zone: Eastern Time (US & Canada)

09:00 - 10:30
FORGE 2025 Keynote & Session 3: Collaborative Software Development (Keynotes / Research Papers) at 207
09:00
60m
Keynote
Keynote: Trust No Bot? Forging Confidence in AI for Software Engineering
Keynotes
Thomas Zimmermann University of California, Irvine
10:00
12m
Long-paper
AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology
Research Papers
Minh Nguyen Huynh FPT Software AI Center, Thang Phan Chau FPT Software AI Center, Phong X. Nguyen FPT Software AI Center, Nghi D. Q. Bui Salesforce Research
10:12
12m
Long-paper
Enhancing Pull Request Reviews: Leveraging Large Language Models to Detect Inconsistencies Between Issues and Pull Requests
Research Papers
Ali Tunahan Işık Bilkent University, Hatice Kübra Çağlar Bilkent University, Eray Tüzün Bilkent University