Enhancing Pull Request Reviews: Leveraging Large Language Models to Detect Inconsistencies Between Issues and Pull Requests
Context: An efficient Pull Request (PR) review process is critical in software development. Part of this process is checking the alignment between a PR and its corresponding issue. Traditional manual PR review often struggles to identify inconsistencies between the improvements or fixes outlined in an issue and the changes actually proposed in the PR, and this gap can let inconsistencies slip through the PR acceptance process.
Objective: We aim to enhance the PR review process by leveraging modern large language models (LLMs) to detect inconsistencies between issue descriptions and the code changes in submitted PRs.
Method: We manually labeled a statistically significant sample of PRs from the Transformers repository to assess their alignment with the corresponding issue descriptions. Each PR was categorized into one of four groups: exact, missing, tangling, or missing and tangling. This labeled dataset served as the benchmark for evaluating four widely used models: Llama-3.1-70B-Instruct, Llama-3.1-405B-Instruct, GPT-4o, and GPT-4o mini. The models were tested with three distinct prompts designed to capture different aspects of issues and PRs. Each model was tasked with identifying missing and tangled changes, and its outputs were compared against the manually labeled data to assess accuracy and reliability.
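As a concrete illustration, the sketch below shows how the prompt configuration combining issue text, PR text, and PR diff might be issued to a model through the OpenAI Python client. This is a minimal sketch under stated assumptions: the prompt wording, the classify_alignment helper, and the label strings are illustrative, not the study's exact prompts.

```python
# Minimal sketch of one prompt configuration (issue text + PR text + PR diff).
# The prompt wording and helper name are illustrative; the abstract does not
# publish the study's exact prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["exact", "missing", "tangling", "missing and tangling"]

def classify_alignment(issue_text: str, pr_text: str, pr_diff: str,
                       model: str = "gpt-4o-mini") -> str:
    """Ask the model to label a PR-issue pair with one alignment category."""
    prompt = (
        "You review pull requests. Given an issue and a pull request, decide "
        "whether the PR changes align with the issue.\n"
        f"Answer with exactly one of: {', '.join(LABELS)}.\n\n"
        f"Issue:\n{issue_text}\n\n"
        f"PR description:\n{pr_text}\n\n"
        f"PR diff:\n{pr_diff}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output, so the evaluation is repeatable
    )
    return response.choices[0].message.content.strip().lower()
```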
Results: Manual labeling of the stratified sample from the Transformers repository revealed the following distribution of PR-issue alignments: 68.04% exact, 16.50% missing, 13.40% tangling, and 2.06% both missing and tangling. A strong correlation was observed between PR merge status and exact alignment: 75.46% of merged PRs were classified as exact, compared to only 29.03% of unmerged PRs. These findings highlight opportunities for improving the current code review process. For automated classification, the most effective prompt configuration combined the issue text, PR text, and PR diff, enabling better detection of alignment inconsistencies. Among the models tested, GPT-4o and Llama-3.1-405B-Instruct delivered the highest performance, achieving the best weighted F1 scores of 0.5948 and 0.6190, respectively.
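The reported weighted F1 score can be reproduced by comparing model predictions against the manual labels, for example with scikit-learn. The gold and pred lists below are toy values for illustration, not the study's data.

```python
# Minimal sketch of the evaluation step: weighted F1 over the four
# alignment categories. Toy labels only; not the study's dataset.
from sklearn.metrics import f1_score

LABELS = ["exact", "missing", "tangling", "missing and tangling"]

gold = ["exact", "exact", "missing", "tangling", "exact", "missing and tangling"]
pred = ["exact", "missing", "missing", "tangling", "exact", "tangling"]

# "weighted" averages per-class F1 weighted by class support, which suits
# the skewed label distribution (68.04% of sampled pairs are "exact").
score = f1_score(gold, pred, labels=LABELS, average="weighted")
print(f"F1 (weighted): {score:.4f}")
```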
Conclusion: Despite the notable correlation between PR merge status and exact alignment, our analysis revealed that merged PRs can still contain inconsistencies such as missing or tangled changes. While the tested LLMs showed potential for automating PR-issue alignment checks, their current performance is limited, underscoring the need for further refinement to improve their accuracy and reliability. Improved LLM-based tools could streamline the PR review process, reducing manual effort and enhancing code quality.
Mon 28 Apr (displayed time zone: Eastern Time, US & Canada)

Session 09:00 - 10:30
- 09:00 (60m, Keynote): "Keynote: Trust No Bot? Forging Confidence in AI for Software Engineering". Keynotes. Thomas Zimmermann (University of California, Irvine).
- 10:00 (12m, Long paper): "AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology". Research Papers. Minh Nguyen Huynh, Thang Phan Chau, and Phong X. Nguyen (FPT Software AI Center); Nghi D. Q. Bui (Salesforce Research).
- 10:12 (12m, Long paper): "Enhancing Pull Request Reviews: Leveraging Large Language Models to Detect Inconsistencies Between Issues and Pull Requests". Research Papers. Ali Tunahan Işık, Hatice Kübra Çağlar, and Eray Tüzün (Bilkent University).