Revisiting SWE-Bench: On the Importance of Data Quality for LLM-based Code Models
This program is tentative and subject to change.
Thu 1 May 2025 12:06 - 12:12 at 204 - ACM Student Research Presentations
The use of Large Language Models (LLMs) for code generation is a rapidly growing field that has gained substantial traction within software engineering. Ensuring the reliability and accuracy of generated code, however, requires robust evaluation frameworks. To address this need, Jimenez et al. introduced the SWE-bench dataset, which consists of 2,294 GitHub issues paired with their corresponding pull requests, collected from 12 prominent Python repositories. The dataset has become a key benchmark for evaluating code generation models, with resolution rates prominently featured on the SWE-bench leaderboard. Despite its widespread adoption, it has yet to undergo a systematic reliability assessment. Motivated by this gap, we conducted the first empirical study of the reliability of SWE-bench, asking whether it provides meaningful and realistic model evaluations. We centered our analysis on the highest-performing model reported on the leaderboard at the time of the study, SWE-Agent + GPT-4, and compared its generated patches against the corresponding pull requests from the dataset. Our findings revealed two key issues: (1) 32.67% of the successful cases were affected by solution leakage, where the fix is revealed directly in the issue report or its comments, and (2) 31.08% succeeded only because of weak test cases that do not adequately verify the fix. When these problematic instances were excluded, the resolution rate of SWE-Agent + GPT-4 dropped from 12.47% to 3.97%.
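To make the first failure mode concrete, below is a minimal sketch of the kind of check a leakage audit could run: it flags a SWE-bench instance when most of the lines added by the gold patch already appear verbatim in the issue text. The field names (`problem_statement`, `hints_text`, `patch`, `instance_id`) follow the published SWE-bench schema on Hugging Face, but the overlap heuristic, the `looks_leaked` helper, and its 50% threshold are illustrative assumptions, not the exact procedure used in the study.

```python
from datasets import load_dataset  # pip install datasets

def added_lines(patch: str) -> list[str]:
    """Collect the non-trivial source lines added by a unified diff."""
    return [
        line[1:].strip()
        for line in patch.splitlines()
        if line.startswith("+")
        and not line.startswith("+++")   # skip the diff file header
        and len(line[1:].strip()) > 10   # skip blanks and tiny fragments
    ]

def looks_leaked(issue_text: str, gold_patch: str, threshold: float = 0.5) -> bool:
    """Assumed heuristic: flag an instance if at least `threshold` of the
    gold patch's added lines already appear verbatim in the issue report."""
    adds = added_lines(gold_patch)
    if not adds:
        return False
    hits = sum(1 for line in adds if line in issue_text)
    return hits / len(adds) >= threshold

# Scan the public SWE-bench test split for suspicious instances.
ds = load_dataset("princeton-nlp/SWE-bench", split="test")
flagged = [
    ex["instance_id"]
    for ex in ds
    if looks_leaked(ex["problem_statement"] + "\n" + (ex["hints_text"] or ""),
                    ex["patch"])
]
print(f"{len(flagged)} of {len(ds)} instances flagged for possible leakage")
```

Excluding flagged instances and recomputing the share of resolved tasks would give an adjusted resolution rate in the spirit of the 12.47% to 3.97% drop reported above; a real audit would still need manual inspection, since verbatim overlap can both over- and under-approximate leakage.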
Tue 29 Apr (displayed time zone: Eastern Time, US & Canada)
14:00 - 15:30 | ACM Student Research Posters and Judging 3 | SRC - ACM Student Research Competition | Canada Hall 3 Poster Area

| Time | Length | Type | Title | Speaker |
|------|--------|------|-------|---------|
| 14:00 | 90m | Talk | Revisiting SWE-Bench: On the Importance of Data Quality for LLM-based Code Models | Reem Aleithan (York University, Canada) |
| 14:00 | 90m | Talk | On the Fly Input Refinement for Code Language Models | Ravishka Shemal Rathnasuriya (University of Texas at Dallas) |
| 14:00 | 90m | Talk | Program Feature-based Fuzzing Benchmarking | Miao Miao (The University of Texas at Dallas) |
| 14:00 | 90m | Talk | On the Automation of Code Review Tasks Through Cross-Task Knowledge Distillation | Oussama Ben Sghaier (DIRO, Université de Montréal) |
Thu 1 May (displayed time zone: Eastern Time, US & Canada)
11:00 - 12:30 | ACM Student Research Presentations | SRC - ACM Student Research Competition | 204

A subset of the ACM SRC finalists will give short presentations in this session. The decision about who will present will be made after the poster sessions, and this schedule will be updated, so do not rely on the precise timing until just before the session. All finalists also have posters in the Canada Hall 3 Poster Area, with judging on Tuesday. Awards will be announced at the banquet on Thursday evening.
| Time | Length | Type | Title | Speaker |
|------|--------|------|-------|---------|
| 11:00 | 6m | Talk | Automatic Fuzz Drivers for JavaScript with Type Distributions | Mayant Mukul (University of British Columbia) |
| 11:06 | 6m | Talk | CASS: Context-Aware Slice Summarization for Debugging Regression Failures | Sahar Badihi (University of British Columbia, Canada) |
| 11:12 | 6m | Talk | Characterising Algorithm Debt in Machine and Deep Learning Systems | Emmanuel Iko-Ojo Simon (Australian National University) |
| 11:18 | 6m | Talk | Consistent Graph Model Generation with Large Language Models | Boqi Chen (McGill University) |
| 11:24 | 6m | Talk | Enhancing OSS Remediation with Patch Backporting | Lyuye Zhang (Nanyang Technological University) |
| 11:30 | 6m | Talk | Identifying Performance-Sensitive Configurations in Software Systems with LLM-Driven Agents | Zehao Wang (Concordia University) |
| 11:36 | 6m | Talk | Improving Formal Methods Visualizations | Avinash Palliyil (Georgia Institute of Technology) |
| 11:42 | 6m | Talk | MUARF: Leveraging Multi-Agent Workflows for Automated Code Refactoring | Yisen Xu (Software Performance, Analysis, and Reliability (SPEAR) Lab, Concordia University, Montreal, Canada) |
| 11:48 | 6m | Talk | On the Automation of Code Review Tasks Through Cross-Task Knowledge Distillation | Oussama Ben Sghaier (DIRO, Université de Montréal) |
| 11:54 | 6m | Talk | On the Fly Input Refinement for Code Language Models | Ravishka Shemal Rathnasuriya (University of Texas at Dallas) |
| 12:00 | 6m | Talk | Program Feature-based Fuzzing Benchmarking | Miao Miao (The University of Texas at Dallas) |
| 12:06 | 6m | Talk | Revisiting SWE-Bench: On the Importance of Data Quality for LLM-based Code Models | Reem Aleithan (York University, Canada) |
| 12:12 | 6m | Talk | The Balancing Act of Policies in Developing Machine Learning Explanations | Jacob Tjaden (Colby College) |
| 12:18 | 6m | Talk | To Mock or Not to Mock: Divergence in Mocking Practices Between LLM and Developers | Hanbin Qin (Stevens Institute of Technology) |
| 12:24 | 6m | Talk | Towards Compatibly Mitigating Technical Lag in Maven Projects | Rui Lu (East China Normal University) |