ICSE 2026
Sun 12 - Sat 18 April 2026 Rio de Janeiro, Brazil

Automated program repair (APR) aims to autonomously fix software bugs, yet its effectiveness is hampered by the scarcity of the diverse, real-world bug datasets essential for model training. Although combining large-scale mining with human curation can yield such datasets, the associated costs limit scalability. To address this, we introduce a novel, scalable synthetic data pipeline that leverages large language models (LLMs) to generate synthetic bugs through targeted LLM-based code rewriting. Our pipeline also synthesizes valuable intermediate repair steps, enriching the training signal toward correct fixes. Using our method, we create SWE-Synth, a large and contextually rich dataset of bug-fix pairs that are natural, scalable, automatically verifiable, and accompanied by intermediate repair steps. Training LLMs on our synthetic dataset yields context-aware repair strategies that achieve repair accuracy equivalent to models trained on manually curated datasets from GitHub, such as SWE-Gym, while delivering superior scalability through effortless bug synthesis, as demonstrated on popular benchmarks (SWE-Bench and BugsInPy).