ICSE 2026
Sun 12 - Sat 18 April 2026 Rio de Janeiro, Brazil

Automated program repair (APR) aims to autonomously fix software bugs, yet its effectiveness is hampered by the scarcity of the diverse, real-world bug datasets essential for model training. Although combining large-scale mining with human curation can yield such datasets, the associated costs limit scalability. To address this, we introduce a novel, scalable synthetic data pipeline that leverages large language models (LLMs) to generate synthetic bugs through targeted LLM-based code rewriting. Our pipeline also synthesizes valuable intermediate repair steps, enriching the training signal toward correct fixes. Using our method, we create SWE-Synth, a large and contextually rich dataset of bug-fix pairs that are natural, scalable, automatically verifiable, and accompanied by intermediate repair steps. Training LLMs on our synthetic dataset yields context-aware repair strategies that achieve repair accuracy equivalent to models trained on manually curated datasets from GitHub, such as SWE-Gym, while delivering superior scalability through effortless bug synthesis, as demonstrated on popular benchmarks (SWE-Bench and BugsInPy).