FuseApplyBench: Multilingual Benchmark for Trustworthy Code Edit Applying Task
With the rise of Language Models (LMs) and Large Language Models (LLMs), their potential for code editing (CE) has gained attention. A common approach is to have LLMs generate draft code modifications, which are then refined and applied by smaller LMs in a subsequent Code Editing Apply (CEA) task. However, the CEA task is prone to errors, and existing benchmarks do not systematically evaluate LLM performance in handling these issues. To address this, we introduce FuseApplyBench, a benchmark designed to evaluate LLM performance across three major error types in CEA tasks. On top of FuseApplyBench's pipeline, we collect datasets for fine-tuning a model that improves the reliability of applied code modifications (denoted FuseApply-7B). We benchmark FuseApply-7B, four widely used open-source LLMs, and Kortix-FastApply-7B on FuseApplyBench. Results show that FuseApply-7B significantly improves trustworthiness and accuracy metrics, while the other models show weaker performance, highlighting opportunities for advancing LLM applications in CE.
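To make the CEA setup concrete, the sketch below illustrates the workflow the abstract describes: a large model proposes a draft edit that elides unchanged context, and a smaller apply model must merge that draft back into the original file. This is a minimal, hypothetical illustration only; the file contents, the prompt format, and the naive_apply stand-in are assumptions for exposition, not the paper's pipeline or the FuseApply-7B model.

```python
# Minimal sketch of a Code Editing Apply (CEA) step (hypothetical, not the paper's API).
# A large model emits a draft edit with an "existing code" marker; a smaller apply
# model is expected to reconstruct the full, correctly merged file.

ORIGINAL = """\
def greet(name):
    print("Hello, " + name)

def farewell(name):
    print("Bye, " + name)
"""

# Draft edit from the large model: only the changed function is spelled out.
DRAFT_EDIT = """\
# ... existing code ...

def farewell(name):
    print(f"Goodbye, {name}!")
"""

def build_apply_prompt(original: str, draft: str) -> str:
    """Assemble the input an apply model would receive (illustrative format)."""
    return (
        "Merge the draft edit into the original file and return the full file.\n"
        "<original>\n" + original + "</original>\n"
        "<draft>\n" + draft + "</draft>\n"
    )

def naive_apply(original: str, draft: str) -> str:
    """Toy stand-in for the apply model: replace any block the draft redefines,
    keep everything else unchanged. Real apply models must also handle the
    error cases a CEA benchmark targets (misplaced anchors, missing context,
    conflicting hunks), which this sketch does not."""
    merged = original
    for block in draft.split("\n\n"):
        block = block.strip()
        if not block or block.startswith("#"):
            continue  # skip the "... existing code ..." marker
        header = block.splitlines()[0]  # e.g. "def farewell(name):"
        if header in merged:
            start = merged.index(header)
            end = merged.find("\n\n", start)
            end = len(merged) if end == -1 else end
            merged = merged[:start] + block + merged[end:]
    return merged

if __name__ == "__main__":
    print(build_apply_prompt(ORIGINAL, DRAFT_EDIT))
    print(naive_apply(ORIGINAL, DRAFT_EDIT))
```

The toy merge relies on exact header matches, which is precisely where real apply models fail; a benchmark for the CEA task exercises those failure modes rather than this happy path.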
Sat 28 Jun. Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna.
09:00 - 10:30 | Trustworthy AI for Code (EXPRESS) at Cosmos 3B. Chair(s): Peng Di (Ant Group & UNSW Sydney), Puzhuo Liu (Ant Group & Tsinghua University)

09:00 (10m) Day opening: Opening and Welcome (EXPRESS)
09:10 (60m) Keynote: Human-like AI Auditor for Code Repositories (EXPRESS). Xiangyu Zhang, Purdue University
10:10 (20m) Talk: FuseApplyBench: Multilingual Benchmark for Trustworthy Code Edit Applying Task (EXPRESS). Ming Liang (Ant Group), Qingyu Zhang (The University of Hong Kong), Zhipeng Zuo (Ant Group), Shaoqiang Zheng (Ant Group), Dajun Chen (Ant Group), Wei Jiang (Ant Group), Yong Li (Ant Group)
Cosmos 3B is the second room in the Cosmos 3 wing.
When facing the main Cosmos Hall, access to the Cosmos 3 wing is on the left, close to the stairs. The area is reached through a large door marked "3", which will stay open during the event.