SwiftEval: Developing a Language-Specific Benchmark for LLM-generated Code Evaluation
In recent years, large language models (LLMs) have showcased significant advancements in code generation. However, most evaluation benchmarks are primarily oriented toward Python, making it difficult to evaluate other programming languages, such as Swift, with high quality. By examining widely established multilingual benchmarks such as HumanEval-XL and MultiPL-E, we identified critical issues specific to their Swift components that make them insufficient, or even irrelevant, for assessing LLM coding capabilities in Swift. Unlike these existing approaches, which prioritize rapid scaling and generalization by automatically translating Python-centric benchmarks with LLMs, we adopt a quality-over-quantity methodology. We present SwiftEval, the first Swift-oriented benchmark, consisting of 28 carefully hand-crafted problems, and evaluate 44 popular Code LLMs on it. Our experimental results demonstrate that this tailored approach provides a more accurate and nuanced evaluation of code generation, thoughtfully accounting for the distinctive features of the programming language.
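To illustrate the kind of language-specific task the abstract alludes to, the sketch below shows a hypothetical, hand-crafted problem in the spirit of SwiftEval; it is our own illustration, not drawn from the actual benchmark. The task and its checks lean on Swift-specific constructs (optionals, guard, failable-style return values, Foundation string APIs) that a prompt mechanically translated from a Python-centric benchmark would rarely exercise. The function name parseConfig and its tests are assumptions made for this example.

```swift
import Foundation

// Hypothetical SwiftEval-style problem (illustrative only, not from the paper):
// parse "key=value" lines, skipping malformed entries, and return a dictionary,
// or nil when no valid pair is found. The idiomatic solution relies on
// optionals and `guard`, features a Python-translated prompt would not target.
func parseConfig(_ lines: [String]) -> [String: String]? {
    var result: [String: String] = [:]
    for line in lines {
        // Split only on the first "=" so values may themselves contain "=".
        let parts = line.split(separator: "=", maxSplits: 1).map(String.init)
        guard parts.count == 2, !parts[0].isEmpty else { continue }
        result[parts[0].trimmingCharacters(in: .whitespaces)] =
            parts[1].trimmingCharacters(in: .whitespaces)
    }
    return result.isEmpty ? nil : result
}

// Minimal test harness of the kind a hand-crafted benchmark problem might ship.
assert(parseConfig(["host = localhost", "port = 8080", "broken"]) ==
       ["host": "localhost", "port": "8080"])
assert(parseConfig(["no separator here"]) == nil)
print("All checks passed")
```

A hand-written problem like this can be paired with Swift-native test assertions, whereas an automatically translated Python problem typically carries over Python's error-handling and typing conventions and misses such checks.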
This program is tentative and subject to change.
Sun 27 Apr | Displayed time zone: Eastern Time (US & Canada)
14:00 - 15:30

14:00 | 12m Long-paper | RepoHyper: Search-Expand-Refine on Semantic Graphs for Repository-Level Code Completion | Research Papers | Huy Nhat Phan (FPT Software AI Center), Hoang Nhat Phan (Nanyang Technological University), Tien N. Nguyen (University of Texas at Dallas), Nghi D. Q. Bui (Salesforce Research)
14:12 | 12m Long-paper | SoTaNa: An Open-Source Software Engineering Instruction-Tuned Model | Research Papers | Ensheng Shi (Xi’an Jiaotong University), Yanlin Wang (Sun Yat-sen University), Fengji Zhang (Microsoft Research Asia), Bei Chen (Microsoft Research Asia), Hongyu Zhang (Chongqing University), Yanli Wang (Sun Yat-sen University), Daya Guo (Sun Yat-sen University), Lun Du (Microsoft Research), Shi Han (Microsoft Research), Dongmei Zhang (Microsoft Research), Hongbin Sun (Xi’an Jiaotong University)
14:24 | 12m Long-paper | Automated Codebase Reconciliation using Large Language Models | Research Papers | Aneri Gandhi (University of Toronto), Sanjukta De (Advanced Micro Devices), Marsha Chechik (University of Toronto), Vinay Pandit (Advanced Micro Devices), Max Kiehn (Advanced Micro Devices), Matthieu Chan Chee (Advanced Micro Devices), Yonas Bedasso (Advanced Micro Devices)
14:36 | 12m Long-paper | AI-Powered, But Power-Hungry? Energy Efficiency of LLM-Generated Code | Research Papers | Lola Solovyeva (University of Twente), Sophie Weidmann (University of Twente), Fernando Castor (University of Twente)
14:48 | 6m Short-paper | SwiftEval: Developing a Language-Specific Benchmark for LLM-generated Code Evaluation | Data and Benchmarking
14:54 | 6m Short-paper | SE Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering | Research Papers | Zhimin Zhao (Queen's University)
15:00 | 12m Long-paper | PerfCodeGen: Improving Performance of LLM Generated Code with Execution Feedback | Research Papers | Yun Peng (The Chinese University of Hong Kong), Akhilesh Deepak Gotmare (Salesforce Research), Michael Lyu (The Chinese University of Hong Kong), Caiming Xiong (Salesforce Research), Silvio Savarese (Salesforce Research), Doyen Sahoo (Salesforce Research)
15:12 | 6m Short-paper | HyRACC: A Hybrid Retrieval-Augmented Framework for More Efficient Code Completion | Research Papers | Chuanyi Li (Nanjing University), Jiwei Shang (Nanjing University), Yi Feng (Nanjing University), Bin Luo (Nanjing University)
15:18 | 6m Short-paper | OptCodeTrans: Boost LLMs on Low-Resource Programming Language Translation | Research Papers | Jianbo Lin (Nanjing University), Yi Shen (Nanjing University), Chuanyi Li (Nanjing University), Changan Niu (Software Institute, Nanjing University), Bin Luo (Nanjing University)