OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution
The GitHub issue resolution task aims to automatically resolve issues reported in repositories. With advances in large language models (LLMs), this task has gained increasing attention, and several benchmarks have been proposed to evaluate the issue resolution ability of LLMs. However, existing benchmarks have three main limitations. First, current benchmarks focus on a single programming language, which limits the evaluation of issues from repositories across different languages. Second, they usually cover a narrow range of domains, which may fail to represent the diversity of real-world issues. Third, existing benchmarks rely solely on textual information in issue descriptions, overlooking multimodal information such as images in issues. In this paper, we propose OmniGIRL, a GitHub Issue ResoLution benchmark that is multilingual, multimodal, and multi-domain. OmniGIRL includes 959 task instances, which are collected from repositories across four programming languages (i.e., Python, JavaScript, TypeScript, and Java) and eight different domains. Our evaluation shows that current LLMs perform poorly on OmniGIRL. Notably, the best-performing model, GPT-4o, resolves only 8.6% of the issues. In addition, we find that current LLMs struggle to resolve issues that require understanding images. On issues with image information, the best performance is achieved by Claude-3.5-Sonnet, which resolves only 10.5% of them. Finally, we analyze the reasons behind current LLMs’ failure on OmniGIRL, providing insights for future improvements.
Thu 26 Jun. Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna
16:00 - 17:15 | Code Generation with LLMs | Research Papers at Cosmos 3C | Chair(s): Yutian Tang (University of Glasgow, United Kingdom)
16:00 | 25m | Talk | OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution | Research Papers | Lianghong Guo (Sun Yat-sen University), Wei Tao (Independent Researcher), Runhan Jiang (Sun Yat-sen University), Yanlin Wang (Sun Yat-sen University), Jiachi Chen (Sun Yat-sen University), Xilin Liu (Huawei Cloud), Yuchi Ma (Huawei Cloud Computing Technologies), Mingzhi Mao (Sun Yat-sen University), Hongyu Zhang (Chongqing University), Zibin Zheng (Sun Yat-sen University)
16:25 | 25m | Talk | ConTested: Consistency-Aided Tested Code Generation with LLM | Research Papers | Jinhao Dong (Peking University), Jun Sun (Singapore Management University), Wenjie Zhang (National University of Singapore), Jin Song Dong (National University of Singapore), Dan Hao (Peking University)
16:50 | 25m | Talk | Causality-Aided Evaluation and Explanation of Large Language Model-based Code Generation | Research Papers | Zhenlan Ji (The Hong Kong University of Science and Technology), Pingchuan Ma (HKUST), Li Zongjie (Hong Kong University of Science and Technology), Zhaoyu Wang (HKUST), Shuai Wang (Hong Kong University of Science and Technology)
Cosmos 3C is the third room in the Cosmos 3 wing.
When facing the main Cosmos Hall, the entrance to the Cosmos 3 wing is on the left, close to the stairs. The wing is reached through a large door marked with the number “3”, which will stay open during the event.