OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution (ISSTA 2025 - Research Papers)

Who

Lianghong Guo, Wei Tao, Runhan Jiang, Yanlin Wang, Jiachi Chen, Xilin Liu, Yuchi Ma, Mingzhi Mao, Hongyu Zhang, Zibin Zheng

Track

ISSTA 2025 Research Papers

Time Zone

The program is currently displayed in (GMT+02:00) Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Thu 26 Jun 2025 16:00 - 16:25 at Cosmos 3C - Code Generation with LLMs Chair(s): Yutian Tang

Abstract

The GitHub issue resolution task aims to automatically resolve issues reported in repositories. With advances in large language models (LLMs), this task has gained increasing attention, and several benchmarks have been proposed to evaluate the issue resolution ability of LLMs. However, existing benchmarks have three main limitations. First, they focus on a single programming language, which prevents evaluation on issues from repositories in other languages. Second, they usually cover a narrow range of domains and may therefore fail to represent the diversity of real-world issues. Third, they rely solely on the textual information in issue descriptions, overlooking multimodal information such as images. In this paper, we propose OmniGIRL, a GitHub Issue ResoLution benchmark that is multilingual, multimodal, and multi-domain. OmniGIRL includes 959 task instances collected from repositories across four programming languages (Python, JavaScript, TypeScript, and Java) and eight different domains. Our evaluation shows that current LLMs perform poorly on OmniGIRL. Notably, the best-performing model, GPT-4o, resolves only 8.6% of the issues. We also find that current LLMs struggle to resolve issues that require understanding images: the best performance is achieved by Claude-3.5-Sonnet, which resolves only 10.5% of the issues with image information. Finally, we analyze the reasons behind current LLMs’ failures on OmniGIRL, providing insights for future improvements.

DOI

https://doi.org/10.1145/3728871

Lianghong Guo

Sun Yat-sen University

China

Wei Tao

Independent Researcher

China

Runhan Jiang

Sun Yat-sen University

Yanlin Wang

Sun Yat-sen University

China

Jiachi Chen

Sun Yat-sen University

China

Xilin Liu

Huawei Cloud

Yuchi Ma

Huawei Cloud Computing Technologies

China

Mingzhi Mao

Sun Yat-sen University

Hongyu Zhang

Chongqing University

China

Zibin Zheng

Sun Yat-sen University

China

Session Program

Thu 26 Jun
Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

16:00 - 17:15	Code Generation with LLMs (Research Papers) at Cosmos 3C Chair(s): Yutian Tang (University of Glasgow, United Kingdom)

16:00 25m Talk		OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution (Research Papers). Lianghong Guo (Sun Yat-sen University), Wei Tao (Independent Researcher), Runhan Jiang (Sun Yat-sen University), Yanlin Wang (Sun Yat-sen University), Jiachi Chen (Sun Yat-sen University), Xilin Liu (Huawei Cloud), Yuchi Ma (Huawei Cloud Computing Technologies), Mingzhi Mao (Sun Yat-sen University), Hongyu Zhang (Chongqing University), Zibin Zheng (Sun Yat-sen University)
16:25 25m Talk		ConTested: Consistency-Aided Tested Code Generation with LLM (Research Papers). Jinhao Dong (Peking University), Jun Sun (Singapore Management University), Wenjie Zhang (National University of Singapore), Jin Song Dong (National University of Singapore), Dan Hao (Peking University)
16:50 25m Talk		Causality-Aided Evaluation and Explanation of Large Language Model-based Code Generation (Research Papers). Zhenlan Ji (The Hong Kong University of Science and Technology), Pingchuan Ma (HKUST), Li Zongjie (Hong Kong University of Science and Technology), Zhaoyu Wang (HKUST), Shuai Wang (Hong Kong University of Science and Technology)

Information for Participants

Thu 26 Jun 2025 16:00 - 17:15 at Cosmos 3C - Code Generation with LLMs Chair(s): Yutian Tang

Info for room Cosmos 3C:

Cosmos 3C is the third room in the Cosmos 3 wing.

When facing the main Cosmos Hall, access to the Cosmos 3 wing is on the left, close to the stairs. The area is accessed through a large door with the number “3”, which will stay open during the event.