How Effective Do Code Language Models Understand Poor-Readability Code? (ASE 2024 - Research Papers)

Who

Chao Hu, Yitian Chai, Hao Zhou, Fandong Meng, Jie Zhou, Xiaodong Gu

Track

ASE 2024 Research Papers

Time Zone

The program is currently displayed in (GMT-07:00) Pacific Time (US & Canada).

Use conference time zone: (GMT-07:00) Pacific Time (US & Canada)Select other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

By setting a time band, the program will dim events that are outside this time window. This is useful for (virtual) conferences with a continuous program (with repeated sessions).
The time band will also limit the events that are included in the personal iCalendar subscription service.

Display full programSpecify a time band

Save

When

Tue 29 Oct 2024 13:30 - 13:45 at Camellia - LLM for SE 1 Chair(s): Chengcheng Wan

Abstract

Code language models such as CodeT5 and CodeLlama have demonstrated substantial achievement in code comprehension. While the majority of research efforts have focused on improving model architectures and training processes, we find that the current benchmarks used for evaluating code comprehension models are confined to high-readability code, regardless of the popularity of low-readability code in reality. As such, they are inadequate to demonstrate the fine-grained ability of models, particularly the robustness to varying readability degrees. In this paper, we comprehensively analyze the robustness of code summarization models to code with varying readability, including seven obfuscated datasets derived from existing benchmarks. Our findings indicate that current code comprehension models are sensitive to code with varying readability. In particular, their performance predominantly depends on semantic cues within the code, often neglecting the syntactic aspects. Existing benchmarks are biased toward evaluating semantic features, thereby overlooking the models’ ability to understand non-sensitive syntactic features. Based on the findings, we present R-CodeSumEval, a new evaluation benchmark on code summarization tasks. R-CodeSumEval innovatively introduces readability into the testing process, considering semantic, syntactic, and their cross-obfuscation, thereby providing a more comprehensive and rigorous evaluation of code summarization models. Our studies also provide more insightful suggestions for future research, such as constructing new benchmarks to evaluate the robustness of models on poor-readability code, proposing readability-awareness metrics, and automatic methods for code data cleaning and normalization.

Chao Hu

Shanghai Jiao Tong University

Yitian Chai

School of Software, Shanghai Jiao Tong University

Hao Zhou

Pattern, Recognition Center, WeChat, Tencent

Fandong Meng

WeChat AI, Tencent

Jie Zhou

Tencent

Xiaodong Gu