GrammarT5: Grammar-Integrated Pretrained Encoder-Decoder Neural Model for Code
Pretrained models for code have exhibited promising performance across various code-related tasks, such as code summarization, code completion, code translation, and bug detection. However, despite their success, the majority of current models still represent code as a token sequence, which may not adequately capture the essence of the underlying code structure.
In this work, we propose GrammarT5, a grammar-integrated encoder-decoder pretrained neural model for code. GrammarT5 employs a novel grammar-integrated representation, Tokenized Grammar Rule Sequence (TGRS), for code. TGRS is constructed from the grammar rule sequence used in syntax-guided code generation and integrates syntax information with code tokens within an appropriate input length. Furthermore, we suggest attaching language flags to help GrammarT5 differentiate between the grammar rules of different programming languages. Finally, we introduce two novel pretraining tasks, Edge Prediction (EP) and Sub-Tree Prediction (STP), for GrammarT5 to learn syntactic information.
Experiments on five code-related tasks across eleven datasets show that GrammarT5 achieves state-of-the-art performance on most tasks compared with models of the same scale. The paper further shows that the proposed pretraining tasks and language flags help GrammarT5 better capture the syntax and semantics of code.
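To make the representation concrete, the sketch below illustrates how a grammar-rule derivation of a tiny expression could be linearized into a single sequence that mixes rule tokens with code tokens and is prefixed with a language flag. This is a minimal illustration under assumed rule names and token formats; the Rule class, the to_tgrs helper, and the <java> flag are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch of the TGRS idea: linearize a derivation into one token
# sequence. Rule names, token formats, and the flag syntax are assumptions.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Rule:
    lhs: str                  # non-terminal being expanded
    rhs: Tuple[str, ...]      # symbols it expands to
    terminal: bool = False    # True when rhs is a concrete code token

# Pre-order derivation of the expression "x + 1" under a toy grammar.
derivation = [
    Rule("expr", ("expr", "+", "expr")),
    Rule("expr", ("name",)),
    Rule("name", ("x",), terminal=True),
    Rule("expr", ("number",)),
    Rule("number", ("1",), terminal=True),
]

def to_tgrs(rules, lang_flag="<java>"):
    """Linearize a derivation into one sequence of rule tokens and code tokens."""
    seq = [lang_flag]                                   # language flag first
    for r in rules:
        if r.terminal:
            seq.extend(r.rhs)                           # keep the code token itself
        else:
            seq.append(f"{r.lhs}->{'_'.join(r.rhs)}")   # one token per grammar rule
    return seq

print(to_tgrs(derivation))
# ['<java>', 'expr->expr_+_expr', 'expr->name', 'x', 'expr->number', '1']
```

Collapsing each non-terminal expansion into a single rule token keeps the sequence close to the length of a plain token sequence while still exposing the shape of the syntax tree, which is the trade-off the abstract refers to as keeping syntax information "within an appropriate input length".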
Fri 19 Apr (displayed time zone: Lisbon)
14:00 - 15:30 | Language Models and Generated Code 3 | Research Track / Demonstrations | Almada Negreiros | Chair(s): Jie M. Zhang (King's College London)
14:00 (15m, Talk) | CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models | Research Track | Hao Yu (Peking University), Bo Shen (Huawei Cloud Computing Technologies Co., Ltd.), Dezhi Ran (Peking University), Jiaxin Zhang (Huawei Cloud Computing Technologies Co., Ltd.), Qi Zhang (Huawei Cloud Computing Technologies Co., Ltd.), Yuchi Ma (Huawei Cloud Computing Technologies Co., Ltd.), Guangtai Liang (Huawei Cloud Computing Technologies), Ying Li (School of Software and Microelectronics, Peking University, Beijing, China), Qianxiang Wang (Huawei Technologies Co., Ltd.), Tao Xie (Peking University)
14:15 (15m, Talk) | Inferring Data Preconditions from Deep Learning Models for Trustworthy Prediction in Deployment | Research Track | Shibbir Ahmed (Iowa State University), Hongyang Gao (Dept. of Computer Science, Iowa State University), Hridesh Rajan (Iowa State University)
14:30 (15m, Talk) | GrammarT5: Grammar-Integrated Pretrained Encoder-Decoder Neural Model for Code | Research Track | Qihao Zhu (Peking University), Qingyuan Liang (Peking University), Zeyu Sun (Institute of Software, Chinese Academy of Sciences), Yingfei Xiong (Peking University), Lu Zhang (Peking University), Shengyu Cheng (ZTE Corporation)
14:45 (15m, Talk) | On Calibration of Pre-trained Code Models | Research Track
15:00 (15m, Talk) | Learning in the Wild: Towards Leveraging Unlabeled Data for Effectively Tuning Pre-trained Code Models | Research Track | Shuzheng Gao, Wenxin Mao (Harbin Institute of Technology), Cuiyun Gao (Harbin Institute of Technology), Li Li (Beihang University), Xing Hu (Zhejiang University), Xin Xia (Huawei Technologies), Michael Lyu (The Chinese University of Hong Kong)
15:15 (7m, Talk) | GitHubInclusifier: Finding and fixing non-inclusive language in GitHub Repositories | Demonstrations | Liam Todd (Monash University), John Grundy (Monash University), Christoph Treude (Singapore Management University)