Learning in the Wild: Towards Leveraging Unlabeled Data for Effectively Tuning Pre-trained Code Models (ICSE 2024 - Research Track)

Fri 12 - Sun 21 April 2024 Lisbon, Portugal

Who

Shuzheng Gao, Wenxin Mao, Cuiyun Gao, Li Li, Xing Hu, Xin Xia, Michael Lyu

Track

ICSE 2024 Research Track

When

Fri 19 Apr 2024 15:00 - 15:15 at Almada Negreiros - Language Models and Generated Code 3 Chair(s): Jie M. Zhang

Abstract

Pre-trained code models have recently achieved substantial improvements in many code intelligence tasks. These models are first pre-trained on large-scale unlabeled datasets in a task-agnostic manner using self-supervised learning, and then fine-tuned on labeled datasets in downstream tasks. However, the labeled datasets are usually limited in size because labeling requires intensive human effort, which may hinder the performance of pre-trained code models in specific tasks. To mitigate this, one possible solution is to leverage large-scale unlabeled data in the tuning stage through pseudo-labeling, i.e., generating pseudo labels for unlabeled data and further training the pre-trained code models on the pseudo-labeled data. However, directly using the pseudo-labeled data can introduce a large amount of noise, i.e., incorrect labels, leading to suboptimal performance. How to effectively leverage noisy pseudo-labeled data is a challenging yet under-explored problem.

In this paper, we propose a novel approach named HINT to improve pre-trained code models with large-scale unlabeled datasets by better utilizing the pseudo-labeled data. HINT includes two main modules: HybrId pseudo-labeled data selection and Noise-tolerant Training. In the hybrid pseudo-labeled data selection module, to address the robustness issue, we not only measure the quality of pseudo labels directly through training loss but also employ a retrieval-based method to filter low-quality pseudo-labeled data. The noise-tolerant training module aims to further mitigate the influence of errors in pseudo labels by training the model with a noise-tolerant loss function and by regularizing the consistency of model predictions. We evaluate the effectiveness of HINT on three popular code intelligence tasks: code summarization, defect detection, and assertion generation. We build our method on top of three popular open-source pre-trained code models. The experimental results show that HINT can better leverage unlabeled data in a task-specific way and provide complementary benefits for pre-trained models, e.g., improving the best baseline model by 15.33%, 16.50%, and 8.98% on code summarization, defect detection, and assertion generation, respectively.
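The two ideas named in the abstract, hybrid selection of pseudo-labeled data and a noise-tolerant loss, can be illustrated with a minimal sketch for a binary classification task such as defect detection. This is not the authors' implementation. Everything below is an assumption made for illustration: the function names, the thresholds `max_loss` and `min_sim`, the use of `difflib.SequenceMatcher` as a stand-in retriever, the rule that a pseudo label is dropped when it contradicts a near-duplicate labeled example, and the choice of symmetric cross-entropy as one example of a noise-tolerant loss. Consistency regularization is not shown.

```python
import math
from difflib import SequenceMatcher


def cross_entropy(probs, label):
    """Cross-entropy of a predicted distribution against a hard label."""
    return -math.log(max(probs[label], 1e-12))


def symmetric_cross_entropy(probs, label, alpha=1.0, beta=1.0, log_zero=-4.0):
    """alpha * CE + beta * reverse CE (assumed example of a noise-tolerant loss).

    Reverse CE swaps the roles of prediction and one-hot label; log(0) on the
    one-hot side is clipped to `log_zero`.
    """
    ce = cross_entropy(probs, label)
    rce = -sum(p * log_zero for k, p in enumerate(probs) if k != label)
    return alpha * ce + beta * rce


def nearest_labeled(code, labeled_pool):
    """Return (similarity, label) of the most similar labeled input.

    Assumption: plain string similarity stands in for a real retriever.
    """
    return max(
        (SequenceMatcher(None, code, ref_code).ratio(), ref_label)
        for ref_code, ref_label in labeled_pool
    )


def select_pseudo_labeled(candidates, labeled_pool, max_loss=0.5, min_sim=0.8):
    """Hypothetical hybrid filter: a loss check followed by a retrieval check."""
    kept = []
    for code, probs in candidates:
        pseudo_label = max(range(len(probs)), key=probs.__getitem__)
        # Loss-based check: drop samples the model is not confident about.
        if cross_entropy(probs, pseudo_label) > max_loss:
            continue
        # Retrieval-based check: drop samples whose pseudo label contradicts
        # a very similar labeled example.
        sim, ref_label = nearest_labeled(code, labeled_pool)
        if sim >= min_sim and ref_label != pseudo_label:
            continue
        kept.append((code, pseudo_label))
    return kept


# Toy data: (code, label) pairs and (code, predicted probabilities) pairs.
labeled_pool = [
    ("int f(int a) { return a / 0; }", 1),
    ("int g(int a) { return a + 1; }", 0),
]
candidates = [
    ("int h(int a) { return a + 2; }", [0.9, 0.1]),
    ("int f(int a) { return a / 0; }", [0.95, 0.05]),
    ("void k() {}", [0.55, 0.45]),
]
selected = select_pseudo_labeled(candidates, labeled_pool)
```

In this toy run, only the first candidate survives: the second is rejected by the retrieval check because it duplicates a labeled example with the opposite label, and the third is rejected by the loss check. The surviving samples would then be used for further training with the noise-tolerant loss in place of plain cross-entropy.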

Shuzheng Gao
Wenxin Mao, Harbin Institute of Technology
Cuiyun Gao, Harbin Institute of Technology, China
Li Li, Beihang University, China
Xing Hu, Zhejiang University, China
Xin Xia, Huawei Technologies, China
Michael Lyu, The Chinese University of Hong Kong

Session Program

Fri 19 Apr
Displayed time zone: Lisbon

14:00 - 15:30	Language Models and Generated Code 3 (Research Track / Demonstrations) at Almada Negreiros. Chair(s): Jie M. Zhang (King's College London)

14:00 15m Talk	CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models. Research Track. Hao Yu (Peking University), Bo Shen (Huawei Cloud Computing Technologies Co., Ltd.), Dezhi Ran (Peking University), Jiaxin Zhang (Huawei Cloud Computing Technologies Co., Ltd.), Qi Zhang (Huawei Cloud Computing Technologies Co., Ltd.), Yuchi Ma (Huawei Cloud Computing Technologies CO., LTD.), Guangtai Liang (Huawei Cloud Computing Technologies), Ying Li (School of Software and Microelectronics, Peking University, Beijing, China), Qianxiang Wang (Huawei Technologies Co., Ltd), Tao Xie (Peking University)
14:15 15m Talk	Inferring Data Preconditions from Deep Learning Models for Trustworthy Prediction in Deployment. Research Track. Shibbir Ahmed (Iowa State University), Hongyang Gao (Dept. of Computer Science, Iowa State University), Hridesh Rajan (Iowa State University)
14:30 15m Talk	GrammarT5: Grammar-Integrated Pretrained Encoder-Decoder Neural Model for Code. Research Track. Qihao Zhu (Peking University), Qingyuan Liang (Peking University), Zeyu Sun (Institute of Software, Chinese Academy of Sciences), Yingfei Xiong (Peking University), Lu Zhang (Peking University), Shengyu Cheng (ZTE Corporation)
14:45 15m Talk	On Calibration of Pre-trained Code models. Research Track. Zhenhao Zhou (Fudan University), Chaofeng Sha (Fudan University), Xin Peng (Fudan University)
15:00 15m Talk	Learning in the Wild: Towards Leveraging Unlabeled Data for Effectively Tuning Pre-trained Code Models. Research Track. Shuzheng Gao, Wenxin Mao (Harbin Institute of Technology), Cuiyun Gao (Harbin Institute of Technology), Li Li (Beihang University), Xing Hu (Zhejiang University), Xin Xia (Huawei Technologies), Michael Lyu (The Chinese University of Hong Kong)
15:15 7m Talk	GitHubInclusifier: Finding and fixing non-inclusive language in GitHub Repositories. Demonstrations. Liam Todd (Monash University), John Grundy (Monash University), Christoph Treude (Singapore Management University)