SANER 2025
Tue 4 - Fri 7 March 2025 Montréal, Québec, Canada
Fri 7 Mar 2025 11:15 - 11:30 at L-1720 - Mining Software Repositories Chair(s): Brittany Reid

Pre-training a language model and then fine-tuning it has been shown to be an efficient and effective technique for a wide range of code intelligence tasks, such as code generation, code summarization, and vulnerability detection. However, pre-training language models on a large-scale code corpus is computationally expensive. Fortunately, many off-the-shelf pre-trained code models (PCMs), such as CodeBERT, CodeT5, CodeGen, and Code Llama, have been released publicly. These models acquire general code understanding and generation capabilities during pre-training, which enhance their performance on downstream code intelligence tasks. With an increasing number of such public pre-trained models, selecting the most suitable one to reuse for a specific task becomes essential. In this paper, we systematically investigate the reusability of PCMs. We first explore three intuitive model selection methods: selecting by size, by training data, or via brute-force fine-tuning. Experimental results show that these straightforward techniques either perform poorly or incur high costs. Motivated by these findings, we explore learning-based model selection strategies that utilize pre-trained models without altering their parameters. Specifically, we train proxy models to gauge the performance of pre-trained models, and we measure the distributional deviation between a model's latent features and the task's labels, using their closeness as an indicator of model transferability. We conduct experiments on 100 widely used open-source PCMs for code intelligence tasks, with sizes ranging from 42.5 million to 3 billion parameters. The results demonstrate that learning-based selection methods reduce selection time to 100 seconds, compared with 2,700 hours for brute-force fine-tuning, with less than 6% performance degradation across related tasks.
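The learning-based selection idea can be pictured with a small example. The following is a minimal sketch, assuming a scikit-learn linear probe as the proxy model and its cross-validated accuracy on frozen features as the transferability signal; the helper extract_features and the candidate_pcms dictionary are hypothetical, and this illustrates the general approach rather than the paper's exact method.

    # Sketch: rank candidate PCMs by how well a cheap linear "proxy model"
    # predicts task labels from each model's frozen latent features.
    # This is an illustration of learning-based selection, not the
    # authors' exact method; extract_features and candidate_pcms are
    # hypothetical placeholders.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def probe_score(features: np.ndarray, labels: np.ndarray) -> float:
        """Score a frozen model by how separable its latent features
        are with respect to the task labels."""
        probe = LogisticRegression(max_iter=1000)
        # Mean cross-validated accuracy of the proxy model stands in
        # for the (expensive) fine-tuned performance of the full PCM.
        return cross_val_score(probe, features, labels, cv=3).mean()

    # Usage: run each candidate PCM once over the task data to obtain
    # frozen features, score it, and pick the top-ranked model.
    # scores = {name: probe_score(extract_features(m, X), y)
    #           for name, m in candidate_pcms.items()}
    # best = max(scores, key=scores.get)

Because the candidate models are only run forward once and never updated, the cost per model is a single feature-extraction pass plus a lightweight probe fit, which is what makes selection feasible across 100 models.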

Fri 7 Mar

Displayed time zone: Eastern Time (US & Canada)

11:00 - 12:30 Mining Software Repositories at L-1720 (Chair: Brittany Reid)
11:00
15m
Talk
An Empirical Study of Transformer Models on Automatically Templating GitHub Issue Reports
Research Papers
Jin Zhang Hunan Normal University, Maoqi Peng Hunan Normal University, Yang Zhang National University of Defense Technology, China
11:15
15m
Talk
How to Select Pre-Trained Code Models for Reuse? A Learning Perspective (Best Paper Award)
Research Papers
Zhangqian Bi Huazhong University of Science and Technology, Yao Wan Huazhong University of Science and Technology, Zhaoyang Chu Huazhong University of Science and Technology, Yufei Hu Huazhong University of Science and Technology, Junyi Zhang Huazhong University of Science and Technology, Hongyu Zhang Chongqing University, Guandong Xu University of Technology, Hai Jin Huazhong University of Science and Technology
Pre-print
11:30
7m
Talk
Uncovering the Challenges: A Study of Corner Cases in Bug-Inducing Commits
Early Research Achievement (ERA) Track
Atakan Şerifoğlu Bilkent University, Eray Tüzün Bilkent University
11:37
15m
Talk
A Bot Identification Model and Tool Based on GitHub Activity Sequences
Journal First Track
Natarajan Chidambaram University of Mons, Alexandre Decan University of Mons; F.R.S.-FNRS, Tom Mens University of Mons
11:52
15m
Talk
Does the Tool Matter? Exploring Some Causes of Threats to Validity in Mining Software Repositories
Reproducibility Studies and Negative Results (RENE) Track
Nicole Hoess Technical University of Applied Sciences Regensburg, Carlos Paradis No Affiliation, Rick Kazman University of Hawai‘i at Mānoa, Wolfgang Mauerer Technical University of Applied Sciences Regensburg