Filtering before Tuning: Robust Fine-Tuning of Large Code Models under Noisy Labels
Fine-tuning plays a crucial role in adapting large code models (LCMs) to specific software engineering tasks. However, fine-tuning LCMs requires perfectly labeled datasets, which are rarely available in practice. Noisy labels in the training data can significantly impair the generalization ability and overall performance of fine-tuned LCMs. Previous work has primarily focused on the problem of noisy labels in training models from scratch, while this problem remains largely unexplored in the context of fine-tuning LCMs.
To fill this gap, this paper proposes RobustFT, the first approach for fine-tuning LCMs in the presence of noisy labels. The core of RobustFT is to distinguish noisy labels from clean ones based on training dynamics observed during the fine-tuning process. Our insight is that, during fine-tuning, the trajectories of mislabeled samples in the latent feature space are significantly longer than those of clean samples. After filtering out noisy labels, RobustFT restarts the fine-tuning process using only the selected clean samples, thus producing a more effective LCM. We evaluate RobustFT on 36 diverse subjects, covering multiple LCMs, code datasets, and varying types and ratios of noisy labels. The results show that RobustFT outperforms five baselines in both identifying noisy labels and enhancing the fine-tuning effectiveness of LCMs.
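The filtering idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how trajectory length is measured or how the cut-off is chosen. The sketch assumes that length is the sum of Euclidean distances between a sample's latent features at consecutive fine-tuning checkpoints, and that the samples with the longest trajectories are dropped according to an assumed noise ratio. The names `trajectory_length`, `select_clean_indices`, and `assumed_noise_ratio` are hypothetical.

```python
import math
from typing import List, Sequence


def trajectory_length(features_per_checkpoint: Sequence[Sequence[float]]) -> float:
    """Sum of Euclidean distances between consecutive feature vectors.

    Assumption: the abstract only says mislabeled samples have longer
    trajectories; Euclidean step lengths are one plausible measure.
    """
    return sum(
        math.dist(prev, curr)
        for prev, curr in zip(features_per_checkpoint, features_per_checkpoint[1:])
    )


def select_clean_indices(
    trajectories: Sequence[Sequence[Sequence[float]]],
    assumed_noise_ratio: float,
) -> List[int]:
    """Keep the samples with the shortest trajectories.

    Assumption: a fixed fraction of samples (assumed_noise_ratio) is
    treated as noisy; the real selection rule may differ.
    """
    lengths = [trajectory_length(t) for t in trajectories]
    n_drop = int(round(assumed_noise_ratio * len(lengths)))
    # Indices ordered from shortest to longest trajectory.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return sorted(order[: len(lengths) - n_drop])


# Toy data: four samples, features recorded at three checkpoints each.
toy_trajectories = [
    [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0)],  # short path
    [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)],  # long path, treated as noisy
    [(1.0, 1.0), (1.0, 1.5), (1.0, 2.0)],  # short path
    [(0.0, 0.0), (0.0, 0.3), (0.4, 0.3)],  # short path
]
clean = select_clean_indices(toy_trajectories, assumed_noise_ratio=0.25)
print(clean)  # [0, 2, 3]
```

In a full pipeline, the retained indices would define the subset on which fine-tuning is restarted from the original pretrained weights, matching the "filtering before tuning" order in the title.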
Thu 16 Apr. Displayed time zone: Brasilia, Distrito Federal, Brazil
16:00 - 17:30 | AI for Software Engineering 19 | Research Track at Oceania IX | Chair(s): Fabio Palomba (University of Salerno)
16:00 (15m, Talk) | An Eye for AI: Eye-Tracking the Micro-Interruptions of GenAI Code Suggestions (Artifact Award Winner) | Research Track
16:15 (15m, Talk) | Inside Out: Uncovering How Comment Internalization Steers LLMs for Better or Worse | Research Track | Aaron Imani (University of California, Irvine), Mohammad Moshirpour (University of California, Irvine), Iftekhar Ahmed (University of California at Irvine)
16:30 (15m, Talk) | Scrub It Out! Erasing Sensitive Memorization in Code Language Models via Machine Unlearning | Research Track | Zhaoyang Chu (Huazhong University of Science and Technology), Yao Wan (Huazhong University of Science and Technology), Zhikun Zhang (Zhejiang University), Di Wang (King Abdullah University of Science and Technology), Zhou Yang (University of Alberta, Alberta Machine Intelligence Institute), Hongyu Zhang (Chongqing University), Pan Zhou (Huazhong University of Science and Technology), Xuanhua Shi (Huazhong University of Science and Technology), Hai Jin (Huazhong University of Science and Technology), David Lo (Singapore Management University)
16:45 (15m, Talk) | What Makes Code Generation Ethically Sourced? (Distinguished Paper Award) | Research Track | Zhuolin Xu (Concordia University), Chenglin Li (Concordia University), Qiushi Li (Concordia University), Shin Hwei Tan (Concordia University)
17:00 (15m, Talk) | Filtering before Tuning: Robust Fine-Tuning of Large Code Models under Noisy Labels | Research Track | Zhong Li (Nanjing University), Yang Chen (China Automobile Data of Tianjin Co., Ltd. China Automotive Technology&Research Center Co.,Ltd.), Heng Yong (Nanjing University), Yuanyi Lin (Huawei Technologies), Jiali Zhao (Huawei), Tongtong Xu (Huawei), Minxue Pan (Nanjing University), Tian Zhang (Nanjing University), Xuandong Li (Nanjing University)
17:15 (15m, Talk) | Automating Requirements Formalization: Using LLMs and Low-Complexity Distinguishing Traces for Semantic Validation | Research Track | Daniel Mendoza (Stanford University), Anastasia Mavridou (KBR / NASA Ames Research Center), Andreas Katis (KBR / NASA Ames Research Center), Caroline Trippel (Stanford University)