Unsupervised and Supervised Co-learning for Comment-based Codebase Refining and its Application in Code Search
Background: Code pre-training and large language models depend heavily on data quality. These models require a vast, high-quality corpus that pairs text descriptions with code to establish semantic correlations between natural and programming languages. Unlike general NLP corpora, code comments rely heavily on specialized programming knowledge and are often limited in quantity and variety. Consequently, most widely available open-source datasets are built with compromises and contain noise from platforms such as Stack Overflow, where code snippets are often incomplete. This may lead to significant errors when the trained models are deployed in real-world applications. Aims: Comments are used as a substitute for queries to build code search datasets from GitHub. While comments describe code functionality and details, they often contain noise and differ from real queries. Our research therefore focuses on improving the syntactic and semantic quality of code comments. Method: We propose CoCoRF, a comment-based data refinement framework built on an unsupervised and supervised co-learning technique. It applies manually defined rules for syntactic filtering and constructs a bootstrap query corpus via the WTFF algorithm, which is then used to train the TVAE model for further semantic filtering. Results: Our study shows that CoCoRF achieves high efficiency with fewer computational resources and outperforms comparison models on the DeepCS code search task. Conclusions: Our findings indicate that the CoCoRF framework significantly improves the performance of code search tasks by enhancing the quality of code datasets.
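The Method summary above describes a two-stage pipeline: rule-based syntactic filtering followed by model-based semantic filtering of comment-code pairs. The Python sketch below only illustrates that general shape under stated assumptions; the filter rules, the threshold, and the semantic_score placeholder are hypothetical and do not reproduce the paper's actual WTFF or TVAE components.

    # Minimal sketch of a two-stage comment-refinement pipeline in the spirit of
    # CoCoRF. The syntax rules and the pluggable `semantic_score` callable are
    # illustrative assumptions, not the paper's WTFF/TVAE implementation.
    import re
    from typing import Callable, Iterable, List, Tuple

    URL_RE = re.compile(r"https?://\S+")
    NON_ASCII_RE = re.compile(r"[^\x00-\x7F]")

    def passes_syntax_rules(comment: str) -> bool:
        """Hand-crafted syntactic filters: drop comments that are too short,
        contain URLs, or are dominated by non-ASCII characters."""
        text = comment.strip()
        if len(text.split()) < 3:        # too short to describe functionality
            return False
        if URL_RE.search(text):          # links rarely describe the code itself
            return False
        if len(NON_ASCII_RE.findall(text)) > 0.3 * max(len(text), 1):
            return False
        return True

    def refine(
        pairs: Iterable[Tuple[str, str]],
        semantic_score: Callable[[str, str], float],
        threshold: float = 0.5,
    ) -> List[Tuple[str, str]]:
        """Keep (comment, code) pairs that survive the syntax rules and whose
        semantic score (e.g., from a trained model) exceeds a threshold."""
        kept = []
        for comment, code in pairs:
            if not passes_syntax_rules(comment):
                continue
            if semantic_score(comment, code) >= threshold:
                kept.append((comment, code))
        return kept

    if __name__ == "__main__":
        # Toy scorer standing in for a trained semantic model.
        toy_scorer = lambda comment, code: 1.0 if "sort" in comment.lower() else 0.0
        data = [("Sorts the list in place", "xs.sort()"),
                ("TODO", "pass"),
                ("see http://example.com for details", "foo()")]
        print(refine(data, toy_scorer))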
Thu 24 Oct (displayed time zone: Brussels, Copenhagen, Madrid, Paris)
16:00 - 17:30 | Machine learning for software engineering | ESEM Technical Papers / ESEM Emerging Results, Vision and Reflection Papers Track / ESEM Journal-First Papers | Telensenyament (B3 Building - 1st Floor) | Chair(s): Luigi Quaranta (University of Bari, Italy)
16:00 (20m, Full paper) | A Transformer-based Approach for Augmenting Software Engineering Chatbots Datasets | ESEM Technical Papers | Ahmad Abdellatif (University of Calgary), Khaled Badran (Concordia University, Canada), Diego Costa (Concordia University, Canada), Emad Shihab (Concordia University)
16:20 (20m, Full paper) | Unsupervised and Supervised Co-learning for Comment-based Codebase Refining and its Application in Code Search | ESEM Technical Papers | Gang Hu (School of Information Science & Engineering, Yunnan University), Xiaoqin Zeng (School of Information Science & Engineering, Yunnan University), Wanlong Yu, Min Peng, YUAN Mengting (School of Computer Science, Wuhan University, Wuhan, China), Liang Duan
16:40 (20m, Full paper) | Good things come in three: Generating SO Post Titles with Pre-Trained Models, Self Improvement and Post Ranking | ESEM Technical Papers | Duc Anh Le (Hanoi University of Science and Technology), Anh M. T. Bui (Hanoi University of Science and Technology), Phuong T. Nguyen (University of L'Aquila), Davide Di Ruscio (University of L'Aquila)
17:00 (15m, Vision and Emerging Results) | PromptLink: Multi-template prompt learning with adversarial training for issue-commit link recovery | ESEM Emerging Results, Vision and Reflection Papers Track | Yang Deng (School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China), Bangchao Wang (Wuhan Textile University), Zhiyuan Zou (School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China), Luyao Ye (School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China)
17:15 (15m, Journal Early-Feedback) | GPTSniffer: A CodeBERT-based classifier to detect source code written by ChatGPT | ESEM Journal-First Papers | Phuong T. Nguyen (University of L'Aquila), Juri Di Rocco (University of L'Aquila), Claudio Di Sipio (University of L'Aquila), Riccardo Rubei (University of L'Aquila), Davide Di Ruscio (University of L'Aquila), Massimiliano Di Penta (University of Sannio, Italy)