CoSTV: Accelerating Code Search with Two-Stage Paradigm and Vector Retrieval
Given a natural-language query, code search retrieves the corresponding target code from a code base, which can accelerate the software development process. Recent pre-trained code models based on deep learning capture the semantic connection between programming language and natural language and generate more accurate vector representations for code and queries, significantly improving the matching accuracy between the two. However, most recent research on code search focuses only on improving accuracy and neglects efficiency. In this paper, we propose CoSTV, a novel code search framework that speeds up the code search process. CoSTV employs a two-stage paradigm that decouples code search into a recall stage and a re-rank stage, combining the advantages of bi-encoders and cross-encoders in efficiency and accuracy. Specifically, we introduce a vector retrieval system, program simplification, and knowledge distillation to substantially accelerate code search while retaining comparable accuracy. In the recall stage, CoSTV uses a bi-encoder code search model and a vector retrieval engine to rapidly recall highly relevant code candidates. In the re-rank stage, CoSTV employs a cross-encoder-based code search model, program simplification, and model distillation to enhance the precision of code search. Extensive experiments on the CodeSearchNet dataset indicate that, compared with previous code search baselines, CoSTV reduces code search time by 79.1% while improving code search accuracy by 7.93% on average.
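The recall and re-rank flow described in the abstract can be illustrated with a minimal sketch. This is not the CoSTV implementation: the bag-of-words `embed` function stands in for a bi-encoder, the brute-force cosine ranking stands in for a vector retrieval engine, and the token-overlap `cross_score` stands in for a cross-encoder. All function names, the sample snippets, and the parameters `k` and `n` are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

def tokens(text):
    # Lowercase alphabetic tokens; "read_file" becomes ["read", "file"].
    return re.findall(r"[a-z]+", text.lower())

def embed(text):
    # ASSUMPTION: toy stand-in for a bi-encoder (bag-of-words counts).
    return Counter(tokens(text))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recall(query, code_vectors, k):
    # Stage 1: compare the query vector with pre-computed code vectors
    # and keep the indices of the k most similar snippets.
    # ASSUMPTION: brute-force scan instead of a vector retrieval engine.
    q = embed(query)
    order = sorted(range(len(code_vectors)),
                   key=lambda i: cosine(q, code_vectors[i]),
                   reverse=True)
    return order[:k]

def cross_score(query, code):
    # ASSUMPTION: toy stand-in for a cross-encoder, which would score
    # the (query, code) pair jointly. Here: fraction of query tokens
    # that also occur in the code.
    q = set(tokens(query))
    c = set(tokens(code))
    return len(q & c) / len(q) if q else 0.0

def search(query, codes, code_vectors, k=2, n=1):
    # Stage 2: re-rank only the k recalled candidates with the more
    # expensive pairwise scorer, then return the top n snippets.
    candidates = recall(query, code_vectors, k)
    reranked = sorted(candidates,
                      key=lambda i: cross_score(query, codes[i]),
                      reverse=True)
    return [codes[i] for i in reranked[:n]]

# Hypothetical code base; vectors are computed once, ahead of query time.
CODES = [
    "def read_file(path): return open(path).read()",
    "def sort_list(items): return sorted(items)",
    "def add(a, b): return a + b",
]
CODE_VECTORS = [embed(c) for c in CODES]

if __name__ == "__main__":
    print(search("read a file from path", CODES, CODE_VECTORS))
```

The point of the split is that the pairwise scorer runs on only `k` candidates rather than the whole code base, while the code vectors used in the recall stage can be computed once in advance.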
Thu 5 Dec. Displayed time zone: Beijing, Chongqing, Hong Kong, Urumqi
14:00 - 15:30 | Session (10): Technical Track / SEIP - Software Engineering in Practice, at Room 3 (Xiangquan Ballroom) | Chair(s): In-Young Ko (Korea Advanced Institute of Science and Technology)
14:00 | 30m Talk | Why not Just Look For Answers? Using A More Direct Way for API Recommendation | Technical Track | Changxin Liu (Chongqing University), Ling Xu (School of Big Data & Software Engineering, Chongqing University), Wenhan Mu (Chongqing University), Rui Qin (Chongqing University)
14:30 | 30m Talk | Learning Heterogeneous Abstract Code Graph Representations For Program Comprehension | Technical Track | Shenning Song, Mengxi Zhang, Shaoquan Li, Huaxiao Liu (all: The College of Computer Science and Technology, Jilin University)
15:00 | 20m Talk | CoSTV: Accelerating Code Search with Two-Stage Paradigm and Vector Retrieval | SEIP - Software Engineering in Practice | Dewu Zheng, Yanlin Wang, Wenqing Chen, Jiachi Chen, Zibin Zheng (all: Sun Yat-sen University)