APSEC 2024
Tue 3 - Fri 6 December 2024 China

Given a natural-language query, code search aims to retrieve the corresponding target code from a code base, which can accelerate the software development process. Recent deep-learning-based pre-trained code models can capture the semantic connection between programming language and natural language, generating more accurate vector representations for code and queries and significantly improving the matching accuracy between the two. However, most recent research on code search focuses only on improving accuracy while neglecting efficiency. In this paper, we propose CoSTV, a novel code search framework that speeds up the code search process. CoSTV adopts a two-stage paradigm that combines the efficiency of a bi-encoder with the accuracy of a cross-encoder, decoupling code search into a recall stage and a re-rank stage. Specifically, we introduce a vector retrieval system, program simplification, and knowledge distillation to substantially accelerate code search while retaining comparable accuracy. In the recall stage, CoSTV uses a bi-encoder code search model and a vector retrieval engine to rapidly recall highly relevant code candidates. In the re-rank stage, CoSTV employs a cross-encoder-based code search model together with program simplification and model distillation to improve the precision of code search. Extensive experiments on the CodeSearchNet dataset show that, compared with previous code search baselines, CoSTV reduces code search time by 79.1% while improving accuracy by 7.93% on average.
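To make the two-stage recall/re-rank paradigm concrete, the sketch below shows one way such a pipeline can be wired up. It is a minimal illustration, not the paper's implementation: the bi-encoder, cross-encoder, corpus, and exact inner-product search are stand-ins (the paper's actual models, vector retrieval engine, program simplification, and distillation steps are not reproduced here).

```python
# Conceptual two-stage (recall + re-rank) code search pipeline in the spirit of
# CoSTV. All components below are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
DIM = 256  # embedding dimension (illustrative)

def bi_encode(texts):
    """Stand-in bi-encoder: maps each text independently to a dense unit vector."""
    vecs = rng.normal(size=(len(texts), DIM)).astype("float32")
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def cross_score(query, code):
    """Stand-in cross-encoder: jointly scores a (query, code) pair."""
    return float(len(set(query.split()) & set(code.split())))

corpus = [
    "def add(a, b): return a + b",
    "def read_file(path): return open(path).read()",
    "def sort_list(xs): return sorted(xs)",
]

# Offline: encode the code base once so the recall stage only does vector math.
code_vecs = bi_encode(corpus)

def search(query, recall_k=2):
    # Stage 1 (recall): exact inner-product search over pre-computed code
    # embeddings, standing in for a vector retrieval engine.
    q_vec = bi_encode([query])[0]
    scores = code_vecs @ q_vec
    candidates = np.argsort(-scores)[:recall_k]
    # Stage 2 (re-rank): apply the more expensive cross-encoder only to the
    # small candidate set, which is what keeps the pipeline fast.
    reranked = sorted(candidates,
                      key=lambda i: cross_score(query, corpus[i]),
                      reverse=True)
    return [corpus[i] for i in reranked]

print(search("read a file and return its contents"))
```

Because the cross-encoder only ever sees the top-k recalled candidates rather than the whole corpus, its per-query cost is bounded by k, which is the design choice that lets accuracy-oriented re-ranking coexist with fast retrieval.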