APSEC 2024
Tue 3 - Fri 6 December 2024 China

In the field of software engineering automation, code language models (CLMs) have made significant strides in code generation tasks. However, the cost of updating their knowledge and the issue of hallucination limit CLMs in practical code generation scenarios, making retrieval-augmented code generation a mainstream approach. Existing retrieval-augmented methods build codebases for only a single programming language, which cannot compensate for gaps in that language's knowledge. To address this, we propose CodeRCSG, a novel cross-lingual retrieval-augmented code generation method. CodeRCSG constructs a multilingual codebase and builds a unified cross-lingual code semantic graph to capture deep semantic information across programming languages. By encoding the retrieved code semantic graph with a graph neural network (GNN) and combining it with the input text embeddings, CLMs can effectively exploit the transferred cross-lingual programming knowledge to improve the quality of generated code. Experimental results show that CodeRCSG significantly enhances the code generation capabilities of CLMs.
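The abstract does not specify how the GNN output is combined with the input text embeddings; the sketch below illustrates one plausible reading, in which a retrieved semantic graph is encoded by a small message-passing GNN, pooled into a graph-level vector, and prepended to the code LM's token embeddings as a soft prompt. All names (`SemanticGraphEncoder`, `fuse_graph_with_text`), dimensions, and the fusion strategy are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, assuming a dense adjacency matrix for the retrieved
# code semantic graph and soft-prompt-style fusion with the LM input.
import torch
import torch.nn as nn


class SemanticGraphEncoder(nn.Module):
    """Two rounds of mean-aggregation message passing, then mean pooling."""

    def __init__(self, node_dim: int, hidden_dim: int):
        super().__init__()
        self.lin1 = nn.Linear(node_dim, hidden_dim)
        self.lin2 = nn.Linear(hidden_dim, hidden_dim)

    @staticmethod
    def propagate(x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Mean-aggregate neighbor features; adj is an (N, N) adjacency
        # matrix that already contains self-loops.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        return (adj @ x) / deg

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.lin1(self.propagate(x, adj)))
        h = torch.relu(self.lin2(self.propagate(h, adj)))
        return h.mean(dim=0)  # graph-level embedding


def fuse_graph_with_text(graph_vec: torch.Tensor,
                         text_embeds: torch.Tensor,
                         proj: nn.Linear) -> torch.Tensor:
    """Project the graph embedding into the LM embedding space and
    prepend it to the token embeddings (assumed fusion strategy)."""
    prefix = proj(graph_vec).unsqueeze(0)           # (1, lm_dim)
    return torch.cat([prefix, text_embeds], dim=0)  # (1 + seq_len, lm_dim)


if __name__ == "__main__":
    node_dim, hidden_dim, lm_dim, seq_len = 128, 256, 768, 32
    # Toy semantic graph: 5 nodes with random features and edges.
    x = torch.randn(5, node_dim)
    adj = (torch.rand(5, 5) > 0.5).float()
    adj.fill_diagonal_(1.0)

    encoder = SemanticGraphEncoder(node_dim, hidden_dim)
    proj = nn.Linear(hidden_dim, lm_dim)
    text_embeds = torch.randn(seq_len, lm_dim)  # stand-in for LM token embeddings

    fused = fuse_graph_with_text(encoder(x, adj), text_embeds, proj)
    print(fused.shape)  # torch.Size([33, 768])
```

Under this reading, the fused sequence would be fed to the code LM in place of its plain token embeddings, so the generator conditions on both the prompt and the retrieved cross-lingual graph; the actual CodeRCSG architecture may differ.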