EASE 2024
Tue 18 - Fri 21 June 2024 Salerno, Italy

Machine Translation (MT) has numerous applications in Software Engineering (SE). Recently, it has been employed not only for programming language translation but also as an oracle for deriving information for various SE research problems. In this line of work, MT’s impact is assessed through the accuracy metrics of the target problem rather than through traditional translation evaluation metrics. For code search, a recent work, ASTTrans, introduced an MT-based model that extracts relevant non-terminal nodes from the Abstract Syntax Tree (AST) of an implementation based on natural language descriptions. While ASTTrans demonstrated the effectiveness of MT in enhancing code search on small datasets with low embedding dimensions, it struggled to improve code search accuracy on the standard benchmark CodeSearchNet. In this work, we present Oracle4CS, a novel approach that integrates a classical MT model, Statistical Machine Translation (SMT), to support modern code search models. To accomplish this, we introduce a new code representation technique called ASTSum, which summarizes each code snippet using a limited number of AST nodes. Additionally, we devise a new approach to code search that replaces natural language queries with a representation incorporating the results of our query-to-ASTSum translation process. Through experiments, we demonstrate that Oracle4CS can enhance code search performance on both the original BERT-based model UniXcoder and the optimized BERT-based model CoCoSoDa by up to 1.18% and 2% in Mean Reciprocal Rank (MRR), respectively, across eight selected well-known datasets. We also explore ASTSum as a promising code representation for supporting code search, potentially improving MRR by over 17% on average when paired with an optimal SMT model for query-to-ASTSum translation.
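
For readers unfamiliar with the evaluation metric, MRR averages the reciprocal of the rank at which the correct snippet appears for each query. The following is a minimal sketch, assuming each query has exactly one correct snippet and ranks are 1-based; the function name and the sample ranks are illustrative and are not taken from the paper.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Return the MRR for a list of 1-based ranks.

    ranks[i] is the position of the correct snippet in the result list
    for query i, or None if it was not retrieved (contributes 0).
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    total = 0.0
    for rank in ranks:
        if rank is None:
            continue  # a miss adds nothing to the sum
        if rank < 1:
            raise ValueError("ranks are 1-based")
        total += 1.0 / rank
    return total / len(ranks)


# Hypothetical ranks for four queries (illustrative data only).
example_ranks = [1, 2, None, 4]
example_mrr = mean_reciprocal_rank(example_ranks)  # (1 + 0.5 + 0 + 0.25) / 4
```

Under this definition, a model that always ranks the correct snippet first scores 1.0, and a model that never retrieves it scores 0.0.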