REINFOREST: Reinforcing Semantic Code Similarity for Cross-Lingual Code Search Models (SCAM 2024 - Research Track)

Who

Anthony Saieva, Saikat Chakraborty, Gail Kaiser

Track

SCAM 2024 Research Track

Time Zone

The program is currently displayed in (GMT-07:00) Arizona.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 8 Oct 2024 11:21 - 11:37 at Fremont - Program Analysis and Generation Chair(s): Patrick Lam

Abstract

This paper introduces a novel code-to-code search technique that enhances the performance of Large Language Models (LLMs) by including both static and dynamic features and by using both similar and dissimilar examples during training. We present the first code search method that encodes dynamic runtime information during training without needing to execute either the corpus under search or the search query at inference time, and the first code search technique that trains on both positive and negative reference samples. To validate the efficacy of our approach, we perform a set of studies demonstrating the capability of enhanced LLMs to perform cross-language code-to-code search. Our evaluation demonstrates that the effectiveness of our approach is consistent across various model architectures and programming languages. We outperform the state-of-the-art cross-language search tool by up to 44.7%. Moreover, our ablation studies reveal that even a single positive and negative reference sample in the training process results in substantial performance improvements, demonstrating that both similar and dissimilar references are important parts of code search. Importantly, we show that enhanced, well-crafted, fine-tuned models consistently outperform enhanced larger modern LLMs without fine-tuning, even when enhancing the largest available LLMs, highlighting the importance of open-sourced models. To ensure the reproducibility and extensibility of our research, we present an open-sourced implementation of our tool and training procedures called REINFOREST.
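
The idea of training on both similar and dissimilar references, then searching without executing any code, can be illustrated with a small sketch. This is not the REINFOREST implementation: the embedding vectors, the file names, the loss form, and the helper names `cosine`, `pos_neg_loss`, and `rank` are all hypothetical and chosen only to show the general shape of similarity-based training and ranking.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pos_neg_loss(query, positive, negative):
    """Illustrative loss (an assumption, not the paper's objective):
    push similarity to the positive sample toward 1 and
    similarity to the negative sample toward 0."""
    pull = (1.0 - cosine(query, positive)) ** 2
    push = max(0.0, cosine(query, negative)) ** 2
    return pull + push


def rank(query, corpus):
    """Rank corpus entries by similarity to the query embedding.
    Only precomputed embeddings are compared; nothing is executed."""
    return sorted(corpus, key=lambda name: cosine(query, corpus[name]), reverse=True)


# Hypothetical 2-D embeddings: a query in one language and a corpus in another.
query = [1.0, 0.0]
corpus = {"Sort.java": [0.9, 0.1], "Http.java": [0.0, 1.0]}

ranking = rank(query, corpus)
loss = pos_neg_loss(query, [1.0, 0.0], [0.0, 1.0])
```

In this toy setup the loss is lowest when the query embedding aligns with the positive sample and is orthogonal to the negative one, which is the intuition behind using both kinds of reference during training.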

Link to Preprint

https://arxiv.org/abs/2305.03843

Anthony Saieva

IBM Research

United States

Saikat Chakraborty

Microsoft Research

United States

Gail Kaiser

Columbia University

United States

Session Program

Tue 8 Oct
Displayed time zone: Arizona

10:30 - 12:00	Program Analysis and Generation (Research Track) at Fremont. Chair(s): Patrick Lam, University of Waterloo

10:30	16m	Research paper	AUTOGENICS: Automated Generation of Context-Aware Inline Comments for Code Snippets on Programming Q&A Sites Using LLM. Suborno Deb Bappon (Department of Computer Science, University of Saskatchewan, Canada), Saikat Mondal (University of Saskatchewan), Banani Roy (University of Saskatchewan). Pre-print
10:47	16m	Research paper	Code Search Oriented Node-Enhanced Control Flow Graph Embedding (video presentation). Yang Xu, WenLiang Peng (South China University of Technology)
11:04	16m	Research paper	FRANC: A Lightweight Framework for High-Quality Code Generation. Mohammed Latif Siddiq (University of Notre Dame), Beatrice Casey (University of Notre Dame), Joanna C. S. Santos (University of Notre Dame). Pre-print
11:21	16m	Research paper	REINFOREST: Reinforcing Semantic Code Similarity for Cross-Lingual Code Search Models. Anthony Saieva (IBM Research), Saikat Chakraborty (Microsoft Research), Gail Kaiser (Columbia University). Pre-print
11:40	20m	Live Q&A	Discussion (Program Analysis and Generation)