Investigating the Efficacy of Large Language Models for Code Clone Detection (ICPC 2024 - Early Research Achievements (ERA))

Who

Mohamad Khajezade, Jie JW Wu, Fatemeh Hendijani Fard, Gema Rodríguez-Pérez, Mohamed S Shehata

Track

ICPC 2024 Early Research Achievements (ERA)

Time Zone

The program is currently displayed in (GMT+01:00) Lisbon.

When

Mon 15 Apr 2024 15:10 - 15:14 at Sophia de Mello Breyner Andresen - Code + Documentation Generation Chair(s): Massimiliano Di Penta

Abstract

Large Language Models (LLMs) have demonstrated remarkable success in various natural language processing and software engineering tasks, such as code generation. LLMs are mainly used in the prompt-based zero/few-shot paradigm, in which a prompt guides the model in accomplishing the task.

Goal: GPT-based models are among the most popular LLMs studied for tasks such as code comment generation or test generation. These are "generative" tasks. However, there is limited research on using LLMs for "non-generative" tasks, such as classification, in the prompt-based paradigm. In this preliminary exploratory study, we investigated the applicability of LLMs to Code Clone Detection (CCD), a non-generative task.

Method: Using a mono-lingual and cross-lingual CCD dataset that we built from CodeNet, we first investigated two different prompts with ChatGPT to detect Type-4 code clones in Java-Java and Java-Ruby pairs in the zero-shot setting. We then conducted an analysis to understand the strengths and weaknesses of ChatGPT in CCD.

Results: ChatGPT surpasses the baselines in cross-language CCD and achieves comparable performance to fully fine-tuned models for mono-lingual CCD. Also, the prompt and the difficulty level of the problems have an impact on the performance of ChatGPT. Finally, we provide insights and future directions based on our initial analysis. Our code and data are open-sourced at https://anonymous.4open.science/r/largeLanguageModels-4A1F.
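As a rough illustration of the zero-shot setup described in the abstract, the sketch below builds a prompt for a pair of programs, maps a free-text model reply to a clone / non-clone label, and scores predictions. Everything here is an assumption made for illustration: the prompt wording, the `CodePair` type, the function names, and the choice of precision/recall/F1 are hypothetical and are not the prompts or metrics used by the authors. The call to the model itself is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CodePair:
    """A pair of programs to compare (hypothetical container type)."""
    lang_a: str
    code_a: str
    lang_b: str
    code_b: str


def build_zero_shot_prompt(pair: CodePair) -> str:
    """Build a zero-shot prompt. The wording is illustrative, not the paper's."""
    return (
        "Do the following two programs solve the same problem? "
        "Answer with 'yes' or 'no' only.\n\n"
        f"Program 1 ({pair.lang_a}):\n{pair.code_a}\n\n"
        f"Program 2 ({pair.lang_b}):\n{pair.code_b}\n"
    )


def parse_answer(reply: str) -> Optional[bool]:
    """Map a model reply to True (clone), False (not a clone), or None (unclear)."""
    words = reply.lower().replace(",", " ").replace(".", " ").split()
    if not words:
        return None
    if words[0] == "yes":
        return True
    if words[0] == "no":
        return False
    return None


def precision_recall_f1(
    predicted: List[bool], expected: List[bool]
) -> Tuple[float, float, float]:
    """Score binary clone predictions against ground-truth labels."""
    tp = sum(1 for p, e in zip(predicted, expected) if p and e)
    fp = sum(1 for p, e in zip(predicted, expected) if p and not e)
    fn = sum(1 for p, e in zip(predicted, expected) if not p and e)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1
```

In use, the string returned by `build_zero_shot_prompt` would be sent to a chat model, the reply passed through `parse_answer`, and unclear replies (`None`) handled by whatever policy the experiment chooses, for example counting them as non-clones.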

Mohamad Khajezade

University of British Columbia Okanagan

Jie JW Wu

University of British Columbia (UBC)

Canada

Fatemeh Hendijani Fard

University of British Columbia

Canada

Gema Rodríguez-Pérez

University of British Columbia (UBC)

Canada

Mohamed S Shehata

University of British Columbia

Session Program

Mon 15 Apr
Displayed time zone: Lisbon

14:00 - 15:30	Code + Documentation Generation. Research Track / Early Research Achievements (ERA) / Replications and Negative Results (RENE) at Sophia de Mello Breyner Andresen. Chair(s): Massimiliano Di Penta (University of Sannio, Italy)

14:00 10m Talk		MESIA: Understanding and Leveraging Supplementary Nature of Method-level Comments for Automatic Comment Generation. ICPC Full paper, Research Track. Xinglu Pan (Peking University); Chenxiao Liu (Peking University); Yanzhen Zou (Peking University); Tao Xie (Peking University); Bing Xie (Peking University). Pre-print available.
14:10 10m Talk		Compositional API Recommendation for Library-Oriented Code Generation. ICPC Full paper, Research Track. Zexiong Ma (Peking University); Shengnan An (Xi’an Jiaotong University); Bing Xie (Peking University); Zeqi Lin (Microsoft Research, China). Pre-print available.
14:20 10m Talk		On the Generalizability of Deep Learning-based Code Completion Across Programming Language Versions. ICPC Full paper, Research Track. Matteo Ciniselli (Università della Svizzera Italiana); Alberto Martin-Lopez (Software Institute - USI, Lugano); Gabriele Bavota (Software Institute @ Università della Svizzera Italiana)
14:30 10m Talk		ESGen: Commit Message Generation Based on Edit Sequence of Code Change. ICPC Full paper, Virtual-Talk, Research Track. Xiangping Chen (Sun Yat-sen University); Yangzi Li (Sun Yat-sen University); Zhicao Tang (Sun Yat-sen University); Yuan Huang (School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China); Haojie Zhou (School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China); Mingdong Tang (Guangdong University of Foreign Studies); Zibin Zheng (Sun Yat-sen University)
14:40 10m Talk		Improving AST-Level Code Completion with Graph Retrieval and Multi-Field Attention. ICPC Full paper, Virtual-Talk, Research Track. Yu Xia (Central South University); Tian Liang (Central South University); Wei-Huan Min (Central South University); Li Kuang (School of Computer Science and Engineering, Central South University)
14:50 10m Talk		Exploring and Improving Code Completion for Test Code. ICPC Full paper, Research Track. Tingwei Zhu (Nanjing University); Zhongxin Liu (Zhejiang University); Tongtong Xu (Huawei); Ze Tang (Software Institute, Nanjing University); Tian Zhang (Nanjing University); Minxue Pan (Nanjing University); Xin Xia (Huawei Technologies)
15:00 10m Talk		Understanding the Impact of Branch Edit Features for the Automatic Prediction of Merge Conflict Resolutions. ICPC RENE Paper, Replications and Negative Results (RENE). Waad riadh aldndni (Virginia Tech); Francisco Servant (ITIS Software, University of Malaga); Na Meng (Virginia Tech)
15:10 4m Talk		Investigating the Efficacy of Large Language Models for Code Clone Detection. ICPC ERA Paper, Early Research Achievements (ERA). Mohamad Khajezade (University of British Columbia Okanagan); Jie JW Wu (University of British Columbia (UBC)); Fatemeh Hendijani Fard (University of British Columbia); Gema Rodríguez-Pérez (University of British Columbia (UBC)); Mohamed S Shehata (University of British Columbia)
15:14 16m Talk		Code + Documentation Generation: Panel with Speakers. ICPC Discussion