Effectiveness of symmetric metamorphic relations on validating the stability of code generation LLM
This program is tentative and subject to change.
Pre-trained large language models (LLMs) are increasingly used for code generation in software development to enhance productivity. Companies often prefer private LLMs over public ones to mitigate the risk of exposing corporate secrets. Validating the stability of these LLMs' outputs is crucial, and our study proposes using symmetric Metamorphic Relations (MRs) from Metamorphic Testing (MT) for this purpose. We conducted an empirical experiment with eight private LLMs, two public LLMs, and two publicly available datasets. The test scenario simulated a software development environment in which private LLMs generate source code for software enhancement and maintenance while the company's software assets remain secure and unexposed. We defined seven symmetric MRs based on the principles of symmetry and semantic preservation, and used them to generate “Follow-up” datasets from “Source” datasets for testing. Our evaluation aimed to detect violations (inconsistent predictions) between the “Source” and “Follow-up” datasets. We then assessed the effectiveness of the MRs in distinguishing correct from incorrect non-violated predictions against ground truths, as well as how the MRs influenced LLM performance, measured by the correctness of the generated code. Results showed that one public and four private LLMs did not violate the “Case transformation of prompts” MR. Furthermore, the effectiveness and performance results indicated that the proposed MRs effectively explain the instability of LLM outputs through “Case transformation of prompts”, “Duplication of prompts”, and “Paraphrasing of prompts”. The findings also revealed that mixing LLM technologies and pre-training on vast datasets can render the MRs under study ineffective. Most LLMs, except Copilot, demonstrated low semantic understanding and relied primarily on statistical patterns to interpret prompts. This study demonstrated that the proposed MRs can serve as a validation tool, with violation measurements, for the stability of code generation LLMs' outputs in the simulated setting, where no ground truth is available. The effectiveness and performance results indicated that the proposed MRs are effective tools for explaining the instability of LLM outputs. Moreover, the findings highlighted the need to enhance LLMs' semantic understanding of prompts to improve stability, and suggested future research directions, including exploring different MRs, enhancing semantic understanding, and applying symmetry to prompt engineering.
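To make the testing setup concrete, below is a minimal sketch of how a symmetric MR can flag output instability. It is illustrative only: `generate_code`, `MRS`, and `check_violations` are hypothetical names not taken from the paper, the deterministic toy "model" stands in for a real private or public LLM, the three MRs are simplified versions of those named in the abstract, and exact string equality is used as the crudest possible proxy for the study's consistency check.

```python
def generate_code(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call; deterministic toy so the sketch
    # runs end to end. Replace with a real private/public model invocation.
    return f"# generated for: {prompt.lower()}"


# Symmetric MRs: each maps a source prompt to a semantically equivalent
# follow-up prompt (toy versions of three of the study's seven MRs).
MRS = {
    "case_transformation": lambda p: p.upper(),
    "duplication": lambda p: f"{p} {p}",
    "paraphrasing": lambda p: p.replace("Write", "Implement"),  # toy paraphrase
}


def check_violations(source_prompt: str) -> dict:
    """Flag, per MR, whether source and follow-up outputs disagree (a violation).

    Symmetry and semantic preservation imply the two outputs should be
    equivalent; string equality is only the simplest proxy for equivalence.
    """
    source_output = generate_code(source_prompt)
    return {
        name: generate_code(mr(source_prompt)) != source_output
        for name, mr in MRS.items()
    }


if __name__ == "__main__":
    # The toy model lowercases prompts, so case transformation is stable here,
    # while duplication and paraphrasing are flagged as violations.
    print(check_violations("Write a function that reverses a string"))
```

Because this comparison needs no reference solution, the same loop works in the paper's scenario of validating stability without ground truth; only the equality check would need to be replaced by a more robust equivalence measure.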
Mon 17 Nov (displayed time zone: Seoul)
14:00 - 15:30

14:00 | 10m Talk | QuanBench: Benchmarking Quantum Code Generation with Large Language Models | Research Papers

14:10 | 10m Talk | Token Sugar: Making Source Code Sweeter for LLMs through Token-Efficient Shorthand | Research Papers
Zhensu Sun (Singapore Management University), Chengran Yang (Singapore Management University, Singapore), Xiaoning Du (Monash University), Zhou Yang (University of Alberta; Alberta Machine Intelligence Institute), Li Li (Beihang University), David Lo (Singapore Management University)

14:20 | 10m Talk | FGIT: Fault-Guided Fine-Tuning for Code Generation | Research Papers
Lishui Fan (Zhejiang University), Zhongxin Liu (Zhejiang University), Haoye Wang (Hangzhou City University), Lingfeng Bao (Zhejiang University), Xin Xia (Zhejiang University), Shanping Li (Zhejiang University)

14:30 | 10m Talk | Mixture-of-Experts Low-Rank Adaptation for Multilingual Code Summarization | Research Papers
Tianchen Yu (School of Software Engineering, South China University of Technology), Li Yuan (School of Software Engineering, South China University of Technology, Guangzhou, China), Hailin Huang (South China University of Technology), Jiexin Wang (South China University of Technology), Yi Cai (School of Software Engineering, South China University of Technology, Guangzhou, China)

14:40 | 10m Talk | EfficientEdit: Accelerating Code Editing via Edit-Oriented Speculative Decoding | Research Papers | Pre-print
Peiding Wang (Beihang University), Li Zhang (Beihang University), Fang Liu (Beihang University), Yinghao Zhu (Beihang University), Wang Xu (Tsinghua University), Lin Shi (Beihang University), Xiaoli Lian (Beihang University, China), Minxiao Li (Beihang University), Bo Shen (Huawei Cloud Computing Technologies Co., Ltd.), Binzhang Fu (Huawei Technologies), n.n.

14:50 | 10m Talk | Bias Testing and Mitigation in LLM-based Code Generation | Journal-First Track
Dong Huang (The University of Hong Kong), Jie M. Zhang (King's College London), Qingwen Bu (Shanghai Jiao Tong University), Xiaofei Xie (Singapore Management University), Junjie Chen (Tianjin University), Heming Cui (University of Hong Kong)

15:00 | 10m Talk | FastCoder: Accelerating Repository-level Code Generation via Efficient Retrieval and Verification | Research Papers | Pre-print
Qianhui Zhao (Beihang University), Li Zhang (Beihang University), Fang Liu (Beihang University), Xiaoli Lian (Beihang University, China), Meng Qiaoyuanhe (Beihang University), Ziqian Jiao (Beihang University), Zetong Zhou (Beihang University), Jia Li, Lin Shi (Beihang University)

15:10 | 10m Talk | AlignCoder: Aligning Retrieval with Target Intent for Repository-Level Code Completion | Research Papers
Tianyue Jiang (Sun Yat-sen University), Yanli Wang (Sun Yat-sen University), Yanlin Wang (Sun Yat-sen University), Daya Guo, Ensheng Shi (Huawei), Yuchi Ma (Huawei Cloud Computing Technologies), Jiachi Chen (Sun Yat-sen University), Zibin Zheng (Sun Yat-sen University)

15:20 | 10m Talk | Effectiveness of symmetric metamorphic relations on validating the stability of code generation LLM | Journal-First Track
Chan Pak Yuen (Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong, China), Jacky Keung (City University of Hong Kong), Zhen Yang (Shandong University)