The lack of transparency regarding the code datasets used during LLM training creates substantial challenges in detecting, evaluating, and mitigating data leakage. This paper applies a perturbation-based approach to quantify the “membership advantage” of LLMs across various coding tasks, measuring the gap between a model’s performance on data it has likely encountered during training and its performance on novel inputs. Our comprehensive analysis examines 8 open-source code LLMs, including StarCoder and QwenCoder, across 19 benchmark datasets spanning four distinct categories: standard code generation, code understanding, security vulnerability detection, and bug identification. The results reveal significant variation in sensitivity patterns, with models like StarCoder exhibiting substantially higher sensitivity scores on certain benchmarks than models like QwenCoder, suggesting fundamental differences in their generalization behavior and learned knowledge.
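To make the measurement concrete, the sketch below illustrates one way such a membership-advantage score could be computed under our general framing; the names (`BenchmarkItem`, `score`, `perturb`) are hypothetical placeholders rather than the paper’s actual implementation.

```python
# Minimal sketch of a perturbation-based membership-advantage measurement.
# Assumes the caller supplies a scoring function (e.g. pass@1 or accuracy)
# and a semantics-preserving perturbation (renaming, reformatting, paraphrase).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BenchmarkItem:
    prompt: str     # original benchmark input (e.g. a code-generation task)
    reference: str  # expected output used for scoring


def membership_advantage(
    items: List[BenchmarkItem],
    score: Callable[[str, str], float],  # model score on (prompt, reference)
    perturb: Callable[[str], str],       # semantics-preserving rewrite of the prompt
) -> float:
    """Average performance gap between original and perturbed inputs.

    A large positive gap suggests the model benefits from having seen the
    original items during training; a gap near zero suggests it generalizes
    to the perturbed variants equally well.
    """
    gaps = []
    for item in items:
        original_score = score(item.prompt, item.reference)
        perturbed_score = score(perturb(item.prompt), item.reference)
        gaps.append(original_score - perturbed_score)
    return sum(gaps) / len(gaps) if gaps else 0.0
```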
Interestingly, our analysis of the widely-used CVEFixes and Defects4J benchmarks, frequently suspected of data leakage in the research community, reveals unexpectedly low membership advantage scores across all models. This finding challenges prevailing concerns about these benchmarks’ validity for evaluating code LLMs and suggests that models may be effectively generalizing from these datasets rather than merely memorizing them. Our other findings provide critical insights into the generalization capabilities of code LLMs and emphasize the need for more robust frameworks for evaluating model performance, particularly in domains and tasks, such as security-related ones, where data leakage can create a false sense of reliability.