The availability of large-scale datasets, advanced model architectures, and powerful computational resources has led to effective code models that automate many software engineering activities. These datasets typically comprise billions of lines of code drawn from both open-source and private repositories, which may contain vulnerabilities, sensitive information, or code under strict licenses. A code model may memorize and reproduce such source code verbatim, raising potential security and privacy issues.
In this paper, we investigate an important question: to what extent do code models memorize their training data? We conduct an empirical study of memorization in large pre-trained code models. Our study shows that simply extracting 20,000 outputs (each of 512 tokens) from a code model can yield over 40,000 code snippets memorized from the training data. To provide a better understanding, we build a taxonomy of memorized content with 3 categories and 14 subcategories. The results show that the prompts sent to a code model affect the distribution of memorized content. We identify several key factors in memorization: model size, output length, and duplicates in the training data. Specifically, larger models memorize more of their training data, and a code model produces more memorization when it is allowed to generate longer outputs. We also find a strong positive correlation between the number of times an output occurs in the training data and the number of times it appears in the generated outputs, which suggests that removing duplicates from the training data is a potential way to reduce memorization. We then identify metrics that accurately infer whether an output contains memorization. A case study shows that memorization also exists in other models that have been deployed in practice. Finally, we offer some suggestions for dealing with memorization in code models.
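The core check behind such a study (deciding whether a generated output reproduces training data verbatim) can be approximated by an exact n-gram overlap test against the training corpus. The sketch below is illustrative only: the function names, the token-level representation, and the 6-token match threshold are assumptions for this example, not the paper's actual methodology.

```python
def build_ngram_index(corpus_token_lists, n=6):
    """Index every n-token window of the training corpus for exact lookup."""
    index = set()
    for tokens in corpus_token_lists:
        for i in range(len(tokens) - n + 1):
            index.add(tuple(tokens[i:i + n]))
    return index

def find_memorized_spans(output_tokens, corpus_index, n=6):
    """Return the n-grams of a model output that appear verbatim in the corpus."""
    return [
        tuple(output_tokens[i:i + n])
        for i in range(len(output_tokens) - n + 1)
        if tuple(output_tokens[i:i + n]) in corpus_index
    ]

# Hypothetical example: a training snippet leaking into a generated output.
corpus = [["def", "connect", "(", "host", ",", "key", ")", ":", "return", "key"]]
index = build_ngram_index(corpus, n=6)
output = ["#", "generated", "def", "connect", "(", "host", ",", "key", ")", ":"]
hits = find_memorized_spans(output, index, n=6)
```

In practice, a study at this scale would deduplicate overlapping hits into maximal matched spans and tune the threshold `n` to balance false positives (common idioms) against missed memorization.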
Wed 17 Apr (displayed time zone: Lisbon)
11:00 - 12:30 | Language Models and Generated Code 1 (Research Track / New Ideas and Emerging Results) at Maria Helena Vieira da Silva. Chair(s): Yiling Lou (Fudan University)
11:00 (15m, Talk) Modularizing while Training: a New Paradigm for Modularizing DNN Models. Research Track. Binhang Qi (Beihang University), Hailong Sun (Beihang University), Hongyu Zhang (Chongqing University), Ruobing Zhao (Beihang University), Xiang Gao (Beihang University). Pre-print

11:15 (15m, Research paper) KnowLog: Knowledge Enhanced Pre-trained Language Model for Log Understanding. Research Track. Lipeng Ma (Fudan University), Weidong Yang (Fudan University), Bo Xu (Donghua University), Sihang Jiang (Fudan University), Ben Fei (Fudan University), Jiaqing Liang (Fudan University), Mingjie Zhou (Fudan University), Yanghua Xiao (Fudan University)

11:30 (15m, Talk) FAIR: Flow Type-Aware Pre-Training of Compiler Intermediate Representations. Research Track. Changan Niu (Software Institute, Nanjing University), Chuanyi Li (Nanjing University), Vincent Ng (Human Language Technology Research Institute, University of Texas at Dallas, Richardson, TX 75083-0688), David Lo (Singapore Management University), Bin Luo (Nanjing University). Pre-print

11:45 (15m, Talk) Unveiling Memorization in Code Models. Research Track. Zhou Yang (Singapore Management University), Zhipeng Zhao (Singapore Management University), Chenyu Wang (Singapore Management University), Jieke Shi (Singapore Management University), Dongsun Kim (Kyungpook National University), DongGyun Han (Royal Holloway, University of London), David Lo (Singapore Management University)

12:00 (15m, Talk) Code Search is All You Need? Improving Code Suggestions with Code Search. Research Track. Junkai Chen (Zhejiang University), Xing Hu (Zhejiang University), Zhenhao Li (Concordia University), Cuiyun Gao (Harbin Institute of Technology), Xin Xia (Huawei Technologies), David Lo (Singapore Management University)

12:15 (7m, Talk) Expert Monitoring: Human-Centered Concept Drift Detection in Machine Learning Operations. New Ideas and Emerging Results. Joran Leest (Vrije Universiteit Amsterdam), Claudia Raibulet (Vrije Universiteit Amsterdam), Ilias Gerostathopoulos (Vrije Universiteit Amsterdam), Patricia Lago (Vrije Universiteit Amsterdam). Pre-print