Decoding Secret Memorization in Code LLMs Through Token-Level Characterization (ICSE 2025 - Research Track)

Sat 26 April - Sun 4 May 2025 Ottawa, Ontario, Canada

Who

Yuqing Nie, Chong Wang, Kailong Wang, Guoai Xu, Guosheng Xu, Haoyu Wang

Track

ICSE 2025 Research Track

When

Fri 2 May 2025 16:15 - 16:30 at 213 - AI for Security 3. Chair(s): Tien N. Nguyen

Abstract

Code Large Language Models (LLMs) have demonstrated remarkable capabilities in generating, understanding, and manipulating programming code. However, their training process inadvertently leads to the memorization of sensitive information, posing severe privacy risks. Existing studies on memorization in LLMs primarily rely on prompt engineering techniques, which suffer from limitations such as widespread hallucination and inefficient extraction of the target sensitive information. In this paper, we present a novel approach to characterize real and fake secrets generated by Code LLMs based on token probabilities. We identify four key characteristics that differentiate genuine secrets from hallucinated ones, providing insights into distinguishing real and fake secrets. To overcome the limitations of existing works, we propose DESEC, a two-stage method that leverages token-level features derived from the identified characteristics to guide the token decoding process. DESEC consists of constructing an offline token scoring model using a proxy Code LLM and employing the scoring model to guide the decoding process by reassigning token likelihoods. Through extensive experiments on four state-of-the-art Code LLMs using a diverse dataset, we demonstrate the superior performance of DESEC in achieving a higher plausible rate and extracting more real secrets compared to existing baselines. Our findings highlight the effectiveness of our token-level approach in enabling an extensive assessment of the privacy leakage risks associated with Code LLMs.
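The abstract's second stage, in which a scoring model guides decoding by reassigning token likelihoods, can be illustrated with a toy sketch. This is not the authors' DESEC implementation: the abstract does not specify the token-level features or the combination rule, so the multiplicative reweighting, the function names (`rescore`, `guided_greedy_decode`), the stand-in model `toy_model`, and the stand-in scorer `toy_score` are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical types: a model maps a token prefix to next-token probabilities,
# and a scorer maps (prefix, candidate token, model probability) to a weight.
Model = Callable[[List[str]], Dict[str, float]]
Scorer = Callable[[List[str], str, float], float]


def rescore(probs: Dict[str, float], score_fn: Scorer, prefix: List[str]) -> Dict[str, float]:
    """Reweight each candidate's probability by its score, then renormalize.

    Assumption: a simple product p * score; the paper's actual rule may differ.
    """
    weighted = {tok: p * score_fn(prefix, tok, p) for tok, p in probs.items()}
    total = sum(weighted.values())
    if total == 0:
        # Scorer rejected every candidate: fall back to the model's distribution.
        return dict(probs)
    return {tok: w / total for tok, w in weighted.items()}


def guided_greedy_decode(model: Model, score_fn: Scorer, prefix: List[str],
                         max_new_tokens: int) -> List[str]:
    """Greedy decoding that picks the top token after score-based reweighting."""
    out = list(prefix)
    for _ in range(max_new_tokens):
        probs = model(out)
        if not probs:
            break
        adjusted = rescore(probs, score_fn, out)
        out.append(max(adjusted, key=adjusted.get))
    return out


def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Stand-in for a Code LLM: always proposes the same three tokens."""
    return {"z": 0.6, "a": 0.3, "f": 0.1}


def toy_score(prefix: List[str], token: str, prob: float) -> float:
    """Stand-in scorer: favors hexadecimal characters (a made-up feature)."""
    return 1.0 if token in "0123456789abcdef" else 0.1
```

With these stand-ins, unguided greedy decoding would pick `z` at every step, while `guided_greedy_decode(toy_model, toy_score, ["k", "="], 3)` picks `a`, because the scorer down-weights the non-hexadecimal candidate before the top token is chosen.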

Yuqing Nie, Beijing University of Posts and Telecommunications
Chong Wang, Nanyang Technological University
Kailong Wang, Huazhong University of Science and Technology, China
Guoai Xu, Harbin Institute of Technology, Shenzhen
Guosheng Xu, Key Laboratory of Trustworthy Distributed Computing and Service (MoE), Beijing University of Posts and Telecommunications
Haoyu Wang, Huazhong University of Science and Technology, China

Session Program

Fri 2 May
Displayed time zone: (GMT-04:00) Eastern Time (US & Canada)

	16:00 - 17:30	AI for Security 3 (Research Track / New Ideas and Emerging Results (NIER)) at 213
	Chair(s): Tien N. Nguyen, University of Texas at Dallas

	16:00	15m	Talk	GVI: Guided Vulnerability Imagination for Boosting Deep Vulnerability Detectors
		Security, Research Track
		Heng Yong, Nanjing University; Zhong Li; Minxue Pan, Nanjing University; Tian Zhang, Nanjing University; Jianhua Zhao, Nanjing University, China; Xuandong Li, Nanjing University
	16:15	15m	Talk	Decoding Secret Memorization in Code LLMs Through Token-Level Characterization
		Security, Research Track
		Yuqing Nie, Beijing University of Posts and Telecommunications; Chong Wang, Nanyang Technological University; Kailong Wang, Huazhong University of Science and Technology; Guoai Xu, Harbin Institute of Technology, Shenzhen; Guosheng Xu, Key Laboratory of Trustworthy Distributed Computing and Service (MoE), Beijing University of Posts and Telecommunications; Haoyu Wang, Huazhong University of Science and Technology
	16:30	15m	Talk	Are We Learning the Right Features? A Framework for Evaluating DL-Based Software Vulnerability Detection Solutions
		Security, Research Track
		Satyaki Das, University of Southern California; Syeda Tasnim Fabiha, University of Southern California; Saad Shafiq, University of Southern California; Nenad Medvidović, University of Southern California
	16:45	15m	Talk	Boosting Static Resource Leak Detection via LLM-based Resource-Oriented Intention Inference
		Security, Research Track
		Chong Wang, Nanyang Technological University; Jianan Liu, Fudan University; Xin Peng, Fudan University; Yang Liu, Nanyang Technological University; Yiling Lou, Fudan University
	17:00	15m	Talk	Weakly-supervised Log-based Anomaly Detection with Inexact Labels via Multi-instance Learning
		Security, Research Track
		Minghua He, Peking University; Tong Jia, Institute for Artificial Intelligence, Peking University, Beijing, China; Chiming Duan, Peking University; Huaqian Cai, Peking University; Ying Li, School of Software and Microelectronics, Peking University, Beijing, China; Gang Huang, Peking University
	17:15	7m	Talk	Towards Early Warning and Migration of High-Risk Dormant Open-Source Software Dependencies
		Security, New Ideas and Emerging Results (NIER)
		Zijie Huang, Shanghai Key Laboratory of Computer Software Testing and Evaluation; Lizhi Cai, Shanghai Key Laboratory of Computer Software Testing & Evaluating, Shanghai Software Center; Xuan Mao, Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai, China; Kang Yang, Shanghai Key Laboratory of Computer Software Testing and Evaluating, Shanghai Development Center of Computer Software Technology