PiCo: Privacy-preserving Code Sanitization for Cloud-based LLMs
Cloud-based Large Language Models are now widely adopted in various software engineering tasks, such as program comprehension, bug fixing, and code completion. However, developers are highly concerned about the inadvertent leakage of sensitive data contained in their code. The exposure of such information to an untrusted third party (i.e., remote LLMs) poses significant privacy risks to developers and their affiliated institutions.
To mitigate this threat, existing code sanitization approaches rely heavily on specific keywords or regular expressions to remove fixed types of sensitive data, such as usernames, passwords, and API keys. As a result, a considerable amount of context-dependent sensitive information, such as meaningful variable names, identifiers, and even algorithmic logic, falls through the cracks.
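The limitation of keyword-based sanitization can be illustrated with a minimal sketch. The pattern list, the `regex_sanitize` helper, and the sample snippet below are hypothetical and only stand in for the kind of rule set such tools use; they are not taken from any specific sanitizer.

```python
import re

# Hypothetical rule set: redact string literals assigned to well-known secret names.
PATTERNS = [
    re.compile(r'(?i)\b(password|passwd|api_key|token|username)\b(\s*=\s*)(["\'])(.*?)\3'),
]

def regex_sanitize(code: str) -> str:
    """Replace the literal assigned to a known sensitive keyword with a placeholder."""
    for pattern in PATTERNS:
        code = pattern.sub(
            lambda m: f"{m.group(1)}{m.group(2)}{m.group(3)}<REDACTED>{m.group(3)}",
            code,
        )
    return code

# Illustrative input: one fixed-type secret and one context-dependent identifier.
snippet = (
    'api_key = "sk-12345"\n'
    "def compute_acme_merger_bonus(salary):\n"
    "    return salary * 0.37\n"
)

sanitized = regex_sanitize(snippet)
print(sanitized)
```

In this sketch the API key literal is redacted, but the function name, which hints at confidential business context, and the constant in the function body pass through unchanged.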
In this paper, we propose PICO, a fine-grained, localized code sanitization framework specifically designed for cloud-based LLM applications (services). To minimize privacy exposure, PICO leverages on-device small language models (SLMs) to understand and then selectively sanitize semantic information within the code. Unlike existing mechanisms that try to identify and filter out ``sensitive information'' from scratch, PICO follows the least-privilege principle and eliminates all semantic information that is irrelevant to the given task. In this way, PICO naturally covers heterogeneous sensitive information that cannot be labeled by pre-defined heuristics. At the same time, PICO introduces a number of novel mechanisms to achieve a good trade-off between privacy and utility and to keep the performance overhead low enough to be fully acceptable to its users.
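The least-privilege idea can be sketched in a few lines, under strong simplifying assumptions. The code below is not PICO's implementation: a hand-written `keep` set stands in for the relevance decision that PICO delegates to an on-device SLM, and the `LeastPrivilegeRenamer` class and `sanitize` helper are hypothetical names. The sketch only renames function names, parameters, and variables; it does not touch attributes, string literals, or imports, and it requires Python 3.9+ for `ast.unparse`.

```python
import ast
import builtins

class LeastPrivilegeRenamer(ast.NodeTransformer):
    """Rename every identifier that is not explicitly marked as task-relevant."""

    def __init__(self, keep):
        # Builtins such as round() are kept so the code stays meaningful.
        self.keep = set(keep) | set(dir(builtins))
        self.mapping = {}

    def _alias(self, name):
        if name in self.keep:
            return name
        if name not in self.mapping:
            self.mapping[name] = f"id_{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        self.generic_visit(node)  # continue into parameters and body
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        self.generic_visit(node)  # also visit a type annotation, if any
        return node

    def visit_Name(self, node):
        node.id = self._alias(node.id)
        return node

def sanitize(code, keep=()):
    """Return code in which only identifiers listed in `keep` retain their names."""
    tree = LeastPrivilegeRenamer(keep).visit(ast.parse(code))
    return ast.unparse(tree)

source = (
    "def compute_acme_merger_bonus(salary, rate):\n"
    "    bonus = salary * rate\n"
    "    return round(bonus, 2)\n"
)

# Assumption: the task is only about how `rate` is used, so only `rate` is kept.
result = sanitize(source, keep={"rate"})
print(result)
```

The design choice illustrated here is the inversion of the default: instead of listing what is sensitive and removing it, everything is stripped unless it is listed as needed for the task, so identifiers that no heuristic would flag are covered as well.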
Evaluation on 503 code QA tasks shows that PICO effectively protects user privacy while incurring a minimal impact on QA effectiveness (i.e., an average reduction of 13.3%). Meanwhile, adopting PICO incurs minimal time overhead (i.e., 5.61 seconds per QA task on average).