Securing LLM-based Software Supply Chains
Abstract: LLMs are increasingly used not just for autocompletion but also for code generation from natural language and APIs, as well as for other tasks. Their output, however, is based on input data that is nominally permissively licensed but is not curated for quality, security, performance, or other factors, such as whether the code’s license is authentic. This leads to buggy, insecure, poorly performing, or inappropriately licensed output that is already poisoning the rapidly growing OSS codebase. Problematic inputs will produce problematic outputs even if all LLM hallucinations were removed; hence, stronger provenance tracking and quality assurance for LLM training and fine-tuning inputs are essential to improving the quality of generated code. We suggest approaches that use the World of Code research infrastructure to curate LLM training data by de-duplicating and automatically curating source code based on OSS-wide software supply chain properties derived from a nearly complete collection of OSS source code.
Audris Mockus is the Ericsson-Harlan D. Mills Chair Professor of Digital Archeology and Evidence Engineering in the Department of Electrical Engineering and Computer Science at the University of Tennessee, Knoxville, and a Senior Scientist at Vilnius University. He studies software developers’ culture and behavior through the recovery, documentation, and analysis of digital remains, in other words, Digital Archaeology. These digital traces reflect projections of collective and individual activity. He reconstructs reality from these projections by designing data mining methods to summarize and augment these digital traces, interactive visualization techniques to inspect, present, and control the behavior of teams and individuals, and statistical models and optimization techniques to understand the nature of individual and collective behavior.
Presentation: wocllm.pptx (1).pdf (322 KiB)
Thu 14 Sep (displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna)
13:20 - 15:20 | SATE - Software Engineering at the Era of LLMs | Room FR | Chair(s): Xin Xia (Huawei Technologies)
13:20 | 40m | Talk | Towards Better Software Quality in the Era of Large Language Models | Lingming Zhang (University of Illinois at Urbana-Champaign)
14:00 | 40m | Talk | Securing LLM-based Software Supply Chains | Audris Mockus (Vilnius University & The University of Tennessee) | File attached
14:40 | 40m | Talk | BEWARE: some of the deep learning rhetoric is misleading | Tim Menzies (North Carolina State University) | Pre-print